bjornwallner / DockQ

DockQ is a single continuous quality measure for Protein, Nucleic Acids and Small Molecule Docking Models
MIT License

Chain size limitations? #10

Closed (gtauriello closed this issue 5 months ago)

gtauriello commented 2 years ago

I am having trouble running DockQ on moderately large homo-dimers. Is it a known issue that the tools here fail when a chain has many residues?

I ran DockQ successfully for most of the models and references in a given benchmark set, but the largest files failed. The smallest failing case I found was comparing 4u59_2_model.pdb with 4u59_2.pdb from the attached 4u59_2_files.zip (i.e. the simple call ./DockQ.py 4u59_2_model.pdb 4u59_2.pdb).

Here the model covers more residues than the reference, so ./DockQ.py 4u59_2.pdb 4u59_2.pdb works (3076 residues in 4u59_2) while ./DockQ.py 4u59_2_model.pdb 4u59_2_model.pdb fails (3294 residues in 4u59_2_model).
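To check whether a given file falls in the failing size range, the residue count per chain can be tallied directly from the ATOM records. The following is a minimal sketch, assuming standard fixed-column PDB formatting (chain ID in column 22, residue number plus insertion code in columns 23-27); the function name `count_residues_per_chain` and the example records are hypothetical and not part of DockQ.

```python
from collections import defaultdict


def count_residues_per_chain(pdb_lines):
    """Count distinct residues per chain from PDB ATOM records.

    Assumes fixed-column PDB format; HETATM records are ignored.
    """
    residues = defaultdict(set)
    for line in pdb_lines:
        if line.startswith("ATOM"):
            chain_id = line[21]      # column 22
            res_key = line[22:27]    # resSeq (cols 23-26) + insertion code (col 27)
            residues[chain_id].add(res_key)
    return {chain: len(keys) for chain, keys in residues.items()}


# Made-up records for illustration only (not taken from 4u59).
example = [
    "ATOM      1  N   MET A   1      11.104  13.207   2.100  1.00 20.00           N",
    "ATOM      2  CA  MET A   1      12.560  13.300   2.400  1.00 20.00           C",
    "ATOM      3  N   GLY A   2      13.000  14.000   3.000  1.00 20.00           N",
    "ATOM      4  N   MET B   1      21.104  23.207  12.100  1.00 20.00           N",
]
counts = count_residues_per_chain(example)
print(counts)
```

For a real file, the lines can be read with `open(path).readlines()` and passed to the same function. Whether the failure depends on the per-chain count or the total count is not established here.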

The traceback of the error looks as follows when run with Python 3:

Traceback (most recent call last):
  File ".../DockQ.py", line 730, in <module>
    main()    
  File ".../DockQ.py", line 658, in main
    info=calc_DockQ(model,native,use_CA_only=use_CA_only,capri_peptide=capri_peptide) #False):
  File ".../DockQ.py", line 112, in calc_DockQ
    fnat_out = os.popen(cmd_fnat).read()
  File ".../python3.9/codecs.py", line 322, in decode
    (result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xef in position 852: invalid continuation byte

and as follows with Python 2:

Traceback (most recent call last):
  File "../DockQ.py", line 730, in <module>
    main()    
  File "../DockQ.py", line 658, in main
    info=calc_DockQ(model,native,use_CA_only=use_CA_only,capri_peptide=capri_peptide) #False):
  File "../DockQ.py", line 118, in calc_DockQ
    assert fnat!=-1, "Error running cmd: %s\n" % (cmd_fnat)
AssertionError: Error running cmd: .../fnat 4u59_2_model.pdb 4u59_2_model.pdb 5 -all
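The Python 3 UnicodeDecodeError arises because os.popen decodes the child's output as text, so any non-UTF-8 byte in the output raises before the result can be inspected. A way to look at what the binary actually printed is to capture raw bytes and decode leniently. This is a sketch only: `run_and_decode` is a hypothetical helper, and the command below is a stand-in that emits one invalid byte rather than the real fnat call.

```python
import subprocess
import sys


def run_and_decode(cmd):
    """Run cmd, capture stdout as bytes, and decode without raising.

    Invalid UTF-8 bytes become U+FFFD so the rest of the output stays readable.
    """
    proc = subprocess.run(cmd, stdout=subprocess.PIPE, stderr=subprocess.PIPE)
    text = proc.stdout.decode("utf-8", errors="replace")
    return proc.returncode, text


# Stand-in for the real binary: prints a line containing the invalid byte 0xef.
cmd = [sys.executable, "-c", "import sys; sys.stdout.buffer.write(b'Fnat \\xef\\n')"]
returncode, text = run_and_decode(cmd)
print(returncode, repr(text))
```

On POSIX systems, subprocess reports a child killed by a signal as a negative return code, so a segfault in the real binary would show up there instead of being hidden behind the decode error.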

The Python 2 error points to a problem in the fnat binary, which indeed prints garbled characters before segfaulting. Here are the last few lines of the output of fnat 4u59_2_model.pdb 4u59_2_model.pdb 5:

NATIVE: 25259?b 1629C 0.107644
Fnat 85805 13756 6.237642
Fnonnat -72049 13756 -5.237642
Segmentation fault
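The numbers in this output can be checked for internal consistency. The sketch below assumes that the two integers on each line are a contact count followed by a total, and that the last value is their ratio; that reading is inferred from the output and not confirmed by the fnat source.

```python
# Values copied from the "Fnat" line of the output above.
correct, total = 85805, 13756

fnat = correct / total
# The "Fnonnat" numerator in the output equals total - correct.
fnonnat_numerator = total - correct
fnonnat = fnonnat_numerator / total

print(f"Fnat    {correct} {total} {fnat:.6f}")
print(f"Fnonnat {fnonnat_numerator} {total} {fnonnat:.6f}")
```

Under that reading, the first count exceeds the total, which gives a fraction above 1 and a negative complementary count. Neither is possible for a real fraction of contacts, so the counters themselves appear to be corrupted before the segfault occurs.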

As an additional note, I observed plenty of compile-time warnings when compiling with GCC 10.3.0. They may be worth checking, as they could indicate overflows or similar problems.

The failure is not specific to these files: I could reproduce it with moderately large homo-dimers downloaded from the PDB (e.g. https://files.rcsb.org/download/6EQO.pdb).

Large complexes and multi-domain proteins are interesting and challenging prediction problems, so it would be good to fix this issue so that DockQ can be applied to benchmarks for them.