UCLOrengoGroup / cath-tools

Protein structure comparison tools such as SSAP and SNAP
http://cath-tools.readthedocs.io
GNU General Public License v3.0
57 stars 14 forks source link

cath-superpose segfaults on mismatching residues in alignment #55

Closed sillitoe closed 6 years ago

sillitoe commented 6 years ago

Running:

$ cath-superpose --fasta-aln-infile=data/cora.fa --pdb-infile=$PDB/1cukA03 --pdb-infile=$PDB/1oaiA00 --pdb-infile=$PDB/1oksA00

Gives:

2017-12-21 21:42:37.024806 [cath-superpose|warning] Ignoring residue A:55 whilst extracting a protein structure from PDB file data because it doesn't have all of N, CA and C atoms (or because of duplicate residue IDs)
./run.sh: line 11: 23156 Segmentation fault      (core dumped) cath-superpose --fasta-aln-infile=data/cora.fa --pdb-infile=$PDB/1cukA03 --pdb-infile=$PDB/1oaiA00 --pdb-infile=$PDB/1oksA00

Details appended below - will make more sense if you have local access to...

/cath/homes/ucbcisi/git/cath-superpose-segfault

Looks like segfault is due to a mismatch between ATOM records and FASTA alignment

Comparing the sequence 1oksA00 based on two different alignments (from SSAP and CORA)

$ grep -A1 1oks data/*.fa
data/cora.fa:>domain|1oksA00|4_1_0
data/cora.fa------ASRSVIRSIIKSSRLEEDRKRYLTLLDDIKGANDLA---KFH-QLVKII--KHHHHH--------
--
data/ssap.fa:>domain|1oksA00|3_3_0
data/ssap.fa-ASRSVIRSIIKSS--RLEEDRKRYLTLLDDIKGANDLAKFHQLVKIIKHHHH

Looks like the CORA sequence has an extra residue H on the end.

Looking at the ATOM records:

$ grep ATOM /cath/data/v3_3_0/pdb/1oksA00 | tail -n 5
ATOM    436  ND1 HIS A  54      16.809  24.511   0.905  1.00 72.88
ATOM    437  CD2 HIS A  54      16.431  22.362   0.766  1.00 72.22
ATOM    438  CE1 HIS A  54      17.652  24.022   0.011  1.00 77.93
ATOM    439  NE2 HIS A  54      17.443  22.720  -0.093  1.00 77.13
ATOM    440  N   HIS A  55      13.163  25.810   3.812  1.00 46.88

The final residue only has one atom.

So the fix is probably going to be in my code, though it might be nice to avoid the segfault.

$ which cath-superpose
/opt/local/apps/CentOS6-x86_64/bin/cath-superpose

$ cath-superpose -v
============
cath-superpose v0.15.3-0-g45dc1b5 [2017-09-20]
============

Superpose protein structures using an existing alignment

Build
-----
   Sep 20 2017 21:17:59
   GNU C++ version 4.9.2 20150212 (Red Hat 4.9.2-6)
   GNU libstdc++ version 20150212
   Boost 1_60
tonyelewis commented 6 years ago

Thanks for this excellent report.

I've just pushed 375fad015700fa969c2677fec6f06045b685e047, which improves the handling of this situation so that it now throws an informative exception and stops rather than just crashing:

2017-12-22 11:21:06.077030 [cath-superpose|warning] Ignoring residue A:55 whilst extracting a protein structure from PDB file data because it doesn't have all of N, CA and C atoms (or because of duplicate residue IDs)
2017-12-22 11:21:06.084788 [cath-superpose|error  ] Problem building alignment (and spanning tree) : Cannot read FASTA alignment [Whilst aligning a sequence string to a list of amino acids, could not find match for 'H' at character 62 in sequence (context in sequence: "KII--KHHHH*H*--------")]

...and I've put this in my ~/bin.

(I've also pushed ea9f1c7bc471b8e441543012dbdc860674bb3df8 with a TODO comment to consider making this code more robust, so that if it manages to match most residues, it warns and proceeds with what it's got).


But I also think it's unreasonable for cath-tools to reject a sequence that contains a residue for which it's previously seen records in the PDB. I'll look into that.

tonyelewis commented 6 years ago

I believe this is now resolved in 5284fec26e041f3054daa9d350248e7d64b277c3. Please let me know if you have more trouble.