Closed by sillitoe 6 years ago
Thanks for this excellent report.
I've just pushed 375fad015700fa969c2677fec6f06045b685e047, which improves the handling of this situation so that it now throws an informative exception and stops rather than just crashing:
2017-12-22 11:21:06.077030 [cath-superpose|warning] Ignoring residue A:55 whilst extracting a protein structure from PDB file data because it doesn't have all of N, CA and C atoms (or because of duplicate residue IDs)
2017-12-22 11:21:06.084788 [cath-superpose|error ] Problem building alignment (and spanning tree) : Cannot read FASTA alignment [Whilst aligning a sequence string to a list of amino acids, could not find match for 'H' at character 62 in sequence (context in sequence: "KII--KHHHH*H*--------")]
...and I've put this in my ~/bin.
(I've also pushed ea9f1c7bc471b8e441543012dbdc860674bb3df8 with a TODO comment to consider making this code more robust, so that if it manages to match most residues, it warns and proceeds with what it's got).
But I also think it's unreasonable for cath-tools to reject a sequence that contains a residue for which it's previously seen records in the PDB. I'll look into that.
I believe this is now resolved in 5284fec26e041f3054daa9d350248e7d64b277c3. Please let me know if you have more trouble.
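For readers following along, the kind of check behind the "could not find match" error above can be sketched in Python. This is not the cath-tools implementation; the function name `align_sequence_to_residues`, the gap characters, and the strict in-order matching rule are all assumptions made for illustration.

```python
def align_sequence_to_residues(sequence, residues, gap_chars="-."):
    """Map each character of a gapped sequence onto a list of residue letters.

    Hypothetical helper: returns one entry per character of `sequence`,
    either the index of the matched residue or None for a gap.
    Assumes residues must be consumed strictly in order, one per letter.
    """
    mapping = []
    next_residue = 0
    for pos, char in enumerate(sequence):
        if char in gap_chars:
            mapping.append(None)
            continue
        if next_residue >= len(residues) or residues[next_residue] != char:
            # Build a short context window with the offending character starred
            lo = max(0, pos - 10)
            hi = min(len(sequence), pos + 11)
            context = sequence[lo:pos] + "*" + char + "*" + sequence[pos + 1:hi]
            raise ValueError(
                f"could not find match for '{char}' at character {pos} "
                f'in sequence (context in sequence: "{context}")'
            )
        mapping.append(next_residue)
        next_residue += 1
    return mapping
```

If the structure has lost a residue (for example because it was skipped for lacking backbone atoms) while the alignment still contains its letter, the residue list runs out before the sequence does and the function raises instead of reading past the end.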
Running:
Gives:
Details appended below - will make more sense if you have local access to...
Looks like segfault is due to a mismatch between ATOM records and FASTA alignment
Comparing the sequence 1oksA00 based on two different alignments (from SSAP and CORA)
Looks like the CORA sequence has an extra residue H on the end.
Looking at the ATOM records:
The final residue only has one atom.
So the fix is probably going to be in my code, though it might be nice to avoid the segfault.
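A quick way to spot this kind of problem before running an alignment is to scan the ATOM records for residues that lack any of the N, CA and C backbone atoms. The following is a minimal sketch, assuming standard fixed-column PDB ATOM records; the function name `residues_missing_backbone` is hypothetical, and restricting the scan to ATOM (not HETATM) lines is an assumption rather than a statement about what cath-tools does.

```python
BACKBONE = {"N", "CA", "C"}

def residues_missing_backbone(pdb_lines):
    """Return [(residue_key, missing_atom_names), ...] for incomplete residues.

    residue_key is (chain_id, residue_number, insertion_code, residue_name).
    Hypothetical helper; only ATOM records are considered.
    """
    atoms_by_residue = {}  # dicts preserve insertion order, so file order is kept
    for line in pdb_lines:
        if not line.startswith("ATOM"):
            continue
        # Fixed columns from the PDB format (converted to 0-based slices)
        atom_name = line[12:16].strip()
        res_name = line[17:20].strip()
        chain_id = line[21]
        res_seq = int(line[22:26])
        icode = line[26].strip()
        key = (chain_id, res_seq, icode, res_name)
        atoms_by_residue.setdefault(key, set()).add(atom_name)
    return [
        (key, sorted(BACKBONE - atoms))
        for key, atoms in atoms_by_residue.items()
        if not BACKBONE <= atoms
    ]
```

Any residue this reports is one that a backbone-based reader might drop, so its letter should also be dropped from (or reconciled with) the FASTA alignment passed alongside the structure.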