CATH-summer-2017 / django_CATH

A data browser for the CATH database.

Discrepancy between biopython and modeller #1

Closed. shouldsee closed this issue 7 years ago

shouldsee commented 7 years ago

To remove the dependency on Modeller, I have been trying to implement the 'packedness' algorithm using Biopython as the PDB parser. While comparing the two implementations, I found that some structures in CATH appear to be deposited as overlapping structures.

For example, 5cdzA01 contains 3329 lines, presumably implying 3329 atoms (the count Biopython also reports), whereas Modeller reports only 1669 atoms, roughly half as many. To summarise:

| | biopython | modeller | ratio |
|---|---|---|---|
| residue_count | 217 | 217 | 1.0 |
| Atom_count | 3329 | 1669 | 1.99 |
| nbpair_count | 1,177,160 | 290,380 | 4.05 |
| conclusion | with H-atoms | without H-atoms | |

By inspection, 1669 seems to describe the structure best, and I suspect the structure was deposited as an overlapping structure. For now I am patching this by checking the atom/residue ratio against a cutoff of 11 (a residue should not contain more than 11 atoms on average) and, when the cutoff is exceeded, dividing the atom count by 2.0 and the nbpair count by 4.0. But I feel this should not be a permanent patch.
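The temporary patch described above could look roughly like the sketch below. The function name `correct_counts` and its argument names are hypothetical; only the cutoff of 11 atoms per residue and the divisors 2.0 and 4.0 come from the description.

```python
def correct_counts(atom_count, residue_count, nbpair_count, cutoff=11.0):
    """Hypothetical helper illustrating the atom/residue ratio patch.

    If the average number of atoms per residue exceeds `cutoff`, the atom
    count is divided by 2.0 and the pair count by 4.0. Otherwise both
    counts are returned unchanged (as floats).
    """
    if residue_count <= 0:
        raise ValueError("residue_count must be positive")
    if atom_count / residue_count > cutoff:
        return atom_count / 2.0, nbpair_count / 4.0
    return float(atom_count), float(nbpair_count)


# Counts reported by Biopython for 5cdzA01: 3329 / 217 is about 15.3 atoms
# per residue, which is above the cutoff, so the correction is applied.
corrected_atoms, corrected_pairs = correct_counts(3329, 217, 1177160)
```

For 5cdzA01 this gives 1664.5 atoms and 294,290 pairs, which only approximates Modeller's 1669 and 290,380. That gap is one more reason to treat the ratio check as a stopgap.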

@sillitoe @nataliedawson @tonyelewis Any thoughts on possible causes?

BTW, the cutoff for non-bonding interactions is 15.0 Å, while that for bonding interactions is 3.5 Å.
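How the two cutoffs are combined in the packedness calculation is not spelled out here, so the following is only a sketch under an assumption: a pair of atoms counts as "bonded" when its distance is at most 3.5 Å, and as "non-bonded" when its distance is above 3.5 Å but at most 15.0 Å. The function name `count_pairs` and the toy coordinates are hypothetical.

```python
import math
from itertools import combinations


def count_pairs(coords, bonded_cutoff=3.5, nonbonded_cutoff=15.0):
    """Return (bonded, nonbonded) pair counts for a list of (x, y, z) tuples.

    Assumption (not confirmed by the thread): distance <= bonded_cutoff is
    'bonded'; bonded_cutoff < distance <= nonbonded_cutoff is 'non-bonded'.
    """
    bonded = nonbonded = 0
    # Brute-force loop over all unordered pairs; fine for a small domain.
    for a, b in combinations(coords, 2):
        d = math.dist(a, b)
        if d <= bonded_cutoff:
            bonded += 1
        elif d <= nonbonded_cutoff:
            nonbonded += 1
    return bonded, nonbonded


# Toy coordinates in Å (not taken from 5cdzA01).
toy = [(0.0, 0.0, 0.0), (3.0, 0.0, 0.0), (6.0, 0.0, 0.0), (20.0, 0.0, 0.0)]
bonded, nonbonded = count_pairs(toy)
```

Because every atom is compared with every other atom, roughly doubling the number of atoms in the same volume roughly quadruples the number of pairs within a cutoff, which is consistent with the nbpair_count ratio of 4.05 reported above.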

UPDATE: After inspecting the Modeller-cleaned PDB 5cdzA01_mod, I found that hydrogen atoms were causing the difference. Modeller strips H-atoms automatically, whereas Biopython does not.
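One way to make the two parsers agree is to drop hydrogen records before counting. The sketch below uses only the standard library and is an illustration, not the routine that was eventually committed. It assumes the element symbol is present in columns 77-78 of each ATOM/HETATM record, as the PDB format specifies; records with a blank element column are kept unchanged. `strip_hydrogens` and `_atom_line` are hypothetical names.

```python
def strip_hydrogens(pdb_lines):
    """Return the lines with hydrogen ATOM/HETATM records removed.

    Assumption: the element symbol sits in columns 77-78 (0-based slice
    76:78). Lines that are too short or have a blank element are kept.
    """
    kept = []
    for line in pdb_lines:
        if line.startswith(("ATOM", "HETATM")):
            element = line[76:78].strip().upper()
            if element == "H":
                continue  # skip hydrogen atoms
        kept.append(line)
    return kept


def _atom_line(serial, name, element):
    """Hypothetical helper: build a fixed-column ATOM record for the demo."""
    return (
        f"ATOM  {serial:5d} {name:<4s} ALA A{1:4d}    "
        f"{0.0:8.3f}{0.0:8.3f}{0.0:8.3f}{1.0:6.2f}{0.0:6.2f}"
        f"          {element:>2s}"
    )


# Toy records: three heavy atoms plus one hydrogen, followed by END.
lines = [
    _atom_line(1, " N", "N"),
    _atom_line(2, " CA", "C"),
    _atom_line(3, " H", "H"),
    _atom_line(4, " C", "C"),
    "END",
]
heavy = strip_hydrogens(lines)
```

Counting atoms on the filtered lines rather than the raw file should then exclude the hydrogens that Modeller was already dropping.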

shouldsee commented 7 years ago

Resolved by adding the tst.domutil.pdbutil.sanitise() routine in c33c79c.