thavlik closed this 4 years ago
@thavlik did you ever find out what was going on here?
I have not. Right now I'm preferring the information in the PDB files over what is supplied by ProteinNet.
@thavlik oh noes :( is there a nice place where you can download all the PDBs used by Casp12 in one compressed file?
@lucidrains if you'd like to download the original raw PDBs for CASP12 (just what CASP released, i.e. the ProteinNet test set), you can grab them from the CASP site itself, here.
@thavlik without seeing specific examples of the discrepancy (i.e. a sequence you're getting versus what ProteinNet has), it's difficult to tell what the issue is. My best guess is that you're taking the sequence of the resolved residues in a PDB file--i.e. you're concatenating a bunch of discontiguous segments. This would be incorrect and non-physical. Virtually all PDB files have missing residues and so the sequence in the PDB file has gaps in it. ProteinNet entries have the full sequences, which is the physical object that actually folds, along with mask fields that indicate which residues are missing from the structure, so that you can correctly map the sequence to the structure and also account for the missing residues in the loss function, etc. I am not 100% sure this is what's going on, but that's my best guess like I said. If you can paste a specific sequence from ProteinNet and your pipeline it would be easier to compare the two, and see if you are in fact concatenating discontiguous segments.
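To make the role of the mask concrete, here is a minimal sketch of how a full sequence and its mask map onto the resolved segments. The sequence and mask below are made up for illustration, and `resolved_segments` is a hypothetical helper, not part of ProteinNet. It assumes the mask uses `+` for resolved residues and `-` for missing ones, as in ProteinNet's text records.

```python
from itertools import groupby


def resolved_segments(primary: str, mask: str) -> list[tuple[int, str]]:
    """Return (start_index, segment) for each contiguous run of resolved residues.

    Any mask character other than '+' is treated as a missing residue.
    """
    if len(primary) != len(mask):
        raise ValueError("primary and mask must have the same length")
    segments = []
    pos = 0
    for is_resolved, run in groupby(mask, key=lambda m: m == "+"):
        length = len(list(run))
        if is_resolved:
            segments.append((pos, primary[pos:pos + length]))
        pos += length
    return segments


# Toy data (not taken from any real entry).
primary = "MKTAYIAKQRG"
mask = "--+++---+++"

segments = resolved_segments(primary, mask)
# Joining the segments drops the gap between them, which is what reading
# only the resolved residues out of a PDB file effectively gives you.
concatenated = "".join(segment for _, segment in segments)

print(segments)      # [(2, 'TAY'), (8, 'QRG')]
print(concatenated)  # TAYQRG
```

The concatenated string is shorter than `primary` and hides the three-residue gap, so comparing it directly against the full ProteinNet sequence will report a length mismatch even when the two describe the same chain.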
@thavlik without seeing specific examples of the discrepancy (i.e. a sequence you're getting versus what ProteinNet has), it's difficult to tell what the issue is. My best guess is that you're taking the sequence of the resolved residues in a PDB file--i.e. you're concatenating a bunch of discontiguous segments. This would be incorrect and non-physical.
Code with example output is in the original issue. Some structures agree (e.g. 2KKP, 2L0E) but some (e.g. 4LHR) do not have the model specified by ProteinNet. In the examples specifying model id=1, is this referencing a modified version of
You bring up a very good point about naively concatenating the residues in the PDB file together. I think we've just been operating on the assumption that a specific ID should have immediate structural correspondence between the training example and PDB file. I will investigate more this weekend.
Thanks again
In the examples specifying model id=1, is this referencing a modified version of ?
After more digging, this turned out to be correct. Many PDBs have waters and other garbage residues included in Model0ChainA. The ProteinNet record will reference a Model1ChainA, which does not exist in the PDB. It can be derived by normalizing Model0ChainA according to the primary sequence / mask. Here is a concrete example with 4JRN_1_A: the PDB primary sequence, followed by ProteinNet's primary sequence and mask:
SELVFEKADSGCVIGKRILAHMQELENSERLDRILTVAAWPPDVPKRFVSVTTGETRTLVRGAPLGSGGFATVYEATDVETNEELAVKVFMSEKEPTDETMLDLQRESSCYRNFSLAKTAKDAQESCRFMVPSDVVMLEGQPASTEVVIGLTTRWVPNYFLLMMRAEADMSKVISWVFGDASVNKSEFGLVVRMYLSSQAIKLVANVQAQGIVHTDIKPANFLLLKDGRLFLGDFGTYRINNSVGRGTPGYEPPERPGITYTFPTDAWQLGITLYCIWCKERPTPADGIWDYLHFADCPSTPELVQDLIRSLLNRDPQKRMLPLQALETAAFKEMDSVVKGAAQNFEQQ
GAHMSELVFEKADSGCVIGKRILAHMQEQIGQPQALENSERLDRILTVAAWPPDVPKRFVSVTTGETRTLVRGAPLGSGGFATVYEATDVETNEELAVKVFMSEKEPTDETMLDLQRESSCYRNFSLAKTAKDAQESCRFMVPSDVVMLEGQPASTEVVIGLTTRWVPNYFLLMMRAEADMSKVISWVFGDASVNKSEFGLVVRMYLSSQAIKLVANVQAQGIVHTDIKPANFLLLKDGRLFLGDFGTYRINNSVGRAIGTPGYEPPERPFQATGITYTFPTDAWQLGITLYCIWCKERPTPADGIWDYLHFADCPSTPELVQDLIRSLLNRDPQKRMLPLQALETAAFKEMDSVVKGAAQNFEQQEHLHTE
----++++++++++++++++++++++++-------++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++--+++++++++++----++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++------
Put each of these on its own unwrapped line in a text editor to visualize the relationship. The length of the normalized chain should equal the number of + signs in the mask.
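The same check can be sketched in code. This is only a rough illustration with toy data: `normalize_chain` and `check_chain` are hypothetical helpers, the residue list stands in for whatever your PDB parser returns for Model0ChainA, and only the 20 standard amino acids are kept (so modified residues would be dropped too, which may or may not be what you want).

```python
# Three-letter to one-letter codes for the 20 standard amino acids.
THREE_TO_ONE = {
    "ALA": "A", "ARG": "R", "ASN": "N", "ASP": "D", "CYS": "C",
    "GLN": "Q", "GLU": "E", "GLY": "G", "HIS": "H", "ILE": "I",
    "LEU": "L", "LYS": "K", "MET": "M", "PHE": "F", "PRO": "P",
    "SER": "S", "THR": "T", "TRP": "W", "TYR": "Y", "VAL": "V",
}


def normalize_chain(residue_names):
    """Drop waters and other non-standard residues; return a one-letter sequence."""
    return "".join(THREE_TO_ONE[name] for name in residue_names if name in THREE_TO_ONE)


def apply_mask(primary, mask):
    """Keep only the residues of the full sequence that the mask marks with '+'."""
    if len(primary) != len(mask):
        raise ValueError("primary and mask must have the same length")
    return "".join(aa for aa, m in zip(primary, mask) if m == "+")


def check_chain(residue_names, primary, mask):
    """Compare a normalized PDB chain against the masked ProteinNet sequence."""
    observed = normalize_chain(residue_names)
    expected = apply_mask(primary, mask)
    return {
        "expected_length": mask.count("+"),
        "observed_length": len(observed),
        "sequences_match": observed == expected,
    }


# Toy chain (not a real entry): six resolved residues followed by two waters.
chain = ["THR", "ALA", "TYR", "GLN", "ARG", "GLY", "HOH", "HOH"]
report = check_chain(chain, primary="MKTAYIAKQRG", mask="--+++---+++")
print(report)
# {'expected_length': 6, 'observed_length': 6, 'sequences_match': True}
```

If the lengths agree but the sequences do not, the chain probably is not the one the record refers to. If the lengths disagree, the first thing to check is whether non-protein residues are still being counted.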
HUGE thank you, @alquraishi. ProteinNet is a significant contribution to the community. I'll open another issue if there is any additional confusion.
@thavlik @alquraishi thank you both!
I have some code that fetches mmCIF files for each entry in CASP11 using BioPython. A substantial proportion of examples fail the various checks, though some pass. Many of them are missing the corresponding model, and most that have the given model disagree on primary sequence length. Perhaps there is an obvious explanation for this, and I simply overlooked it. Subtracting one from model_id allows most of the models to be resolved, but many of the primary sequences have a significant length mismatch. Most of the files only have the one model, so it's unclear what exactly is going on here.
Sampled output:
Related issue https://github.com/aqlaboratory/proteinnet/issues/13
Thanks for all the good work.