IDs not matching PDB characteristics (chain number and ID, for instance)

aqlaboratory / proteinnet

Standardized data set for machine learning of protein structure

MIT License

IDs not matching PDB characteristics (chain number and ID, for instance) #23

Closed n4ndoz closed 4 years ago

n4ndoz commented 4 years ago

Hi, first of all, thank you for the huge effort of putting ProteinNet together. It is a great contribution to the community indeed.

I am now trying to fetch from the PDB the beta-carbon coordinates for each protein in ProteinNet for my algorithm. The problem is that I cannot do so in a straightforward way via Biopython, since some IDs don't indicate which chain ProteinNet used.

'1IQ8_d1iq8b4', '1S2M_d1s2ma2', '1IZ6_d1iz6b2',

These three proteins, for example, have "strange" descriptors in place of the chain ID and number. What do they mean? How can I fetch the same sequence and structure as given in ProteinNet? I am well aware of the problems with mmCIF files and of the sequence mismatches. Is there any way to solve this? I really need to implement this change.

Thanks

alquraishi commented 4 years ago

Hi, the descriptors are described on this page.

jessevig commented 4 years ago

I also wanted to thank you for this very useful library. I have the same question as posted above, and unfortunately I couldn't find the answer in the linked documentation, but please let me know if I'm missing something. I found that about 28,000 of the records in the Training 100 file (this was for CASP12) have the unexpected format described above for the ID field, e.g. '1IQ8_d1iq8b4', while the remainder have the expected format, e.g. '1Z5O_1_A'. Do you have any insight on this? Many thanks in advance.
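For anyone who wants to reproduce a count like this, here is a minimal sketch. It assumes the ProteinNet text format, where each record has an `[ID]` header line followed by the identifier on the next line. It also assumes that the "expected" format is `<PDB ID>_<model number>_<chain ID>`. The helper names (`read_ids`, `count_id_formats`) and the toy records are made up for illustration and are not part of ProteinNet.

```python
import re
from collections import Counter

# Assumed shape of the "expected" ID: PDB ID, model number, chain ID.
PDB_STYLE = re.compile(r"^[0-9A-Za-z]{4}_\d+_[0-9A-Za-z]+$")


def read_ids(lines):
    """Yield the line that follows each '[ID]' header."""
    lines = iter(lines)
    for line in lines:
        if line.strip() == "[ID]":
            # Default to "" so a truncated file does not raise inside the generator.
            yield next(lines, "").strip()


def count_id_formats(ids):
    """Count IDs that match the PDB-style pattern versus everything else."""
    counts = Counter()
    for protein_id in ids:
        key = "pdb_style" if PDB_STYLE.match(protein_id) else "other"
        counts[key] += 1
    return counts


# Toy input: two fake records using IDs quoted in this thread.
# The sequences are placeholders, not real ProteinNet data.
sample = """[ID]
1Z5O_1_A
[PRIMARY]
MKV
[ID]
1IQ8_d1iq8b4
[PRIMARY]
GSH
""".splitlines()

print(count_id_formats(read_ids(sample)))
```

To run it on a real file, pass an open file handle instead of `sample`, e.g. `count_id_formats(read_ids(open(path)))`, where `path` points at your local copy of the training file.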

alquraishi commented 4 years ago

Hi @jessevig, thanks for pointing this out. I had missed what the issue was. You're right, there's a second type of identifier, used for entries that come from ASTRAL rather than the PDB. ASTRAL is a database of individual protein domains, as opposed to full-length proteins. I found that combining both is helpful in training models. In your example, "1IQ8" is the PDB ID of the original chain from which the ASTRAL domain is derived, and "d1iq8b4" is the ASTRAL ID.
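The two identifier types can be told apart programmatically. The sketch below is an illustration, not code from ProteinNet. It assumes the PDB-type ID is `<PDB ID>_<model number>_<chain ID>` and that the ASTRAL ID follows the usual seven-character SCOP domain convention: `d`, the four-character PDB code, one chain character (`_` for no chain, `.` for several chains), and one domain character (`_` when the domain covers the whole chain). The `ProteinNetId` type and `parse_proteinnet_id` function are hypothetical names.

```python
import re
from typing import NamedTuple, Optional


class ProteinNetId(NamedTuple):
    source: str               # "pdb" or "astral"
    pdb_id: str               # e.g. "1IQ8"
    model: Optional[int]      # only for PDB-type IDs
    chain: Optional[str]      # chain character, if a single chain is named
    astral_id: Optional[str]  # e.g. "d1iq8b4", only for ASTRAL-type IDs
    domain: Optional[str]     # domain character, only for ASTRAL-type IDs


# Assumed PDB-type layout: <PDB ID>_<model number>_<chain ID>
PDB_RE = re.compile(r"^(?P<pdb>[0-9A-Za-z]{4})_(?P<model>\d+)_(?P<chain>[0-9A-Za-z]+)$")

# Assumed ASTRAL-type layout: <PDB ID>_d<pdb code><chain char><domain char>
ASTRAL_RE = re.compile(
    r"^(?P<pdb>[0-9A-Za-z]{4})_"
    r"(?P<sid>d[0-9a-z]{4}(?P<chain>[0-9A-Za-z_.])(?P<dom>[0-9a-z_]))$"
)


def parse_proteinnet_id(protein_id: str) -> ProteinNetId:
    """Split a ProteinNet ID into its parts, or raise ValueError if unrecognized."""
    m = PDB_RE.match(protein_id)
    if m:
        return ProteinNetId("pdb", m["pdb"], int(m["model"]), m["chain"], None, None)
    m = ASTRAL_RE.match(protein_id)
    if m:
        # "_" (no chain) and "." (several chains) do not name a single chain.
        chain = m["chain"] if m["chain"] not in "_." else None
        domain = m["dom"] if m["dom"] != "_" else None
        return ProteinNetId("astral", m["pdb"], None, chain, m["sid"], domain)
    raise ValueError(f"Unrecognized ProteinNet ID: {protein_id!r}")


for example in ["1Z5O_1_A", "1IQ8_d1iq8b4", "1S2M_d1s2ma2", "1IZ6_d1iz6b2"]:
    print(parse_proteinnet_id(example))
```

Note that the chain character in an ASTRAL ID is returned as written (lowercase `b` for `d1iq8b4`). Mapping it to a chain label in the PDB file, and recovering the exact residue range of the domain, requires the ASTRAL/SCOP domain definitions and is not something the ID alone provides.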

jessevig commented 4 years ago

Got it, thanks for the explanation!