aqlaboratory / proteinnet

Standardized data set for machine learning of protein structure
MIT License

Unmasked zeroed tertiary data in text-based CASP7 #30

Open memoryleak47 opened 2 years ago

memoryleak47 commented 2 years ago

While implementing an RGN for a university project, we stumbled upon a few apparent irregularities in the text-based CASP7 dataset provided here. Specifically, quite a few atoms in the tertiary data are positioned at (0, 0, 0) even though the mask is +, i.e. the atom is considered 'valid'.

Example taken from CASP7/validation.

[ID]
70#1MLI_1_A
[PRIMARY]
...
[EVOLUTIONARY]
...
[TERTIARY]
0   1562.5  0   0   1571.2  0   0   1458.2  0   0   1371.3  0   0   1078.5  0   0   953.8 ...
0   1363.   0   0   1492.5  0   0   1226.9  0   0   1303.3  0   0   1229.4  0   0   1255.1 ...
0   4743.1  0   0   4394.3  0   0   4152.2  0   0   3792.3  0   0   3597.2  0   0   3246.3 ...
[MASK]
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++                           

In this example two thirds of the atoms are positioned at (0, 0, 0). Is this a bug, or am I simply misinterpreting the given data somehow?

Thanks in advance!

jonathanking commented 2 years ago

I believe there are a handful of structures that only contain alpha-carbon information. If you inspect the RCSB entry, you'll find this is the case for this structure. You can also see the pattern of (N, Calpha, C) in the tertiary data, where N and C are missing.
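The (N, Calpha, C) pattern can be checked programmatically. The following is a minimal sketch, assuming the three [TERTIARY] rows hold the x, y, and z coordinates respectively, with three backbone atoms per residue in N, CA, C order, as the excerpt above suggests. The helper names `parse_tertiary` and `zeroed_atoms` are made up for illustration and are not part of ProteinNet.

```python
def parse_tertiary(lines):
    """Turn the three [TERTIARY] rows (x, y, z) into per-residue backbone atoms.

    Assumption: each row holds 3 * L values ordered N, CA, C for every residue.
    """
    rows = [[float(v) for v in line.split()] for line in lines]
    if len(rows) != 3 or len({len(r) for r in rows}) != 1 or len(rows[0]) % 3:
        raise ValueError("expected three rows of equal length divisible by 3")
    coords = list(zip(*rows))  # one (x, y, z) tuple per atom
    # Group consecutive atoms in threes: one [N, CA, C] list per residue.
    return [coords[i:i + 3] for i in range(0, len(coords), 3)]


def zeroed_atoms(residues, names=("N", "CA", "C")):
    """Count, per backbone atom name, how many residues have it at exactly (0, 0, 0)."""
    counts = {name: 0 for name in names}
    for residue in residues:
        for name, xyz in zip(names, residue):
            if xyz == (0.0, 0.0, 0.0):
                counts[name] += 1
    return counts


# First two residues of the 70#1MLI_1_A excerpt quoted in the issue.
tertiary = [
    "0 1562.5 0 0 1571.2 0",
    "0 1363. 0 0 1492.5 0",
    "0 4743.1 0 0 4394.3 0",
]
residues = parse_tertiary(tertiary)
print(zeroed_atoms(residues))  # {'N': 2, 'CA': 0, 'C': 2}
```

If every residue reports N and C as zeroed while CA never is, the record is most likely one of the alpha-carbon-only structures.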

Hopefully Mohammed can correct me if I am mistaken, but I hope my comment can help for now.

memoryleak47 commented 2 years ago

I see! So sometimes individual atoms can be missing in spite of a "+" mask.

But can we assume that each (0, 0, 0) atom is in fact just missing data? Or is there some other procedure to know which atoms are valid?

jonathanking commented 2 years ago

Correct. I believe the mask is on the residue level and not the atom level.

Yes, I would think it is reasonable to assume that, and it is most likely described somewhere in the documentation here.
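Under that assumption, an atom-level mask can be derived from the residue-level mask. The sketch below is a heuristic based on this thread, not documented ProteinNet behaviour: it treats an atom as usable only if its residue is masked "+" and its coordinates are not exactly (0, 0, 0). The function name `derive_atom_mask` is hypothetical, and the third residue in the example is invented purely to show the "-" case.

```python
def derive_atom_mask(residues, residue_mask):
    """Return one list of three booleans (N, CA, C) per residue.

    Heuristic (an assumption, not documented behaviour): an atom is usable
    only if its residue is masked '+' and it is not located at (0, 0, 0).
    """
    if len(residues) != len(residue_mask):
        raise ValueError("mask length must match the number of residues")
    return [
        [flag == "+" and any(c != 0.0 for c in xyz) for xyz in residue]
        for residue, flag in zip(residues, residue_mask)
    ]


zero = (0.0, 0.0, 0.0)
residues = [
    [zero, (1562.5, 1363.0, 4743.1), zero],  # from the 70#1MLI_1_A excerpt
    [zero, (1571.2, 1492.5, 4394.3), zero],  # from the 70#1MLI_1_A excerpt
    [zero, zero, zero],                      # hypothetical residue masked '-'
]
print(derive_atom_mask(residues, "++-"))
# [[False, True, False], [False, True, False], [False, False, False]]
```

An atom that genuinely sits at the origin would be misclassified as missing, but with coordinates stored to one decimal place that seems unlikely enough to accept.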

memoryleak47 commented 2 years ago

> Correct. I believe the mask is on the residue level and not the atom level.

Ah, true!

Unless I'm overlooking something, this doesn't seem to be mentioned in the documentation at https://github.com/aqlaboratory/proteinnet/blob/master/docs/proteinnet_records.md or anywhere else on this GitHub page.

Is there some external resource where I could read up on this?

jonathanking commented 2 years ago

I'm afraid I don't have more information. I'm not affiliated with ProteinNet, though I use the provided data and dataset splits in my own research.