BorgwardtLab / proteinshake

Protein structure datasets for machine learning.
https://proteinshake.ai
BSD 3-Clause "New" or "Revised" License

Mismatch of sequence length, residue_number and residue_type #274

Closed by zpengmei 4 months ago

zpengmei commented 5 months ago

Hi team,

Thanks for this helpful repository! I am trying to load some protein datasets here, and there seems to be a discrepancy between the atom-level residue types and the residue numbers that assign atoms to residues. Here is my test code:

from proteinshake.datasets import EnzymeCommissionDataset
dataset = EnzymeCommissionDataset()
proteins = dataset.proteins(resolution='atom')
protein_dict = next(proteins)

print(protein_dict['protein']['ID'])
print('seq length:', len(protein_dict['protein']['sequence']))
# count_segments is a user-defined helper that is not shown in this post
print(count_segments(protein_dict['atom']['residue_number']).shape)
print(count_segments(protein_dict['atom']['residue_type']).shape)

print(
    count_segments(protein_dict['atom']['residue_number']),
    '\n',
    count_segments(protein_dict['atom']['residue_type'])
)

print(protein_dict['atom']['residue_number'][-20:])
print(protein_dict['atom']['residue_type'][-20:])

Here is the output:

1KG6
seq length: 225
(224,)
(210,)
...
...
[222, 222, 222, 222, 223, 223, 223, 223, 223, 223, 223, 223, 223, 224, 224, 224, 224, 224, 224, 224]
['K', 'K', 'K', 'K', 'K', 'K', 'K', 'K', 'K', 'K', 'K', 'K', 'K', 'P', 'P', 'P', 'P', 'P', 'P', 'P']

Looking at the last 20 entries of residue_number and residue_type, they do not seem to match: 222 and 223 both refer to K. Is this something specific to this dataset?

Thanks! Zihan

cgoliver commented 5 months ago

Thank you @zpengmei ! We are looking into it.

Carlos

cgoliver commented 4 months ago

Hello @zpengmei, could you share the code for count_segments so I can fully reproduce your results? Thanks!

cgoliver commented 4 months ago

Hi again @zpengmei ! I took a closer look at your issue.

From what I can tell, there doesn't seem to be a problem, but since I don't have your full code I can't reproduce exactly what you have.

Taking the same protein ID, '1KG6', and checking the lengths of the residue type list, the residue number list, and the sequence, I get the following:

This code:

from proteinshake.datasets import EnzymeCommissionDataset
dataset = EnzymeCommissionDataset()
proteins = dataset.proteins(resolution='atom')
for p in proteins:
    if p['protein']['ID'] == '1KG6':
        print('seq -20:', p['protein']['sequence'][-20:])
        print('seq length:', len(p['protein']['sequence']))
        print('resnum: ', len(p['atom']['residue_number']))
        print('restype: ', len(p['atom']['residue_type']))
        break

This output:

seq -20: QNGCIAAANNSWALYPGKKP
seq length: 225
resnum:  1785
restype:  1785

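As you can see, the residue number and residue type lists have the same length (1785), one entry per atom. The repeated 'K' entries in residue type appear because the protein has two consecutive 'K' amino acids (residues 222 and 223) followed by a 'P' (residue 224), which matches the end of the sequence, 'GKKP'. Consecutive identical residues occur naturally, so counting runs of equal residue types will give fewer segments than counting runs of equal residue numbers.

Unless I am missing something, everything appears to be in order.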
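To illustrate the point, here is a minimal sketch using only the last 20 atom entries quoted earlier in this thread. The helper count_runs is hypothetical: it is an assumption about what a run-counting function might do, not the actual count_segments, which was never shared. It shows that counting runs of residue_type merges the two adjacent lysines, while grouping on the (residue_number, residue_type) pair recovers one entry per residue.

```python
from itertools import groupby

# Last 20 atom-level entries for 1KG6, copied from the output quoted in this thread.
residue_number = [222] * 4 + [223] * 9 + [224] * 7
residue_type = ['K'] * 13 + ['P'] * 7


def count_runs(values):
    """Count maximal runs of equal consecutive values.

    Hypothetical helper for illustration; not the original count_segments.
    """
    return sum(1 for _key, _group in groupby(values))


# Residue numbers change at every residue boundary, so each residue is its own run.
runs_by_number = count_runs(residue_number)

# Adjacent residues of the same type (K at 222 and K at 223) collapse into one run.
runs_by_type = count_runs(residue_type)

# Grouping on the (number, type) pair keeps adjacent identical residues separate.
per_residue_types = [
    key[1] for key, _group in groupby(zip(residue_number, residue_type))
]

print(runs_by_number, runs_by_type, per_residue_types)
```

On this slice, the number-based count sees three residues and the type-based count sees two, which is the same kind of gap as the one reported for the full protein. If a per-residue view is needed, grouping by residue_number (or by the pair, as above) is the safer choice.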

Feel free to reach out if you have any further issues.

Best, Carlos

zpengmei commented 4 months ago

Hi Carlos,

Thank you so much for your time! Sorry for the late reply. I think I missed that there are two 'K's there.

Best, Zihan