a-r-j / graphein

Protein Graph Library
https://graphein.ai/
MIT License

graph feature `sequence_{chain_id}` contains duplicate residues for atomistic graphs #267

Closed kamurani closed 1 year ago

kamurani commented 1 year ago

I'm trying to extract sequences from graphs loaded from PDB files.

When constructing a residue-granularity graph from the protein, the sequence (stored at `g.graph[f'sequence_{chain_id}']`) is as expected. However, when using atom granularity, the graph's sequence attribute contains repeated residue letters (I am guessing one for each atom in the amino acid).

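from graphein.protein.config import ProteinGraphConfig
from graphein.protein.graphs import construct_graph

# Construct graphs, one atomistic and one residue level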
g_atom = construct_graph(
    pdb_code="6HD6",
    config=ProteinGraphConfig(
        pdb_dir=pdb_dir,
        granularity="atom", # atomistic 
    )
)

g_res = construct_graph(
    pdb_code="6HD6",
    config=ProteinGraphConfig(
        pdb_dir=pdb_dir,
        granularity="CA", # residue
    )
)
c = 'A'
g_res.graph[f'sequence_{c}']
'AMDPSSPNYDKWEMERTDITMKHKLGGGQYGEVYEG ....
g_atom.graph[f'sequence_{c}']
'AAAAAMMMMMMMMDDDDDDDDPPPPPPPSSSSSSSSSSS ....

a-r-j commented 1 year ago

Hey @kamurani good catch! This seems like an easy fix, requiring only another case here: https://github.com/a-r-j/graphein/blob/87985a157623a92f01e7942e048fbdae32e26f14/graphein/protein/graphs.py#L496
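
To illustrate the idea behind such a case, here is a minimal, library-independent sketch. It is not the actual graphein code or patch: the function name `chain_sequence`, the `(chain_id, residue_number, residue_name)` tuple layout, and the `THREE_TO_ONE` table are all hypothetical and chosen only for illustration. It assumes atoms arrive in file order and ignores insertion codes. The point is that joining one letter per atom row reproduces the duplicated string, while emitting a letter only when the residue changes gives the expected sequence.

```python
# Hypothetical sketch: none of these names are part of graphein's API.
THREE_TO_ONE = {
    "ALA": "A", "ARG": "R", "ASN": "N", "ASP": "D", "CYS": "C",
    "GLN": "Q", "GLU": "E", "GLY": "G", "HIS": "H", "ILE": "I",
    "LEU": "L", "LYS": "K", "MET": "M", "PHE": "F", "PRO": "P",
    "SER": "S", "THR": "T", "TRP": "W", "TYR": "Y", "VAL": "V",
}


def chain_sequence(atoms, chain_id):
    """Return one letter per residue for the given chain.

    `atoms` is an iterable of (chain_id, residue_number, residue_name)
    tuples, one per atom, assumed to be in file order.
    """
    letters = []
    previous = None  # (residue_number, residue_name) of the last residue seen
    for chain, number, name in atoms:
        if chain != chain_id:
            continue
        key = (number, name)
        if key != previous:  # first atom of a new residue
            letters.append(THREE_TO_ONE.get(name, "X"))  # "X" for unknown names
            previous = key
    return "".join(letters)


# Toy per-atom records: three atoms of ALA 1, two of MET 2, two of ASP 3.
atoms = [
    ("A", 1, "ALA"), ("A", 1, "ALA"), ("A", 1, "ALA"),
    ("A", 2, "MET"), ("A", 2, "MET"),
    ("A", 3, "ASP"), ("A", 3, "ASP"),
    ("B", 1, "GLY"),
]

# Naive approach: one letter per atom row, which duplicates residues.
per_atom = "".join(THREE_TO_ONE[name] for chain, _, name in atoms if chain == "A")

# Collapsed approach: one letter per residue.
per_residue = chain_sequence(atoms, "A")
```

Comparing against the previous residue, rather than collecting residues in a set, preserves the order of the sequence and keeps genuinely repeated neighbouring residues (for example the `GGG` run in the chain A sequence above) as separate letters, since their residue numbers differ.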

Any chance you could make a PR?

kamurani commented 1 year ago

Yep, I'll fix it and open a PR when I get the chance. Cheers!