a-r-j / graphein

Protein Graph Library
https://graphein.ai/
MIT License

keep_het parameter not working #373

Closed davidkastner closed 6 months ago

davidkastner commented 6 months ago

Describe the bug The config parameter keep_hets is currently not working. It seems keep_hets was recently changed from a bool to a list of strings naming the specific HETATM residues to keep, such as keep_hets=["HOH"]. However, after updating the parameter, the specified residues are still not included in the graph. The tutorial installed the newest version, graphein-1.7.6. I haven't had a chance to test earlier versions to see when the keep_hets functionality broke, but I will update this ticket when I have.

To Reproduce This can be seen in the tutorial example of 3EIY, which contains 112 waters. However, when we run:

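from graphein.protein.config import ProteinGraphConfig
from graphein.protein.graphs import construct_graph
# Ask Graphein to keep water (HOH) HETATM records
config = ProteinGraphConfig(keep_hets=["HOH"])
g = construct_graph(config=config, pdb_code="3eiy")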

None of the waters are included in the graph. If we print the nodes with g.nodes() and look at the last residues, we see that no waters were included:

['A:ALA:171', 'A:ASN:172', 'A:PHE:173', 'A:LYS:174', 'A:LYS:175']
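Rather than eyeballing the tail of the node list, the check can be done by counting residue names. This is only a minimal sketch: it assumes node IDs follow the chain:residue_name:residue_number pattern shown above, and residue_name_counts is a hypothetical helper, not part of Graphein.

```python
from collections import Counter

def residue_name_counts(node_ids):
    """Count nodes per residue name, assuming 'chain:residue_name:...' node IDs."""
    return Counter(node_id.split(":")[1] for node_id in node_ids)

# The tail of the node list reported above; in practice pass list(g.nodes()).
nodes = ['A:ALA:171', 'A:ASN:172', 'A:PHE:173', 'A:LYS:174', 'A:LYS:175']
counts = residue_name_counts(nodes)
print(counts["HOH"])  # a Counter returns 0 for names that never appear
```

If keep_hets were taking effect, counts["HOH"] on the full node list would be expected to be non-zero.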

Expected behavior If I understand correctly, the expected behavior of keep_hets would be for the waters to now be included in the graph representation.

Screenshots Here is a screenshot of the representation of 3EIY, where we can see that only the protein residues are included.

[Screenshot 2024-03-12 at 11 23 03 AM]

Desktop (please complete the following information): This was reproduced using the Google Colab notebook with graphein-1.7.6 installed. No other modifications were made to the tutorial.

a-r-j commented 6 months ago

Hi @davidkastner, good catch. This is a slightly tricky issue to resolve.

I think the omission of the water nodes comes from here: https://github.com/a-r-j/graphein/blob/6dae5ff114a40410566f6fea4e558b2b9a6ba580/graphein/protein/graphs.py#L199

That is where we select only the CA atoms to count as nodes, and waters have no CA atom. I think if you use granularity="atom" the waters will be present.
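To make the mechanism concrete, here is a toy illustration of that selection step. It is a sketch under assumptions, not the actual Graphein code: select_nodes is a hypothetical stand-in for the filtering logic, and the records are plain dicts with assumed column names rather than a real PDB dataframe.

```python
# Toy atom records; the column names are assumed, in the style of a PDB dataframe.
atoms = [
    {"record_name": "ATOM", "residue_name": "LYS", "atom_name": "N"},
    {"record_name": "ATOM", "residue_name": "LYS", "atom_name": "CA"},
    {"record_name": "ATOM", "residue_name": "LYS", "atom_name": "C"},
    {"record_name": "HETATM", "residue_name": "HOH", "atom_name": "O"},
]

def select_nodes(records, granularity="CA"):
    """Keep every atom for "atom" granularity, else only atoms named `granularity`."""
    if granularity == "atom":
        return list(records)
    return [r for r in records if r["atom_name"] == granularity]

ca_nodes = select_nodes(atoms, "CA")
atom_nodes = select_nodes(atoms, "atom")
print([r["residue_name"] for r in ca_nodes])    # ['LYS']
print([r["residue_name"] for r in atom_nodes])  # ['LYS', 'LYS', 'LYS', 'HOH']
```

The water's only atom is named O, so it survives keep_hets but is dropped by the CA filter. With atom-level granularity nothing is filtered by atom name, which is why the waters should reappear.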

For heteroatoms it can be tricky to define consistently and universally what the coarsened node should be. I think a good heuristic for coarsened graphs could be the center of mass (CoM) of the ligand. One workaround would be to write your own hetatm_df_processing_func that manipulates the HETATM dataframe so that each residue contains a representative "CA" atom.
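A rough sketch of that CoM heuristic is below. Everything in it is an assumption for illustration: representative_ca_rows is a hypothetical helper, the rows are plain dicts with assumed column names instead of a dataframe, the mass table is approximate and incomplete, and how such a function would be wired into the Graphein config is not shown.

```python
from collections import defaultdict

# Approximate atomic masses (u) for a few common elements; extend as needed.
MASSES = {"H": 1.008, "C": 12.011, "N": 14.007, "O": 15.999, "S": 32.06}

def representative_ca_rows(hetatm_rows):
    """Collapse each HETATM residue to one pseudo-"CA" row at its center of mass."""
    groups = defaultdict(list)
    for row in hetatm_rows:
        key = (row["chain_id"], row["residue_name"], row["residue_number"])
        groups[key].append(row)

    out = []
    for (chain, name, number), rows in groups.items():
        total_mass = sum(MASSES[r["element_symbol"]] for r in rows)
        com = {
            axis: sum(MASSES[r["element_symbol"]] * r[axis] for r in rows) / total_mass
            for axis in ("x_coord", "y_coord", "z_coord")
        }
        out.append({"chain_id": chain, "residue_name": name,
                    "residue_number": number, "atom_name": "CA", **com})
    return out

# Made-up coordinates: one water oxygen and a two-atom carbon monoxide ligand.
example = [
    {"chain_id": "A", "residue_name": "HOH", "residue_number": 201,
     "element_symbol": "O", "x_coord": 1.0, "y_coord": 2.0, "z_coord": 3.0},
    {"chain_id": "A", "residue_name": "CMO", "residue_number": 202,
     "element_symbol": "C", "x_coord": 0.0, "y_coord": 0.0, "z_coord": 0.0},
    {"chain_id": "A", "residue_name": "CMO", "residue_number": 202,
     "element_symbol": "O", "x_coord": 1.0, "y_coord": 0.0, "z_coord": 0.0},
]
rows = representative_ca_rows(example)
```

Each heteroatom residue ends up as a single row named "CA", so a CA-based selection would keep it. For a single-atom water the pseudo-CA simply sits on the oxygen; for multi-atom ligands it sits at the mass-weighted center, closer to the heavier atoms.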

We looked into this quite extensively for protein-ligand graphs (see #164, mainly here: https://github.com/a-r-j/graphein/blob/d81fc2f77b3562f61f70f257ddf509d5102b8bf6/graphein/protein_ligand/graphs.py).

What's your application? Using graphein.protein.tensor.data.Protein should work reliably if it's ML-based.

davidkastner commented 6 months ago

Hi @a-r-j. I see the problem and agree it would be challenging to generalize! For my purposes, the atom representation will work well, as I am building graphs for QM cluster models extracted from proteins. Because the QM cluster models are small, the extra information afforded by the atom representation will be useful. I appreciate your response and will close the issue as resolved, but hopefully it will be a useful point of reference for others.