a-r-j / graphein

Protein Graph Library
https://graphein.ai/
MIT License
1.03k stars, 131 forks

Apply same preprocessing as graphs to downloaded PDB in DSSP calculation #111

Status: Closed (diamondspark closed this issue 2 years ago)

diamondspark commented 2 years ago

In the method Protein/features/nodes/dssp/add_dssp_df(), Biopython's DSSP calculation is invoked on the downloaded, unprocessed PDB file. As in #98, the resulting DSSP dataframe sometimes has a different number of residues than the generated protein graph. I believe the preprocessing steps performed in Protein/graph.py also need to be applied in Protein/features/nodes/dssp/add_dssp_df().

E.g. PDB: 1utm, 2qrh
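One way to see the mismatch described above is to compare the residue identifiers in the DSSP dataframe with the node identifiers of the graph. The following is a minimal sketch, not graphein code: the helper name compare_residue_ids and the ID strings are hypothetical, and it assumes both sides can be reduced to comparable string identifiers.

```python
def compare_residue_ids(dssp_ids, graph_ids):
    """Return the residue IDs that appear in only one of the two collections."""
    dssp_set, graph_set = set(dssp_ids), set(graph_ids)
    return {
        "only_in_dssp": sorted(dssp_set - graph_set),
        "only_in_graph": sorted(graph_set - dssp_set),
    }


# Hypothetical identifiers for illustration; real IDs may use another format.
dssp_ids = ["A:MET:1", "A:LYS:2", "A:GLY:3"]
graph_ids = ["A:MET:1", "A:LYS:2"]

report = compare_residue_ids(dssp_ids, graph_ids)
print(report)
```

In practice the two inputs would presumably come from the dataframe's index and from the graph's nodes, which would show which residues the preprocessing removed.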

Thank you!

a-r-j commented 2 years ago

Hi @diamondspark, this is somewhat intentional. The graph attribute (g.graph["dssp_df"]) is mostly intended for traceability and record keeping. The actual features should be available as node attributes.

# Print the attribute names stored on the first node, then stop
for n, d in g.nodes(data=True):
    print(d.keys())
    break

If you require a dataframe, there is a function for retrieving node features as a dataframe.
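The library function referred to above is not named in this thread. As a generic alternative, the sketch below flattens (node, attributes) pairs of the kind yielded by g.nodes(data=True) into one row per node. The helper name node_features_to_rows and the attribute names "ss" and "asa" are illustrative assumptions, not graphein's API.

```python
def node_features_to_rows(node_data):
    """Flatten (node_id, attribute_dict) pairs into one dict per node."""
    rows = []
    for node_id, attrs in node_data:
        row = {"node_id": node_id}
        row.update(attrs)  # copy every node attribute into the row
        rows.append(row)
    return rows


# Stand-in for g.nodes(data=True); the attribute names are illustrative only.
example_nodes = [
    ("A:MET:1", {"ss": "H", "asa": 120.0}),
    ("A:LYS:2", {"ss": "E", "asa": 45.5}),
]

rows = node_features_to_rows(example_nodes)
print(rows[0])
```

The resulting list of dicts can be passed to pandas.DataFrame(rows) if a dataframe is needed. Because the rows are built from the graph's nodes, they cover exactly the residues in the graph.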

Do you have very strong feelings about this? I wonder what the best way to allow control of this is. Perhaps in the DSSPConfig object we can have a parameter controlling whether or not to filter the dataframe.
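To make the proposal concrete, here is a rough sketch of what such a switch could look like. Everything in it is hypothetical: DSSPFilterOptions, filter_dataframe and select_dssp_rows are invented for illustration and are not part of graphein's DSSPConfig.

```python
from dataclasses import dataclass


@dataclass
class DSSPFilterOptions:
    """Hypothetical options object; not part of graphein's DSSPConfig."""

    filter_dataframe: bool = False  # keep the unfiltered record by default


def select_dssp_rows(dssp_rows, graph_node_ids, options):
    """Return DSSP rows, restricted to graph nodes when filtering is enabled."""
    if not options.filter_dataframe:
        return list(dssp_rows)
    keep = set(graph_node_ids)
    return [row for row in dssp_rows if row["node_id"] in keep]


# Illustrative data: DSSP saw two residues, the graph kept only one of them.
dssp_rows = [
    {"node_id": "A:MET:1", "ss": "H"},
    {"node_id": "A:GLY:3", "ss": "-"},
]
graph_node_ids = ["A:MET:1"]

unfiltered = select_dssp_rows(dssp_rows, graph_node_ids, DSSPFilterOptions())
filtered = select_dssp_rows(
    dssp_rows, graph_node_ids, DSSPFilterOptions(filter_dataframe=True)
)
print(len(unfiltered), len(filtered))
```

Defaulting the flag to False would preserve the current behaviour, where the dataframe serves as an unfiltered record, and let users opt in to a dataframe that matches the graph.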

With respect to applying DSSP to the unprocessed PDB: I think this is the correct thing to do. I don’t think DSSP will run correctly on a processed structure such as a CA-only one.
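The concern about CA-only structures can be illustrated with a simple pre-check. DSSP derives secondary structure from backbone hydrogen bonds, so it needs the backbone N, CA, C and O atoms of each residue. The sketch below is an assumption-laden illustration, not what graphein or Biopython does internally: the helper name and the input layout (residue ID mapped to atom names) are hypothetical.

```python
BACKBONE_ATOMS = {"N", "CA", "C", "O"}


def residues_missing_backbone(residue_atoms):
    """Map each incomplete residue ID to its sorted missing backbone atoms."""
    missing = {}
    for res_id, atoms in residue_atoms.items():
        absent = BACKBONE_ATOMS - set(atoms)
        if absent:
            missing[res_id] = sorted(absent)
    return missing


# Hypothetical inputs: a full-atom residue and the same residue reduced to CA.
full_atom = {"A:MET:1": ["N", "CA", "C", "O", "CB"]}
ca_only = {"A:MET:1": ["CA"]}

print(residues_missing_backbone(full_atom))
print(residues_missing_backbone(ca_only))
```

A structure for which this check reports missing atoms on most residues is unlikely to be a useful DSSP input, which is one reason to run DSSP before any atom-level filtering.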

a-r-j commented 2 years ago

Hey @diamondspark, any comments on this? If not, I will close.

diamondspark commented 2 years ago

Hi @a-r-j, I haven't had time to get back to this, but I think g.nodes(data=True) should serve my purpose. Thank you once again.