a-r-j / graphein

Protein Graph Library
https://graphein.ai/
MIT License

Add insertion code to node_id when insertions are set to True #354

Closed biochunan closed 8 months ago

biochunan commented 8 months ago

Hi, I was trying the "Subgraphing to Protein Surface" example and encountered the following error: the rsa feature was missing from some nodes' attributes.

Reproduce error

# basic 
from functools import partial
from pathlib import Path
# graphein 
from graphein.protein.graphs import construct_graph
from graphein.protein.features.nodes import rsa 
from graphein.protein.edges import distance as D 
from graphein.protein.config import ProteinGraphConfig, DSSPConfig
from graphein.protein.subgraphs import extract_surface_subgraph

# ---------- graph config ----------
params_to_change = {
    "granularity": "centroids",  # "atom", "CA", "centroids"
    "insertions": True,
    "edge_construction_functions": [
        # graphein.protein.edges.distance.add_peptide_bonds,
        D.add_distance_to_edges,
        D.add_hydrogen_bond_interactions,
        D.add_ionic_interactions,
        D.add_backbone_carbonyl_carbonyl_interactions,
        D.add_salt_bridges,
        # distance-threshold edges
        partial(D.add_distance_threshold, long_interaction_threshold=4, threshold=4.5),
        ],
    'dssp_config': DSSPConfig(executable="/usr/bin/mkdssp"),
    'graph_metadata_functions': [rsa],
    }
config = ProteinGraphConfig(**params_to_change)
# ---------- input struct ----------
pdb_path = Path('input_pdb_cryst1.pdb')
g = construct_graph(config=config, path=pdb_path, verbose=False)
# ---------- surface subgraph ----------
RSA_THRESHOLD = 0.2
s_g = extract_surface_subgraph(g, RSA_THRESHOLD)

This leads to the following error:

ProteinGraphConfigurationError: RSA not defined for all nodes (H:TYR:52:A). Please ensure you have                     used graphein.protein.nodes.features.dssp.rsa as a graph                         annotation function.

Because I set insertions to True in the config, my node IDs contain insertion codes. However, the node_id column added at dssp.py#L139C1-L145C6 does not take insertions into account. As a result, in add_dssp_features (line 211, dict(dssp_df[feature])), residues such as H:TYR:100:A, H:TYR:100:B, etc. are overwritten under the same node_id key, H:TYR:100.
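
A minimal sketch of the collision, using a toy pandas DataFrame rather than real DSSP output. The column names (chain, aa, resnum, icode), the rsa values, and the graph_node_ids list are assumptions made up for illustration; only the dict(dssp_df[feature]) pattern comes from the code referenced above.

```python
import pandas as pd

# Toy stand-in for the DSSP table (hypothetical columns and values):
# two TYR residues share number 100 on chain H and differ only by insertion code.
dssp_df = pd.DataFrame(
    {
        "chain": ["H", "H", "H"],
        "aa": ["TYR", "TYR", "GLY"],
        "resnum": [100, 100, 101],
        "icode": ["A", "B", " "],
        "rsa": [0.10, 0.55, 0.30],
    }
)

# Build node IDs WITHOUT the insertion code, so both TYR rows get the same ID.
dssp_df["node_id"] = (
    dssp_df["chain"] + ":" + dssp_df["aa"] + ":" + dssp_df["resnum"].astype(str)
)
dssp_df = dssp_df.set_index("node_id")

# Same pattern as dict(dssp_df[feature]): three rows collapse onto two keys.
rsa_by_node = dict(dssp_df["rsa"])

# Node IDs as the graph would hold them when insertions=True (assumed here).
graph_node_ids = ["H:TYR:100:A", "H:TYR:100:B", "H:GLY:101"]
missing = [n for n in graph_node_ids if n not in rsa_by_node]

print(sorted(rsa_by_node))  # keys available from the DSSP side
print(missing)              # graph nodes that would receive no rsa value
```

In this sketch the two insertion-code nodes have no matching key on the DSSP side, which is consistent with the "RSA not defined for all nodes" error above.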

Adding the following lines (adapted from label_node_id) right after dssp.py#L139C1-L145C6 fixed it:

if G.graph['config'].insertions:
    dssp_dict["node_id"] = dssp_dict["node_id"] + ":" + dssp_dict["icode"].apply(str)
    # Replace trailing : for non insertions
    dssp_dict["node_id"] = dssp_dict["node_id"].str.replace(r":\s*$", "", regex=True)
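
To check what these two lines do in isolation, here is a small self-contained sketch on a toy DataFrame. The `insertions` flag stands in for G.graph['config'].insertions, and the assumption that a residue without an insertion code carries a single space in icode is mine, not something confirmed from the DSSP parser.

```python
import pandas as pd

# Toy table (hypothetical values): node IDs as built before the fix,
# plus an icode column where " " is assumed to mean "no insertion code".
dssp_dict = pd.DataFrame(
    {
        "node_id": ["H:TYR:100", "H:TYR:100", "H:GLY:101"],
        "icode": ["A", "B", " "],
    }
)

insertions = True  # stand-in for G.graph['config'].insertions

if insertions:
    # Append the insertion code to every node ID.
    dssp_dict["node_id"] = dssp_dict["node_id"] + ":" + dssp_dict["icode"].apply(str)
    # Strip the trailing ":" (and any whitespace) left on residues with no insertion code.
    dssp_dict["node_id"] = dssp_dict["node_id"].str.replace(r":\s*$", "", regex=True)

node_ids = dssp_dict["node_id"].tolist()
print(node_ids)
```

With the insertion code appended, the two TYR rows get distinct IDs, while the GLY row keeps its original ID because the regex removes the dangling separator.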

a-r-j commented 8 months ago

Thanks @biochunan, good spot. Could you please open a PR? :)

biochunan commented 8 months ago

Cool, I've opened PR #355.