a-r-j / graphein

Protein Graph Library
https://graphein.ai/
MIT License

local pdb + dssp = error #171

Closed avivko closed 2 years ago

avivko commented 2 years ago

Describe the bug

When a local PDB file is used with a ProteinGraphConfig that contains a DSSPConfig, the DSSP step always tries to download the PDB file, even if it is already in the pdb_dir, and throws an error if the name of the PDB file is not a valid PDB ID.

To Reproduce

from functools import partial
from graphein.protein.subgraphs import extract_subgraph_from_chains
from graphein.protein.graphs import construct_graph
from graphein.protein.config import ProteinGraphConfig, DSSPConfig
from graphein.protein.features.nodes.amino_acid import expasy_protein_scale, meiler_embedding
from graphein.protein.features.nodes import asa, rsa
from graphein.protein.edges.distance import (add_peptide_bonds,
                                             add_hydrogen_bond_interactions,
                                             add_disulfide_interactions,
                                             add_ionic_interactions,
                                             add_aromatic_interactions,
                                             add_aromatic_sulphur_interactions,
                                             add_cation_pi_interactions
                                            )

conf_functions = {"edge_construction_functions": [add_peptide_bonds,
                                                  add_aromatic_interactions,
                                                  add_hydrogen_bond_interactions,
                                                  add_disulfide_interactions,
                                                  add_ionic_interactions,
                                                  add_aromatic_sulphur_interactions,
                                                  add_cation_pi_interactions],
                  "graph_metadata_functions": [asa, rsa],  # Add ASA and RSA features.
                  "node_metadata_functions": [meiler_embedding, partial(expasy_protein_scale, add_separate=True)],  # Add expasy features (partial: each feature is added under a separate key).
                  "dssp_config": DSSPConfig(),  # Add DSSP config in order to compute ASA and RSA.
                  "pdb_dir": '/vol/tmp/kormanav/pdb_dir'
                 }
batch_config = ProteinGraphConfig(**conf_functions)

construct_graph(config=batch_config, pdb_path="/vol/tmp/kormanav/6rew_copy.pdb")

Results in:

DEBUG:graphein.protein.graphs:Deprotonating protein. This removes H atoms from the pdb_df dataframe
DEBUG:graphein.protein.graphs:Detected 1365 total nodes
INFO:graphein.protein.edges.distance:Found: 234 aromatic-aromatic interactions
INFO:graphein.protein.edges.distance:Found 532 hbond interactions.
INFO:graphein.protein.edges.distance:Found 55 hbond interactions.
DEBUG:graphein.protein.edges.distance:0 CYS residues found. Cannot add disulfide interactions with fewer than two CYS residues.
INFO:graphein.protein.edges.distance:Found 11848 ionic interactions.
Downloading PDB structure '6rew_copy'...
ERROR:graphein.protein.utils:PDB file 6rew_copy not found and no replacement                       structure found in obsolete lookup.
Desired structure doesn't exists
---------------------------------------------------------------------------
FileNotFoundError                         Traceback (most recent call last)
Input In [20], in <cell line: 39>()
     37 graph_list = []
     38 y_list = []
---> 39 construct_graph(config=batch_config, pdb_path="/vol/tmp/kormanav/6rew_copy.pdb")
     40 """for idx, pdb_p in enumerate(tqdm(pdb_paths)):
     41     print('pdb:', pdb_p)
     42     try:
   (...)
     46         print(f'PDB #{idx}: processing error!')
     47         pass"""

File ~/repos_dev/graphein/graphein/protein/graphs.py:614, in construct_graph(config, pdb_path, pdb_code, chain_selection, df_processing_funcs, edge_construction_funcs, edge_annotation_funcs, node_annotation_funcs, graph_annotation_funcs)
    612 # Annotate additional graph metadata
    613 if config.graph_metadata_functions is not None:
--> 614     g = annotate_graph_metadata(g, config.graph_metadata_functions)
    616 # Annotate additional edge metadata
    617 if config.edge_metadata_functions is not None:

File ~/repos_dev/graphein/graphein/utils/utils.py:69, in annotate_graph_metadata(G, funcs)
     58 """
     59 Annotates graph with graph-level metadata
     60 
   (...)
     66 :rtype: nx.Graph
     67 """
     68 for func in funcs:
---> 69     func(G)
     70 return G

File ~/repos_dev/graphein/graphein/protein/features/nodes/dssp.py:239, in asa(G)
    230 def asa(G: nx.Graph) -> nx.Graph:
    231     """
    232     Adds ASA of each residue in protein graph as calculated by DSSP.
    233 
   (...)
    237     :rtype: nx.Graph
    238     """
--> 239     return add_dssp_feature(G, "asa")

File ~/repos_dev/graphein/graphein/protein/features/nodes/dssp.py:174, in add_dssp_feature(G, feature)
    144 """
    145 Adds add_dssp_feature specified amino acid feature as calculated
    146 by DSSP to every node in a protein graph
   (...)
    171 :rtype: nx.Graph
    172 """
    173 if "dssp_df" not in G.graph:
--> 174     G = add_dssp_df(G, G.graph["config"].dssp_config)
    176 config = G.graph["config"]
    177 dssp_df = G.graph["dssp_df"]

File ~/repos_dev/graphein/graphein/protein/features/nodes/dssp.py:122, in add_dssp_df(G, dssp_config)
    120 # Check for existence of pdb file. If not, download it.
    121 if not os.path.isfile(config.pdb_dir / pdb_id):
--> 122     pdb_file = download_pdb(config, pdb_id)
    123 else:
    124     pdb_file = config.pdb_dir + pdb_id + ".pdb"

File ~/repos_dev/graphein/graphein/protein/utils.py:100, in download_pdb(config, pdb_code)
     95         log.error(
     96             f"PDB file {pdb_code} not found and no replacement \
     97                   structure found in obsolete lookup."
     98         )
     99 # Rename file to .pdb from .ent
--> 100 os.rename(
    101     config.pdb_dir / f"pdb{pdb_code}.ent",
    102     config.pdb_dir / f"{pdb_code}.pdb",
    103 )
    105 # Assert file has been downloaded
    106 assert any(pdb_code in s for s in os.listdir(config.pdb_dir))

FileNotFoundError: [Errno 2] No such file or directory: '/vol/tmp/kormanav/pdb_dir/pdb6rew_copy.ent' -> '/vol/tmp/kormanav/pdb_dir/6rew_copy.pdb'

The same error occurs when the file is placed in the pdb_dir:

ls /vol/tmp/kormanav/pdb_dir
3eiy.pdb
6rew_copy.pdb   
6rew.pdb


avivko commented 2 years ago

@a-r-j do you want to fix it or should I have a go at it?

a-r-j commented 2 years ago

I've penciled in finishing a few pending PRs today. If you want it fast, I'd appreciate a contribution :)

Should be an easy fix: I think it's just missing a .pdb extension in this line: https://github.com/a-r-j/graphein/blob/1c87afdcba4ce45c17242c8e74d2a1d60d6b9076/graphein/protein/features/nodes/dssp.py#L121
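To illustrate the suspected cause, here is a minimal sketch that uses only the standard library and a temporary directory in place of the real pdb_dir. The helper `find_local_pdb` is hypothetical and not part of graphein. It only shows how an existence check with the `.pdb` extension appended behaves compared with the extension-less check seen in the traceback. It is not the actual patch.

```python
import tempfile
from pathlib import Path
from typing import Optional


def find_local_pdb(pdb_dir: Path, pdb_id: str) -> Optional[Path]:
    """Return <pdb_dir>/<pdb_id>.pdb if that file exists, else None.

    Hypothetical helper for illustration only; not a graphein function.
    """
    candidate = Path(pdb_dir) / f"{pdb_id}.pdb"
    return candidate if candidate.is_file() else None


with tempfile.TemporaryDirectory() as tmp:
    pdb_dir = Path(tmp)
    # Empty stand-in for the local structure file from the report.
    (pdb_dir / "6rew_copy.pdb").touch()

    # Equivalent to the check in the traceback: the path has no extension,
    # so it does not match the file and a download would be attempted.
    found_without_ext = (pdb_dir / "6rew_copy").is_file()

    # With the extension appended, the local file is found.
    found_with_ext = find_local_pdb(pdb_dir, "6rew_copy") is not None

    # A name with no matching file would still fall through to a download.
    missing = find_local_pdb(pdb_dir, "not_there")

print(found_without_ext, found_with_ext, missing)
```

Assuming `config.pdb_dir` is a `pathlib.Path` (as the `/` operator in the traceback suggests), the `else` branch shown at line 124 of the traceback would also need `/` rather than `+` to build the path, since adding a `str` to a `Path` raises a `TypeError`.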

avivko commented 2 years ago

Was just about to work on the hotfix, but I see @OliverT1 was quicker than I was :) Hopefully the PR will be merged soon!