a-r-j / graphein

Protein Graph Library
https://graphein.ai/
MIT License
1.03k stars 131 forks source link

Buggy graph encoding #205

Closed Vincentx15 closed 2 years ago

Vincentx15 commented 2 years ago

Hi folks, thanks for this package !

I'm experiencing bugs when creating graphs from pdb. First of all, if no node_metadata_functions are given (comment below), it defaults to using meiler embeddings which use a file that cannot be found in : graphein/protein/features/nodes/meiler_embeddings.csv. Second of all, some nodes are empty, which raises errors when trying to add edges.

I enclose a small script that raises the errors for a random PDB (please download the pdb data/4f5s.ent.gz).

import graphein.protein as gp
from graphein.protein.config import ProteinGraphConfig
from graphein.protein.graphs import construct_graph
from graphein.protein.features.nodes.amino_acid import amino_acid_one_hot

new_funcs = {
    "granularity": 'CA',
    "keep_hets": [False],
    "edge_construction_functions":
        [
         gp.add_disulfide_interactions,
         gp.add_ionic_interactions,
         ],
    "node_metadata_functions": [amino_acid_one_hot]
}
config = ProteinGraphConfig(**new_funcs)
_ = construct_graph(config=config, pdb_path=f"data/4f5s.ent.gz", chain_selection="A")

This should return the following error :

Traceback (most recent call last):
  File "/home/vmallet/projects/foldmine/toto.py", line 48, in <module>
    _ = construct_graph(config=config, pdb_path=f"data/4f5s.ent.gz", chain_selection="A")
  File "/home/vmallet/anaconda3/envs/foldmine/lib/python3.10/site-packages/graphein/protein/graphs.py", line 719, in construct_graph
    g = compute_edges(
  File "/home/vmallet/anaconda3/envs/foldmine/lib/python3.10/site-packages/graphein/protein/graphs.py", line 587, in compute_edges
    func(G)
  File "/home/vmallet/anaconda3/envs/foldmine/lib/python3.10/site-packages/graphein/protein/edges/distance.py", line 315, in add_ionic_interactions
    and G.nodes[r2]["residue_name"] in NEG_AA
KeyError: 'residue_name'

I'm using Ubuntu 20, python 3.10, graphein 1.5.1

Thanks in advance !

a-r-j commented 2 years ago

Hi @Vincentx15 thanks for the bug report - I can reproduce it. I’ll check it out this weekend.

Vincentx15 commented 2 years ago

Hello guys, do you have any updates on this ? Or maybe an idea about how long this issue would take to be solved ?

Thanks a lot for your work !

a-r-j commented 2 years ago

Hey @Vincentx15 - apologies. It’s a super busy time for me. This is very much on my to do list. I will try to get at it later this week if not early next. Thanks for your patience.

Vincentx15 commented 2 years ago

Ok no problem, I understand ! Thanks a lot for the support :)

1511878618 commented 2 years ago

It looks like the meiler_embeddings.csv should have been removed from the package in some recent update, and the code still loads it from the original location. I found the file in the repositorya-r-j/graphein/master/graphein/protein/features/nodes/meiler_embeddings.csv. I downloaded it and put it in my own path in the package's directory:Anaconda/envs/pyg/lib/python3.9/site-packages/graphein/protein/features/nodes and then it looks like works.

This file is supposed to be used to embed amino acids, and currently it looks like is from https://link.springer.com/article/10.1007/s008940100038. Perhaps there could be an api or somethings to allow users to customize this embedding methods instead of using the official default embedding

1511878618 commented 2 years ago

ok, i find how to modif this embedding func.

class graphein.protein.config.ProteinGraphConfig(*, granularity: typing.Union[typing.Literal['N', 'CA', 'C', 'O', 'CB', 'OG', 'CG', 'CD1', 'CD2', 'CE1', 'CE2', 'CZ', 'OD1', 'ND2', 'CG1', 'CG2', 'CD', 'CE', 'NZ', 'OD2', 'OE1', 'NE2', 'OE2', 'OH', 'NE', 'NH1', 'NH2', 'OG1', 'SD', 'ND1', 'SG', 'NE1', 'CE3', 'CZ2', 'CZ3', 'CH2', 'OXT'], typing.Literal['atom', 'centroids']] = 'CA', keep_hets: bool = False, insertions: bool = False, pdb_dir: pathlib.Path = PosixPath('../examples/pdbs'), verbose: bool = False, exclude_waters: bool = True, deprotonate: bool = False, protein_df_processing_functions: typing.List[typing.Callable] = None, edge_construction_functions: typing.List[typing.Union[typing.Callable, str]] = [<function add_peptide_bonds>], node_metadata_functions: typing.List[typing.Union[typing.Callable, str]] = [<function meiler_embedding>], edge_metadata_functions: typing.List[typing.Union[typing.Callable, str]] = None, graph_metadata_functions: typing.List[typing.Callable] = None, get_contacts_config: graphein.protein.config.GetContactsConfig = None, dssp_config: graphein.protein.config.DSSPConfig = None)[[source]](https://graphein.ai/_modules/graphein/protein/config.html#ProteinGraphConfig)

by setting node_metadata_functions value as embedding function.

a-r-j commented 2 years ago

Thanks @1511878618 that's super helpful. Looks like the embedding file wasn't packaged in the tarball on PyPi. I'll try to reup with the missing file.

1511878618 commented 2 years ago

Ok, and i see some comments on the parameters of function like below image And it seems your example code on doc about this function is not work anymore, I'm wondering whether the older version of graphein is stable? and which version could be more stable........hmmm image image Even, i remove these parameters, it still not works. Also i see some errors on your doc and is it Is it because of these comments happens on function define

a-r-j commented 2 years ago

Hi @1511878618 could you open a separate issue for this. The docs do need to be updated for the Dataset utils and those comments removed. In the example you shared, the problem is that the graph_label_map arg has been replaced with graph_labels (which accepts a list of tensors rather than a dictionary).

1511878618 commented 2 years ago

ok i'll do it

a-r-j commented 2 years ago

@Vincentx15 I figured out the cause of the problem.

The add_disulfide_interactions function isn't checking for node membership in the graph. Your protein seems to contain several inter-chain disulfide interactions which are being found but the B chain CYS residues are being added to the graph. This causes mismatch in shape with the distance matrix and adjacency matrix downstream.

You can see this here:

import graphein.protein as gp
from graphein.protein.config import ProteinGraphConfig
from graphein.protein.graphs import construct_graph
from graphein.protein.features.nodes.amino_acid import amino_acid_one_hot

new_funcs = {
    "granularity": 'CA',
    "keep_hets": [False],
    "edge_construction_functions":
        [  
         gp.add_ionic_interactions,
         ],
    "node_metadata_functions": [amino_acid_one_hot]
}
config = ProteinGraphConfig(**new_funcs)

g = construct_graph(config=config, pdb_code="4f5s", chain_selection="A")
print(len(g.nodes)) #583 nodes

gp.add_disulfide_interactions(g)
print(len(g.nodes)) #617 nodes
print(g.nodes) # here we can see the CYS residues on the B chain have been added to the graph.

It's a pretty straightforward fix & I'll push an update ASAP :)

a-r-j commented 2 years ago

And until I get around to the PR, here's a fixed disulfide function. I've tested in the example you used above and it works as expected :)

import networkx as nx
import pandas as pd
from typing import Optional

from graphein.protein.resi_atoms import DISULFIDE_ATOMS, DISULFIDE_RESIS
from graphein.protein.edges.distance import add_interacting_resis, get_interacting_atoms, compute_distmat, filter_dataframe

import loguru as log

def add_disulfide_interactions(
    G: nx.Graph, rgroup_df: Optional[pd.DataFrame] = None
):
    """
    Find all disulfide interactions between CYS residues
    (:const:`~graphein.protein.resi_atoms.DISULFIDE_RESIS`,
    :const:`~graphein.protein.resi_atoms.DISULFIDE_ATOMS`).
    Criteria: sulfur atom pairs are within 2.2A of each other.
    :param G: networkx protein graph
    :type G: nx.Graph
    :param rgroup_df: pd.DataFrame containing rgroup data, defaults to ``None``,
        which retrieves the df from the provided nx graph.
    :type rgroup_df: pd.DataFrame, optional
    """
    # Check for existence of at least two Cysteine residues
    residues = [d["residue_name"] for _, d in G.nodes(data=True)]
    if residues.count("CYS") < 2:
        log.debug(
            f"{residues.count('CYS')} CYS residues found. Cannot add disulfide interactions with fewer than two CYS residues."
        )
        return

    if rgroup_df is None:
        rgroup_df = G.graph["rgroup_df"]
    disulfide_df = filter_dataframe(
        rgroup_df, "residue_name", DISULFIDE_RESIS, True
    )
    disulfide_df = filter_dataframe(
        disulfide_df, "atom_name", DISULFIDE_ATOMS, True
    )
    disulfide_df = filter_dataframe(disulfide_df, "node_id", list(G.nodes), True) # this line is the fix

    distmat = compute_distmat(disulfide_df)
    interacting_atoms = get_interacting_atoms(2.2, distmat)
    add_interacting_resis(G, interacting_atoms, disulfide_df, ["disulfide"])
Vincentx15 commented 2 years ago

Hey guys, thanks a lot for the fix. Splitting in chains solved it for me !

Best, Vincent

a-r-j commented 2 years ago

This should be fully resolved in 1.5.2 (pip install graphein==1.5.2). Closing the issue - let me know if you run into any trouble :)