a-r-j / graphein

Protein Graph Library
https://graphein.ai/
MIT License

construct_graphs_mp with esm_sequence_embedding throws error #196

Closed: avivko closed this issue 2 years ago

avivko commented 2 years ago

Describe the bug

Constructing graphs with the multiprocessing method (construct_graphs_mp) does not work with ESM sequence embedding (i.e. with a config that includes esm_sequence_embedding in graph_metadata_functions).

Error message:

---------------------------------------------------------------------------
BrokenProcessPool                         Traceback (most recent call last)
Input In [20], in <cell line: 12>()
      4 conf_functions = {"edge_construction_functions": [gp.add_peptide_bonds],
      5                   "graph_metadata_functions": [gp.esm_sequence_embedding],       
      6                   "node_metadata_functions": [gp.amino_acid_one_hot]
      7                  }  
      9 pg_config = gp.ProteinGraphConfig(**conf_functions)
---> 12 graph_dict = gp.construct_graphs_mp(
     13     pdb_code_it=['6rew', '7saf'],
     14     config=pg_config,
     15     num_cores=8
     16     )

File ~/graphein/graphein/protein/graphs.py:851, in construct_graphs_mp(pdb_code_it, pdb_path_it, uniprot_id_it, chain_selections, model_indices, config, num_cores, return_dict, out_path)
    846     model_indices = [1] * len(pdbs)
    848 constructor = partial(_mp_graph_constructor, source=source, config=config)
    850 graphs = list(
--> 851     process_map(
    852         constructor,
    853         [
    854             (pdb, chain_selections[i], model_indices[i])
    855             for i, pdb in enumerate(pdbs)
    856         ],
    857         max_workers=num_cores,
    858     )
    859 )
    860 if out_path is not None:
    861     [
    862         nx.write_gpickle(
    863             g, str(f"{out_path}/" + f"{g.graph['name']}.pickle")
    864         )
    865         for g in graphs
    866     ]

File /glusterfs/dfs-gfs-dist/kormanav/miniconda3/envs/graphein-gpu/lib/python3.8/site-packages/tqdm/contrib/concurrent.py:130, in process_map(fn, *iterables, **tqdm_kwargs)
    128     tqdm_kwargs = tqdm_kwargs.copy()
    129     tqdm_kwargs["lock_name"] = "mp_lock"
--> 130 return _executor_map(ProcessPoolExecutor, fn, *iterables, **tqdm_kwargs)

File /glusterfs/dfs-gfs-dist/kormanav/miniconda3/envs/graphein-gpu/lib/python3.8/site-packages/tqdm/contrib/concurrent.py:76, in _executor_map(PoolExecutor, fn, *iterables, **tqdm_kwargs)
     74     map_args.update(chunksize=chunksize)
     75 with PoolExecutor(**pool_kwargs) as ex:
---> 76     return list(tqdm_class(ex.map(fn, *iterables, **map_args), **kwargs))

File /glusterfs/dfs-gfs-dist/kormanav/miniconda3/envs/graphein-gpu/lib/python3.8/site-packages/tqdm/notebook.py:258, in tqdm_notebook.__iter__(self)
    256 try:
    257     it = super(tqdm_notebook, self).__iter__()
--> 258     for obj in it:
    259         # return super(tqdm...) will not catch exception
    260         yield obj
    261 # NB: except ... [ as ...] breaks IPython async KeyboardInterrupt

File /glusterfs/dfs-gfs-dist/kormanav/miniconda3/envs/graphein-gpu/lib/python3.8/site-packages/tqdm/std.py:1195, in tqdm.__iter__(self)
   1192 time = self._time
   1194 try:
-> 1195     for obj in iterable:
   1196         yield obj
   1197         # Update and possibly print the progressbar.
   1198         # Note: does not call self.update(1) for speed optimisation.

File /glusterfs/dfs-gfs-dist/kormanav/miniconda3/envs/graphein-gpu/lib/python3.8/concurrent/futures/process.py:484, in _chain_from_iterable_of_lists(iterable)
    478 def _chain_from_iterable_of_lists(iterable):
    479     """
    480     Specialized implementation of itertools.chain.from_iterable.
    481     Each item in *iterable* should be a list.  This function is
    482     careful not to keep references to yielded objects.
    483     """
--> 484     for element in iterable:
    485         element.reverse()
    486         while element:

File /glusterfs/dfs-gfs-dist/kormanav/miniconda3/envs/graphein-gpu/lib/python3.8/concurrent/futures/_base.py:619, in Executor.map.<locals>.result_iterator()
    616 while fs:
    617     # Careful not to keep a reference to the popped future
    618     if timeout is None:
--> 619         yield fs.pop().result()
    620     else:
    621         yield fs.pop().result(end_time - time.monotonic())

File /glusterfs/dfs-gfs-dist/kormanav/miniconda3/envs/graphein-gpu/lib/python3.8/concurrent/futures/_base.py:444, in Future.result(self, timeout)
    442     raise CancelledError()
    443 elif self._state == FINISHED:
--> 444     return self.__get_result()
    445 else:
    446     raise TimeoutError()

File /glusterfs/dfs-gfs-dist/kormanav/miniconda3/envs/graphein-gpu/lib/python3.8/concurrent/futures/_base.py:389, in Future.__get_result(self)
    387 if self._exception:
    388     try:
--> 389         raise self._exception
    390     finally:
    391         # Break a reference cycle with the exception in self._exception
    392         self = None

BrokenProcessPool: A process in the process pool was terminated abruptly while the future was running or pending.

To Reproduce

Simplified steps to reproduce the behavior:

import graphein.protein as gp

conf_functions = {"edge_construction_functions": [gp.add_peptide_bonds],
                  "graph_metadata_functions": [gp.esm_sequence_embedding],       
                  "node_metadata_functions": [gp.amino_acid_one_hot]
                 }  

pg_config = gp.ProteinGraphConfig(**conf_functions)

graph_dict = gp.construct_graphs_mp(
    pdb_code_it=['6rew', '7saf'],
    config=pg_config,
    num_cores=8
    )

Running this produces the error above.


a-r-j commented 2 years ago

Hey @avivko, thanks for the bug report. Are you running this in a notebook? I could reproduce your error in a JupyterLab notebook. However, running the code below as a script worked for me:


import graphein.protein as gp

conf_functions = {"edge_construction_functions": [gp.add_peptide_bonds],
                  "graph_metadata_functions": [gp.esm_sequence_embedding],       
                  "node_metadata_functions": [gp.amino_acid_one_hot]
                 }  

pg_config = gp.ProteinGraphConfig(**conf_functions)

if __name__ == "__main__":
    graph_dict = gp.construct_graphs_mp(
        pdb_code_it=['6rew', '7saf'],
        config=pg_config,
        num_cores=8
        )
    print(graph_dict)

I'm not an expert on multiprocessing, but I think guarding the entry point with if __name__ == "__main__": is often important. Perhaps someone more experienced could chime in on a possible resolution for notebook-based execution. The other potential issue I can see is a memory limit imposed by Jupyter. You could try increasing it.
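For context on the guard: under the spawn start method (the default on Windows and macOS), each worker process re-imports the main module, so pool creation that is not behind if __name__ == "__main__": runs again in every child. A minimal sketch of the pattern, using only the standard library; square and run_pool are hypothetical names and not part of graphein:

```python
from concurrent.futures import ProcessPoolExecutor


def square(x: int) -> int:
    """Toy worker, defined at module level so child processes can import it."""
    return x * x


def run_pool(values, max_workers: int = 2):
    """Map the worker over the inputs in a process pool, preserving order."""
    with ProcessPoolExecutor(max_workers=max_workers) as executor:
        return list(executor.map(square, values))


if __name__ == "__main__":
    # Only the process that was started as the main program reaches this line;
    # children that re-import this module skip it and do not start new pools.
    print(run_pool([1, 2, 3]))
```

On Linux the default start method has historically been fork, where the guard is less critical, so this pattern alone may not explain the notebook failure.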

avivko commented 2 years ago

@a-r-j good to know that you were able to run it outside of a notebook environment! Running it as a script works on my machine too. However, I am working on a GPU server with more than 700 GB of RAM, and setting --NotebookApp.max_buffer_size=100000000000 didn't help when I ran it in a notebook.
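As a stopgap while the notebook failure is unresolved, a pool call can be wrapped so that a BrokenProcessPool triggers a serial rerun. The sketch below is generic standard-library code; map_with_serial_fallback and its executor_cls parameter are hypothetical names, and it assumes the work can also run serially in the current process:

```python
from concurrent.futures import ProcessPoolExecutor
from concurrent.futures.process import BrokenProcessPool


def map_with_serial_fallback(fn, items, max_workers=2,
                             executor_cls=ProcessPoolExecutor):
    """Try a process pool first; if the pool breaks, redo everything serially.

    executor_cls is a hypothetical hook so a different executor can be
    swapped in (for example in tests).
    """
    items = list(items)  # materialise so the inputs can be reused on fallback
    try:
        with executor_cls(max_workers=max_workers) as executor:
            return list(executor.map(fn, items))
    except BrokenProcessPool:
        # A worker died abruptly. Partial results are discarded and every
        # item is recomputed in the current process.
        return [fn(item) for item in items]
```

With graphein this would mean looping over the PDB codes and building one graph at a time instead of calling construct_graphs_mp. If the worker died because the operating system killed it (for example for memory), the serial run may fail too.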

a-r-j commented 2 years ago

I think this is a Jupyter/IPython problem @avivko.

I tried adding the code as a separate module:

# test.py
import graphein.protein as gp

def worker():
    conf_functions = {
        "edge_construction_functions": [gp.add_peptide_bonds],
        "graph_metadata_functions": [gp.esm_sequence_embedding],
        "node_metadata_functions": [gp.amino_acid_one_hot],
    }
    pg_config = gp.ProteinGraphConfig(**conf_functions)
    graph_dict = gp.construct_graphs_mp(
        pdb_code_it=["6rew", "7saf"], config=pg_config, num_cores=8
    )
    print(graph_dict)

and then in a notebook running:

from test import worker  # test.py must be in the notebook's working directory

worker()

which also worked.

I think this discussion explains the issue.
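When comparing the script, module, and notebook cases, it can help to record how each session is set up. A small diagnostic sketch using only the standard library; describe_mp_environment is a hypothetical helper, and the IPython check is a heuristic (it only tests whether the package has been imported):

```python
import multiprocessing as mp
import sys


def describe_mp_environment() -> dict:
    """Collect a few facts that commonly differ between scripts and notebooks."""
    main_module = sys.modules.get("__main__")
    return {
        # "fork", "spawn" or "forkserver", depending on platform and Python version
        "start_method": mp.get_start_method(),
        # Whether __main__ is backed by a file on disk
        "main_has_file": hasattr(main_module, "__file__"),
        # Heuristic only: IPython has been imported somewhere in this process
        "ipython_imported": "IPython" in sys.modules,
    }


if __name__ == "__main__":
    print(describe_mp_environment())
```

In a plain script main_has_file is normally True, while in a notebook kernel __main__ typically has no __file__.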

EDIT: Importing the mp function explicitly works in a notebook.

import graphein.protein as gp
from graphein.protein.graphs import construct_graphs_mp

conf_functions = {
    "edge_construction_functions": [gp.add_peptide_bonds],
    "graph_metadata_functions": [gp.esm_sequence_embedding],
    "node_metadata_functions": [gp.amino_acid_one_hot],
}
pg_config = gp.ProteinGraphConfig(**conf_functions)
graph_dict = construct_graphs_mp(    # Note we are not using `gp.construct_graphs_mp`
    pdb_code_it=["6rew", "7saf"], config=pg_config, num_cores=8
)
print(graph_dict)