Closed avivko closed 2 years ago
Hey @avivko thanks for the bug report. Are you running this in a notebook? I could reproduce your error in a JupyterLab notebook. However, running the below as a script worked for me:
import graphein.protein as gp

conf_functions = {
    "edge_construction_functions": [gp.add_peptide_bonds],
    "graph_metadata_functions": [gp.esm_sequence_embedding],
    "node_metadata_functions": [gp.amino_acid_one_hot],
}
pg_config = gp.ProteinGraphConfig(**conf_functions)

if __name__ == "__main__":
    graph_dict = gp.construct_graphs_mp(
        pdb_code_it=['6rew', '7saf'],
        config=pg_config,
        num_cores=8,
    )
    print(graph_dict)
I'm not an expert on multiprocessing, but I think guarding the entry point with `if __name__ == "__main__":` is often important. Perhaps someone more experienced could chime in on a possible resolution for notebook-based execution. The other potential issue I can see is that Jupyter may impose some memory limits. You could try increasing these.
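For anyone unfamiliar with the guard, here is a minimal standard-library sketch of the pattern. It does not use Graphein at all: `square` and `run_pool` are hypothetical names chosen for illustration. The idea is that worker processes may re-import the main module, so anything that starts new processes should sit behind the guard.

```python
import multiprocessing as mp


def square(x: int) -> int:
    """Toy task, defined at module level so worker processes can import it."""
    return x * x


def run_pool(values, num_workers=2):
    """Map `square` over `values` using a pool of worker processes."""
    with mp.Pool(processes=num_workers) as pool:
        return pool.map(square, values)


if __name__ == "__main__":
    # Only the parent process runs this block; a worker that re-imports
    # this module sees a different __name__ and skips it.
    print(run_pool([1, 2, 3, 4]))
```

Without the guard, start methods that re-import the main module (such as "spawn") would try to create a new pool in every worker.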
@a-r-j good to know that you were able to run it outside of a notebook environment! It seems to work like that on my machine too. However, I am working on a GPU server with >700 GB of RAM, and setting --NotebookApp.max_buffer_size=100000000000 didn't seem to help when I ran it in a notebook.
I think this is a jupyter/ipython problem @avivko.
I tried adding the code as a separate module:
# test.py
import graphein.protein as gp


def worker():
    conf_functions = {
        "edge_construction_functions": [gp.add_peptide_bonds],
        "graph_metadata_functions": [gp.esm_sequence_embedding],
        "node_metadata_functions": [gp.amino_acid_one_hot],
    }
    pg_config = gp.ProteinGraphConfig(**conf_functions)
    graph_dict = gp.construct_graphs_mp(
        pdb_code_it=["6rew", "7saf"], config=pg_config, num_cores=8
    )
    print(graph_dict)
and then in a notebook running:
from test import worker  # absolute import: a notebook is not a package, so `from .test` raises ImportError
worker()
which also worked.
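One possible reason the separate module helps (an assumption on my part, not a confirmed diagnosis for this issue) is that multiprocessing sends work to its workers with pickle, and pickle stores a function as a reference to its module and name instead of copying its code. The sketch below uses only the standard library; `is_picklable` is a hypothetical helper written for this illustration.

```python
import json
import pickle


def is_picklable(obj) -> bool:
    """Return True if `obj` can be serialised with pickle."""
    try:
        pickle.dumps(obj)
        return True
    except (pickle.PicklingError, AttributeError, TypeError):
        return False


# A function that lives in an importable module is pickled as a reference
# (module name plus function name), so another process can look it up.
payload = pickle.dumps(json.dumps)
print(b"json" in payload, b"dumps" in payload)

# A lambda has no importable name, so pickling it fails.
print(is_picklable(lambda x: x))
```

Objects created interactively in a notebook session live in `__main__`, which a freshly started worker may not be able to re-import, whereas anything defined in `test.py` can be found by name.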
I think this discussion explains the issue.
EDIT: Importing the mp function explicitly works in a notebook.
import graphein.protein as gp
from graphein.protein.graphs import construct_graphs_mp

conf_functions = {
    "edge_construction_functions": [gp.add_peptide_bonds],
    "graph_metadata_functions": [gp.esm_sequence_embedding],
    "node_metadata_functions": [gp.amino_acid_one_hot],
}
pg_config = gp.ProteinGraphConfig(**conf_functions)
graph_dict = construct_graphs_mp(  # Note we are not using `gp.construct_graphs_mp`
    pdb_code_it=["6rew", "7saf"], config=pg_config, num_cores=8
)
print(graph_dict)
Describe the bug
Constructing graphs with the multiprocessing method (construct_graphs_mp) does not work with ESM sequence embedding (i.e. with a config whose graph_metadata_functions includes esm_sequence_embedding).
Error message:
To Reproduce
Simplified steps to reproduce the behavior:
Running this produces the error above.