Closed: ballaneypranav closed this issue 1 year ago.
Hi @ballaneypranav,
Could you add a line `raise e` at this point in the inference script, https://github.com/gcorso/DiffDock/blob/f8d67b5b2b30b72eedd010e76accc1a306ee605f/inference.py#L204, and rerun with your input file? That will print the full stack trace of the error so we can understand where the source of the problem is.
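For reference, a minimal, self-contained toy of the pattern this change produces (the function and variable names below are made up for illustration, not copied from the repository): the script normally logs the failure and moves on, and the added `raise e` re-raises so the full traceback is shown.

```python
# Toy demonstration of the "raise e" change (names are illustrative, not DiffDock code)
def sampling_step():
    raise RuntimeError("Missing Scalar Type information")  # stand-in for the real failure

complex_name = "example_complex"
try:
    sampling_step()
except Exception as e:
    print("Failed on", complex_name, e)  # what the script normally prints before continuing
    raise e                              # the added line: stop and show the full stack trace
```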
Thank you for your response. DiffDock works as expected on a CPU, but when I try to use a GPU, I see a warning that codegen failed and a fallback path was taken. This is the full output:
(diffdock) Singularity> python -m inference --protein_ligand_csv data/protein_ligand.csv --out_dir data/output --inference_steps 20 --samples_per_complex 40 --batch_size 10 --actual_steps 18 --no_final_step_noise
100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 201/201 [01:14<00:00, 2.72it/s]
100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 201/201 [01:24<00:00, 2.38it/s]
/anvil/projects/x-cis220051/corporate/atom/data/dl_htvs/folr2-diffdock/DiffDock/utils/torus.py:38: RuntimeWarning: invalid value encountered in divide
score_ = grad(x, sigma[:, None], N=100) / p_
Reading molecules and generating local structures with RDKit
1it [00:00, 13.51it/s]
Reading language model embeddings.
Generating graphs for ligands and proteins
loading complexes: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 3.17it/s]
loading data from memory: data/cache_torsion/limit0_INDEX_maxLigSizeNone_H0_recRad15.0_recMax24_esmEmbeddings2838843221/heterographs.pkl
Number of complexes: 1
radius protein: mean 26.61678695678711, std 0.0, max 26.61678695678711
radius molecule: mean 9.061843872070312, std 0.0, max 9.061843872070312
distance protein-mol: mean 82.87322235107422, std 0.0, max 82.87322235107422
rmsd matching: mean 0.0, std 0.0, max 0
HAPPENING | confidence model uses different type of graphs than the score model. Loading (or creating if not existing) the data for the confidence model now.
Reading molecules and generating local structures with RDKit
1it [00:00, 18.67it/s]
Reading language model embeddings.
Generating graphs for ligands and proteins
loading complexes: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 1.02it/s]
loading data from memory: data/cache_torsion_allatoms/limit0_INDEX_maxLigSizeNone_H0_recRad15.0_recMax24_atomRad5_atomMax8_esmEmbeddings2838843221/heterographs.pkl
Number of complexes: 1
radius protein: mean 26.61678695678711, std 0.0, max 26.61678695678711
radius molecule: mean 8.859946250915527, std 0.0, max 8.859946250915527
distance protein-mol: mean 82.7789077758789, std 0.0, max 82.7789077758789
rmsd matching: mean 0.0, std 0.0, max 0
common t schedule [1. 0.95 0.9 0.85 0.8 0.75 0.7 0.65 0.6 0.55 0.5 0.45 0.4 0.35
0.3 0.25 0.2 0.15 0.1 0.05]
Size of test dataset: 1
0it [00:00, ?it/s]/opt/conda/envs/diffdock/lib/python3.9/site-packages/e3nn/o3/_spherical_harmonics.py:82: UserWarning: FALLBACK path has been taken inside: compileCudaFusionGroup. This is an indication that codegen Failed for some reason.
To debug try disable codegen fallback path via setting the env variable `export PYTORCH_NVFUSER_DISABLE=fallback`
To report the issue, try enable logging via setting the envvariable ` export PYTORCH_JIT_LOG_LEVEL=manager.cpp`
(Triggered internally at /opt/conda/conda-bld/pytorch_1659484809662/work/torch/csrc/jit/codegen/cuda/manager.cpp:237.)
sh = _spherical_harmonics(self._lmax, x[..., 0], x[..., 1], x[..., 2])
1it [00:58, 58.41s/it]
Failed for 0 complexes
Skipped 0 complexes
Results are in data/output
After setting `PYTORCH_NVFUSER_DISABLE=fallback` and `PYTORCH_JIT_LOG_LEVEL=manager.cpp`, and adding `raise e` to `inference.py`, I see the following output:
(diffdock) Singularity> python -m inference --protein_ligand_csv data/protein_ligand.csv --out_dir data/output --inference_steps 20 --samples_per_complex 40 --batch_size 10 --actual_steps 18 --no_final_step_noise
100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 201/201 [01:14<00:00, 2.71it/s]
100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 201/201 [01:24<00:00, 2.38it/s]
/anvil/projects/x-cis220051/corporate/atom/data/dl_htvs/folr2-diffdock/DiffDock/utils/torus.py:38: RuntimeWarning: invalid value encountered in divide
score_ = grad(x, sigma[:, None], N=100) / p_
Reading molecules and generating local structures with RDKit
1it [00:00, 15.58it/s]
Reading language model embeddings.
Generating graphs for ligands and proteins
loading complexes: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 1.05it/s]
loading data from memory: data/cache_torsion/limit0_INDEX_maxLigSizeNone_H0_recRad15.0_recMax24_esmEmbeddings2838843221/heterographs.pkl
Number of complexes: 1
radius protein: mean 26.61678695678711, std 0.0, max 26.61678695678711
radius molecule: mean 9.061843872070312, std 0.0, max 9.061843872070312
distance protein-mol: mean 82.87322235107422, std 0.0, max 82.87322235107422
rmsd matching: mean 0.0, std 0.0, max 0
HAPPENING | confidence model uses different type of graphs than the score model. Loading (or creating if not existing) the data for the confidence model now.
Reading molecules and generating local structures with RDKit
1it [00:00, 19.07it/s]
Reading language model embeddings.
Generating graphs for ligands and proteins
loading complexes: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 1.32it/s]
loading data from memory: data/cache_torsion_allatoms/limit0_INDEX_maxLigSizeNone_H0_recRad15.0_recMax24_atomRad5_atomMax8_esmEmbeddings2838843221/heterographs.pkl
Number of complexes: 1
radius protein: mean 26.61678695678711, std 0.0, max 26.61678695678711
radius molecule: mean 8.859946250915527, std 0.0, max 8.859946250915527
distance protein-mol: mean 82.7789077758789, std 0.0, max 82.7789077758789
rmsd matching: mean 0.0, std 0.0, max 0
common t schedule [1. 0.95 0.9 0.85 0.8 0.75 0.7 0.65 0.6 0.55 0.5 0.45 0.4 0.35
0.3 0.25 0.2 0.15 0.1 0.05]
Size of test dataset: 1
0it [00:00, ?it/s]Failed on ['data/4kmz_protein_only.pdb____data/FOL_model.sdf'] tensor_type->scalarType().has_value() INTERNAL ASSERT FAILED at "/opt/conda/conda-bld/pytorch_1659484809662/work/torch/csrc/jit/codegen/cuda/type_promotion.cpp":111, please report a bug to PyTorch. Missing Scalar Type information
0it [00:02, ?it/s]
Traceback (most recent call last):
File "/opt/conda/envs/diffdock/lib/python3.9/runpy.py", line 197, in _run_module_as_main
return _run_code(code, main_globals, None,
File "/opt/conda/envs/diffdock/lib/python3.9/runpy.py", line 87, in _run_code
exec(code, run_globals)
File "/anvil/projects/x-cis220051/corporate/atom/data/dl_htvs/folr2-diffdock/DiffDock/inference.py", line 205, in <module>
raise e
File "/anvil/projects/x-cis220051/corporate/atom/data/dl_htvs/folr2-diffdock/DiffDock/inference.py", line 165, in <module>
data_list, confidence = sampling(data_list=data_list, model=model,
File "/anvil/projects/x-cis220051/corporate/atom/data/dl_htvs/folr2-diffdock/DiffDock/utils/sampling.py", line 56, in sampling
tr_score, rot_score, tor_score = model(complex_graph_batch)
File "/opt/conda/envs/diffdock/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
return forward_call(*input, **kwargs)
File "/anvil/projects/x-cis220051/corporate/atom/data/dl_htvs/folr2-diffdock/DiffDock/models/score_model.py", line 247, in forward
rec_node_attr, rec_edge_index, rec_edge_attr, rec_edge_sh = self.build_rec_conv_graph(data)
File "/anvil/projects/x-cis220051/corporate/atom/data/dl_htvs/folr2-diffdock/DiffDock/models/score_model.py", line 376, in build_rec_conv_graph
edge_sh = o3.spherical_harmonics(self.sh_irreps, edge_vec, normalize=True, normalization='component')
File "/opt/conda/envs/diffdock/lib/python3.9/site-packages/e3nn/o3/_spherical_harmonics.py", line 180, in spherical_harmonics
return sh(x)
File "/opt/conda/envs/diffdock/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
return forward_call(*input, **kwargs)
File "/opt/conda/envs/diffdock/lib/python3.9/site-packages/e3nn/o3/_spherical_harmonics.py", line 82, in forward
sh = _spherical_harmonics(self._lmax, x[..., 0], x[..., 1], x[..., 2])
RuntimeError: tensor_type->scalarType().has_value() INTERNAL ASSERT FAILED at "/opt/conda/conda-bld/pytorch_1659484809662/work/torch/csrc/jit/codegen/cuda/type_promotion.cpp":111, please report a bug to PyTorch. Missing Scalar Type information
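In case it helps with debugging, here is a sketch of two workarounds that might be worth trying (assumptions on my part, not confirmed fixes for DiffDock): disabling the nvFuser CUDA fuser through PyTorch's private toggle (present in the 1.12/1.13 series used here), or making sure the vectors passed to `o3.spherical_harmonics` carry an explicit float dtype, since the assert complains about missing scalar-type information. The `lmax` and tensor shapes below are illustrative.

```python
# Sketch only: possible workarounds for the nvFuser "Missing Scalar Type" assert
# (assumptions, not verified fixes for DiffDock)
import torch
from e3nn import o3

# Option 1: turn off the nvFuser CUDA fuser so the scripted spherical-harmonics
# code runs with ordinary eager CUDA kernels instead of a fused one.
torch._C._jit_set_nvfuser_enabled(False)

# Option 2: feed o3.spherical_harmonics vectors with an explicit float32 dtype.
device = "cuda" if torch.cuda.is_available() else "cpu"
sh_irreps = o3.Irreps.spherical_harmonics(lmax=2)          # lmax is illustrative
edge_vec = torch.randn(10, 3, device=device, dtype=torch.float32)
edge_sh = o3.spherical_harmonics(sh_irreps, edge_vec,
                                 normalize=True, normalization='component')
print(edge_sh.shape)
```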
Hi, I just wanted to add that the codegen failure also occurs on a standard Colab GPU. Here's the part of the output with the error message:
0it [00:00, ?it/s]/usr/local/lib/python3.7/dist-packages/e3nn/o3/_spherical_harmonics.py:82: UserWarning: FALLBACK path has been taken inside: compileCudaFusionGroup. This is an indication that codegen Failed for some reason.
To debug try disable codegen fallback path via setting the env variable `export PYTORCH_NVFUSER_DISABLE=fallback`
To report the issue, try enable logging via setting the envvariable ` export PYTORCH_JIT_LOG_LEVEL=manager.cpp`
(Triggered internally at ../torch/csrc/jit/codegen/cuda/manager.cpp:237.)
sh = _spherical_harmonics(self._lmax, x[..., 0], x[..., 1], x[..., 2])
1it [03:40, 220.92s/it]
Failed for 0 complexes
Skipped 0 complexes
The fallback path signifies that the GPU is not being used, right?
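For context, this is the generic check I would use to see whether PyTorch at least detects the GPU and whether a loaded model's parameters live on it (nothing DiffDock-specific, just a sketch; `model` stands for whichever model object is loaded):

```python
# Generic check that PyTorch sees a CUDA device and where model parameters live
import torch

print("CUDA available:", torch.cuda.is_available())
if torch.cuda.is_available():
    print("Device:", torch.cuda.get_device_name(0))

# With a model loaded, its parameters should report a CUDA device:
# print(next(model.parameters()).device)   # expected: cuda:0
```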
@ballaneypranav, @gcorso, @HannesStark I had the exact same error. Unfortunately I wasn't able to identify the root cause; I believe it comes from conflicting dependencies between PyTorch Geometric and PyTorch. Using a newer version of PyTorch should fix the issue, but simply upgrading in place will cause further conflicts with PyTorch Geometric. Instead, if you create a new environment and install only the required packages, you should avoid this error. Note that there is a new warning, `UserWarning: The TorchScript type system doesn't support instance-level annotations on empty non-base types in __init__. Instead, either 1) use a type annotation in the class body, or 2) wrap the type in torch.jit.Attribute.`, which I think is also related to the previous issue, but it doesn't seem to change anything.
Install a new environment as follows (modify for your CUDA version):
```
# create the environment first (the name "diffdock2" matches the activate command;
# Python 3.9 matches the environment shown in the traceback above)
conda create -n diffdock2 python=3.9
conda activate diffdock2
conda install pytorch pytorch-cuda=11.7 -c pytorch -c nvidia
pip install torch-scatter torch-sparse torch-cluster torch-spline-conv torch-geometric -f https://data.pyg.org/whl/torch-1.13.0+cu117.html
pip install PyYAML
python -m pip install scipy
pip install "networkx[default]"
pip install biopython
pip install rdkit-pypi
pip install e3nn
pip install spyrmsd
pip install pandas
pip install biopandas
```
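As a quick sanity check after the install (a generic sketch, assuming the new environment is active), the versions and CUDA visibility can be confirmed with:

```python
# Confirm the fresh environment: PyTorch, PyTorch Geometric and CUDA visibility
import torch
import torch_geometric

print("torch:", torch.__version__)
print("torch_geometric:", torch_geometric.__version__)
print("CUDA available:", torch.cuda.is_available())
if torch.cuda.is_available():
    print("GPU:", torch.cuda.get_device_name(0))
```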
Thanks @ItamarChinn! We updated the README accordingly.
Hi, congratulations on your work! It is a very interesting approach and the results are amazing.
I was able to run the PDBbind examples, but I see the following error with other input files:
Failed on ['data/protein.pdb____data/ligands/10005.sdf'] tensor_type->scalarType().has_value() INTERNAL ASSERT FAILED at "/opt/conda/conda-bld/pytorch_1659484809662/work/torch/csrc/jit/codegen/cuda/type_promotion.cpp":111, please report a bug to PyTorch. Missing Scalar Type information
Do you have any idea what might be wrong?