PyTorch Internal Assert Failed

ballaneypranav commented 2 years ago

Hi, congratulations on your work, it is a very interesting approach and the results are amazing!

I was able to run the PDBbind examples, but I see the following error with other input files:

Failed on ['data/protein.pdb____data/ligands/10005.sdf'] tensor_type->scalarType().has_value() INTERNAL ASSERT FAILED at "/opt/conda/conda-bld/pytorch_1659484809662/work/torch/csrc/jit/codegen/cuda/type_promotion.cpp":111, please report a bug to PyTorch. Missing Scalar Type information

Do you have any idea what might be wrong?

gcorso commented 2 years ago

Hi @ballaneypranav

Could you add a line `raise e` at this point in the inference script (https://github.com/gcorso/DiffDock/blob/f8d67b5b2b30b72eedd010e76accc1a306ee605f/inference.py#L204) and rerun with your input file? That will print the full stack trace of the error, so we can see where the problem originates.
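For readers following along, the suggestion amounts to re-raising inside the loop's exception handler so that the traceback is no longer swallowed. Below is a minimal sketch of that pattern. The names `run_all`, `process`, `debug`, and `always_fails` are hypothetical and are not taken from inference.py.

```python
def run_all(names, process, debug=False):
    """Run process(name) for each name and count failures instead of stopping.

    With debug=True the first exception is re-raised, which mirrors the effect
    of adding `raise e` to the handler. All names here are illustrative only.
    """
    failures = 0
    for name in names:
        try:
            process(name)
        except Exception as e:
            # Without a re-raise, only this one-line summary is ever shown.
            print("Failed on", name, e)
            failures += 1
            if debug:
                raise e  # surfaces the full stack trace and stops the run
    return failures


def always_fails(name):
    """Stand-in for a step that raises, used only to exercise the sketch."""
    raise RuntimeError("Missing Scalar Type information")


if __name__ == "__main__":
    # Two failures are counted and the loop still finishes.
    print(run_all(["complex_a", "complex_b"], always_fails))
```

With `debug=False` the loop keeps going and only reports a count, which is convenient for batch runs but hides where the error came from.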

ballaneypranav commented 2 years ago

Thank you for your response. DiffDock works as expected on a CPU, but when I try to use a GPU, I see a warning that codegen failed and a fallback path was taken. This is the full output:

(diffdock) Singularity> python -m inference --protein_ligand_csv data/protein_ligand.csv --out_dir data/output --inference_steps 20 --samples_per_complex 40 --batch_size 10 --actual_steps 18 --no_final_step_noise
100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 201/201 [01:14<00:00,  2.72it/s]
100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 201/201 [01:24<00:00,  2.38it/s]
/anvil/projects/x-cis220051/corporate/atom/data/dl_htvs/folr2-diffdock/DiffDock/utils/torus.py:38: RuntimeWarning: invalid value encountered in divide
  score_ = grad(x, sigma[:, None], N=100) / p_
Reading molecules and generating local structures with RDKit
1it [00:00, 13.51it/s]
Reading language model embeddings.
Generating graphs for ligands and proteins
loading complexes: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00,  3.17it/s]
loading data from memory:  data/cache_torsion/limit0_INDEX_maxLigSizeNone_H0_recRad15.0_recMax24_esmEmbeddings2838843221/heterographs.pkl
Number of complexes:  1
radius protein: mean 26.61678695678711, std 0.0, max 26.61678695678711
radius molecule: mean 9.061843872070312, std 0.0, max 9.061843872070312
distance protein-mol: mean 82.87322235107422, std 0.0, max 82.87322235107422
rmsd matching: mean 0.0, std 0.0, max 0
HAPPENING | confidence model uses different type of graphs than the score model. Loading (or creating if not existing) the data for the confidence model now.
Reading molecules and generating local structures with RDKit
1it [00:00, 18.67it/s]
Reading language model embeddings.
Generating graphs for ligands and proteins
loading complexes: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00,  1.02it/s]
loading data from memory:  data/cache_torsion_allatoms/limit0_INDEX_maxLigSizeNone_H0_recRad15.0_recMax24_atomRad5_atomMax8_esmEmbeddings2838843221/heterographs.pkl
Number of complexes:  1
radius protein: mean 26.61678695678711, std 0.0, max 26.61678695678711
radius molecule: mean 8.859946250915527, std 0.0, max 8.859946250915527
distance protein-mol: mean 82.7789077758789, std 0.0, max 82.7789077758789
rmsd matching: mean 0.0, std 0.0, max 0
common t schedule [1.   0.95 0.9  0.85 0.8  0.75 0.7  0.65 0.6  0.55 0.5  0.45 0.4  0.35
 0.3  0.25 0.2  0.15 0.1  0.05]
Size of test dataset:  1
0it [00:00, ?it/s]/opt/conda/envs/diffdock/lib/python3.9/site-packages/e3nn/o3/_spherical_harmonics.py:82: UserWarning: FALLBACK path has been taken inside: compileCudaFusionGroup. This is an indication that codegen Failed for some reason.
To debug try disable codegen fallback path via setting the env variable `export PYTORCH_NVFUSER_DISABLE=fallback`
To report the issue, try enable logging via setting the envvariable ` export PYTORCH_JIT_LOG_LEVEL=manager.cpp`
 (Triggered internally at  /opt/conda/conda-bld/pytorch_1659484809662/work/torch/csrc/jit/codegen/cuda/manager.cpp:237.)
  sh = _spherical_harmonics(self._lmax, x[..., 0], x[..., 1], x[..., 2])
1it [00:58, 58.41s/it]
Failed for 0 complexes
Skipped 0 complexes
Results are in data/output

After setting `PYTORCH_NVFUSER_DISABLE=fallback` and `PYTORCH_JIT_LOG_LEVEL=manager.cpp` and adding `raise e` to inference.py, I see the following output:

(diffdock) Singularity> python -m inference --protein_ligand_csv data/protein_ligand.csv --out_dir data/output --inference_steps 20 --samples_per_complex 40 --batch_size 10 --actual_steps 18 --no_final_step_noise
100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 201/201 [01:14<00:00,  2.71it/s]
100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 201/201 [01:24<00:00,  2.38it/s]
/anvil/projects/x-cis220051/corporate/atom/data/dl_htvs/folr2-diffdock/DiffDock/utils/torus.py:38: RuntimeWarning: invalid value encountered in divide
  score_ = grad(x, sigma[:, None], N=100) / p_
Reading molecules and generating local structures with RDKit
1it [00:00, 15.58it/s]
Reading language model embeddings.
Generating graphs for ligands and proteins
loading complexes: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00,  1.05it/s]
loading data from memory:  data/cache_torsion/limit0_INDEX_maxLigSizeNone_H0_recRad15.0_recMax24_esmEmbeddings2838843221/heterographs.pkl
Number of complexes:  1
radius protein: mean 26.61678695678711, std 0.0, max 26.61678695678711
radius molecule: mean 9.061843872070312, std 0.0, max 9.061843872070312
distance protein-mol: mean 82.87322235107422, std 0.0, max 82.87322235107422
rmsd matching: mean 0.0, std 0.0, max 0
HAPPENING | confidence model uses different type of graphs than the score model. Loading (or creating if not existing) the data for the confidence model now.
Reading molecules and generating local structures with RDKit
1it [00:00, 19.07it/s]
Reading language model embeddings.
Generating graphs for ligands and proteins
loading complexes: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00,  1.32it/s]
loading data from memory:  data/cache_torsion_allatoms/limit0_INDEX_maxLigSizeNone_H0_recRad15.0_recMax24_atomRad5_atomMax8_esmEmbeddings2838843221/heterographs.pkl
Number of complexes:  1
radius protein: mean 26.61678695678711, std 0.0, max 26.61678695678711
radius molecule: mean 8.859946250915527, std 0.0, max 8.859946250915527
distance protein-mol: mean 82.7789077758789, std 0.0, max 82.7789077758789
rmsd matching: mean 0.0, std 0.0, max 0
common t schedule [1.   0.95 0.9  0.85 0.8  0.75 0.7  0.65 0.6  0.55 0.5  0.45 0.4  0.35
 0.3  0.25 0.2  0.15 0.1  0.05]
Size of test dataset:  1
0it [00:00, ?it/s]Failed on ['data/4kmz_protein_only.pdb____data/FOL_model.sdf'] tensor_type->scalarType().has_value() INTERNAL ASSERT FAILED at "/opt/conda/conda-bld/pytorch_1659484809662/work/torch/csrc/jit/codegen/cuda/type_promotion.cpp":111, please report a bug to PyTorch. Missing Scalar Type information
0it [00:02, ?it/s]
Traceback (most recent call last):
  File "/opt/conda/envs/diffdock/lib/python3.9/runpy.py", line 197, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/opt/conda/envs/diffdock/lib/python3.9/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/anvil/projects/x-cis220051/corporate/atom/data/dl_htvs/folr2-diffdock/DiffDock/inference.py", line 205, in <module>
    raise e
  File "/anvil/projects/x-cis220051/corporate/atom/data/dl_htvs/folr2-diffdock/DiffDock/inference.py", line 165, in <module>
    data_list, confidence = sampling(data_list=data_list, model=model,
  File "/anvil/projects/x-cis220051/corporate/atom/data/dl_htvs/folr2-diffdock/DiffDock/utils/sampling.py", line 56, in sampling
    tr_score, rot_score, tor_score = model(complex_graph_batch)
  File "/opt/conda/envs/diffdock/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/anvil/projects/x-cis220051/corporate/atom/data/dl_htvs/folr2-diffdock/DiffDock/models/score_model.py", line 247, in forward
    rec_node_attr, rec_edge_index, rec_edge_attr, rec_edge_sh = self.build_rec_conv_graph(data)
  File "/anvil/projects/x-cis220051/corporate/atom/data/dl_htvs/folr2-diffdock/DiffDock/models/score_model.py", line 376, in build_rec_conv_graph
    edge_sh = o3.spherical_harmonics(self.sh_irreps, edge_vec, normalize=True, normalization='component')
  File "/opt/conda/envs/diffdock/lib/python3.9/site-packages/e3nn/o3/_spherical_harmonics.py", line 180, in spherical_harmonics
    return sh(x)
  File "/opt/conda/envs/diffdock/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/opt/conda/envs/diffdock/lib/python3.9/site-packages/e3nn/o3/_spherical_harmonics.py", line 82, in forward
    sh = _spherical_harmonics(self._lmax, x[..., 0], x[..., 1], x[..., 2])
RuntimeError: tensor_type->scalarType().has_value() INTERNAL ASSERT FAILED at "/opt/conda/conda-bld/pytorch_1659484809662/work/torch/csrc/jit/codegen/cuda/type_promotion.cpp":111, please report a bug to PyTorch. Missing Scalar Type information
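The two environment variables used for this run come straight from the warning text in the first log. If the script is launched from Python rather than from a shell, they could be passed along as in the sketch below. The helper name `nvfuser_debug_env` is hypothetical, and whether the variables are honored depends on the installed PyTorch build.

```python
import os


def nvfuser_debug_env(base=None):
    """Return a copy of an environment mapping with the two debug variables set.

    The variable names and values are the ones printed in the UserWarning;
    the helper itself is an illustrative assumption, not part of DiffDock.
    """
    env = dict(os.environ if base is None else base)
    env["PYTORCH_NVFUSER_DISABLE"] = "fallback"    # fail loudly instead of falling back
    env["PYTORCH_JIT_LOG_LEVEL"] = "manager.cpp"   # enable logging for manager.cpp
    return env


if __name__ == "__main__":
    env = nvfuser_debug_env()
    for key in ("PYTORCH_NVFUSER_DISABLE", "PYTORCH_JIT_LOG_LEVEL"):
        print(key, "=", env[key])
```

The returned mapping can be handed to `subprocess.run(..., env=env)` when starting the inference command, which leaves the parent process's own environment untouched.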

ballaneypranav commented 2 years ago

Hi, I just wanted to add that the codegen failure also occurs on a standard Colab GPU. Here's part of the output with the warning message:

0it [00:00, ?it/s]/usr/local/lib/python3.7/dist-packages/e3nn/o3/_spherical_harmonics.py:82: UserWarning: FALLBACK path has been taken inside: compileCudaFusionGroup. This is an indication that codegen Failed for some reason.
To debug try disable codegen fallback path via setting the env variable `export PYTORCH_NVFUSER_DISABLE=fallback`
To report the issue, try enable logging via setting the envvariable ` export PYTORCH_JIT_LOG_LEVEL=manager.cpp`
 (Triggered internally at  ../torch/csrc/jit/codegen/cuda/manager.cpp:237.)
  sh = _spherical_harmonics(self._lmax, x[..., 0], x[..., 1], x[..., 2])
1it [03:40, 220.92s/it]
Failed for 0 complexes
Skipped 0 complexes

The fallback path signifies that the GPU is not being used, right?

ItamarChinn commented 1 year ago

@ballaneypranav, @gcorso, @HannesStark I had the exact same error. Unfortunately, I wasn't able to identify the cause, but I believe it is due to conflicting dependencies between PyTorch Geometric and PyTorch. Using a newer version of PyTorch should fix the issue; however, simply upgrading in place causes further conflicts with PyTorch Geometric. If you instead create a new environment and install only the required packages, you should avoid this error.

Note that a new user warning appears: `UserWarning: The TorchScript type system doesn't support instance-level annotations on empty non-base types in __init__. Instead, either 1) use a type annotation in the class body, or 2) wrap the type in torch.jit.Attribute.` I think it is also related to the previous issue, but it doesn't seem to change anything.

Install a new environment as follows (modify for your CUDA version):


```
# Assumes an environment named diffdock2 already exists,
# e.g. created with: conda create --name diffdock2
conda activate diffdock2
conda install pytorch pytorch-cuda=11.7 -c pytorch -c nvidia
pip install torch-scatter torch-sparse torch-cluster torch-spline-conv torch-geometric -f https://data.pyg.org/whl/torch-1.13.0+cu117.html
pip install PyYAML
python -m pip install scipy
pip install "networkx[default]"
pip install biopython
pip install rdkit-pypi
pip install e3nn
pip install spyrmsd
pip install pandas
pip install biopandas
```
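Since the wheel index URL has to match both the PyTorch version and the CUDA version, it may help to check what is actually installed and derive the URL from that. The sketch below uses only the standard library. The helper names are hypothetical, the URL pattern is copied from the pip command above, and an index page exists only for the version combinations that PyG publishes.

```python
from importlib import metadata


def installed_versions(packages):
    """Map each distribution name to its installed version, or None if absent."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


def pyg_wheel_index(torch_version, cuda_tag):
    """Build a wheel index URL following the pattern in the pip command above.

    A local suffix such as "+cu117" on torch_version is dropped before the
    CUDA tag is appended. This is an illustrative helper, not a PyG API.
    """
    base = torch_version.split("+")[0]
    return f"https://data.pyg.org/whl/torch-{base}+{cuda_tag}.html"


if __name__ == "__main__":
    print(installed_versions(["torch", "torch-geometric", "torch-scatter", "e3nn"]))
    print(pyg_wheel_index("1.13.0", "cu117"))
```

Comparing the reported `torch` version with the one embedded in the wheel URL is a quick way to spot the kind of mismatch suspected in this thread.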

HannesStark commented 1 year ago

Thanks @ItamarChinn! We updated the README accordingly.

gcorso / DiffDock

PyTorch Internal Assert Failed #43