0ut0fcontrol / sharing

sharing files.
MIT License

Debugging PotentialNet in DGL #3

Open 0ut0fcontrol opened 3 years ago

0ut0fcontrol commented 3 years ago

By comparing the differences between the yuhang and jincai environments, I set up three conda environments, torch170, default, and rdkit2018, but none of them works. 🤣

Could you install my environment and see whether it runs, and whether you can pin down the problem?

envs.zip

| conda env | success | conda-forge | cuda | pytorch | cmd |
| --- | --- | --- | --- | --- | --- |
| yuhang | Yes | No | 10.2 | 1.7.0 | - |
| jincai | No | Yes | 10.2 | 1.7.1 | `conda create -n jincai -c conda-forge -c pytorch -c dglteam rdkit pytorch=1.7.0 torchvision cudatoolkit=10.2 dgl-cuda10.2 hyperopt dgllife` |
| torch170 | No | Yes | 10.2 | 1.7.0 | `conda create -n torch170 -c conda-forge -c pytorch -c dglteam rdkit pytorch=1.7.0 torchvision cudatoolkit=10.2 dgl-cuda10.2 hyperopt dgllife` |
| default | No | No | 10.2 | 1.7.1 | `conda create -n dgl102-default -c pytorch -c dglteam pytorch torchvision cudatoolkit=10.2 dgl-cuda10.2 dgllife -y; conda install -c rdkit rdkit` |
| rdkit2018 | No | No | 10.2 | 1.7.1 | `conda create -n dgl102-rdkit -c rdkit -c pytorch -c dglteam pytorch torchvision cudatoolkit=10.2 dgl-cuda10.2 dgllife rdkit=2018.09.3 -y` |
0ut0fcontrol commented 3 years ago
python examples/binding_affinity_prediction/main.py -m PotentialNet -d PDBBind_refined_pocket_random -v v2015

will raise a RuntimeError:

File "/pubhome/jcyang/git/handbook/Contribute/issue64-potentialnet/dgl-lifesci/examples/binding_affinity_prediction/dgllife/model/model_zoo/potentialnet.py", line 180, in forward eids = graph.edata['e'][:,i].nonzero(as_tuple=False).view(-1).type(graph.idtype) RuntimeError: CUDA error: an illegal memory access was encountered terminate called after throwing an instance of 'dmlc::Error' what(): [16:33:13] /opt/dgl/src/runtime/cuda/cuda_device_api.cc:103: Check failed: e == cudaSuccess || e == cudaErrorCudartUnloading: CUDA: an illegal memory access was encountered

This CUDA RuntimeError disappears when n_spatial_conv_steps is changed from 2 to 1. The CUDA, PyTorch, and DGL versions have nothing to do with this error.
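To narrow this down outside the full training script, a minimal repro along the following lines could be tried; the toy graph, the number of edge types, and the one-hot edge feature `e` are assumptions that only mimic the shapes used by the failing line:

```python
# Minimal repro sketch (the toy graph and edge feature are assumptions, not the real data):
# run the expression from potentialnet.py line 180 on the GPU and synchronize so that
# any asynchronous CUDA error surfaces at this exact line.
import torch
import dgl

device = torch.device('cuda:0')
num_edge_types = 5

g = dgl.rand_graph(100, 500).to(device)
g.edata['e'] = torch.nn.functional.one_hot(
    torch.randint(0, num_edge_types, (g.num_edges(),), device=device),
    num_edge_types).float()

for i in range(num_edge_types):
    # The expression reported in the traceback.
    eids = g.edata['e'][:, i].nonzero(as_tuple=False).view(-1).type(g.idtype)
    torch.cuda.synchronize()
    print(i, eids.shape)
```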

Reported to @cyhFlight; waiting for his response.

0ut0fcontrol commented 3 years ago

I updated both the server (GTX 1080 Ti) and my own machine (GTX 750 Ti) to dgl-cuda11.0. The GTX 750 Ti can now run the refined set, while the GTX 1080 Ti still fails with the same error.

Changing n_spatial_conv_steps in the refined-set config from 2 to 1 makes it run on both machines, in every conda environment.

So the problem seems to be with the 1080 Ti.

mufeili commented 3 years ago

Do the two cards have the same amount of GPU memory? Also, is there only a single GPU in both the server and your own machine? Could it be running OOM when n_spatial_conv_steps is 2? Quite a few people have hit illegal memory access in PyTorch for various reasons; have you tried the suggestions from those threads?
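To rule out OOM, one way would be to log peak CUDA memory around the forward pass in main.py; `forward_with_memory_report` below is just a hypothetical helper sketched for that purpose, not part of the example code:

```python
import torch

def forward_with_memory_report(model, bigraph_canonical, knn_graph, device_id=0):
    """Run one forward pass and print peak CUDA memory, to check for a hidden OOM."""
    torch.cuda.reset_peak_memory_stats(device_id)
    prediction = model(bigraph_canonical, knn_graph)  # the same call as in run_a_train_epoch
    torch.cuda.synchronize(device_id)
    peak_mib = torch.cuda.max_memory_allocated(device_id) / 1024 ** 2
    total_mib = torch.cuda.get_device_properties(device_id).total_memory / 1024 ** 2
    print(f'peak CUDA memory: {peak_mib:.0f} MiB of {total_mib:.0f} MiB')
    return prediction
```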

0ut0fcontrol commented 3 years ago

server:

[jcyang@k223:~]$ nvidia-smi 
Fri Dec 18 16:08:15 2020       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 455.23.05    Driver Version: 455.23.05    CUDA Version: 11.1     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  GeForce GTX 108...  Off  | 00000000:01:00.0 Off |                  N/A |
|  0%   51C    P8    18W / 250W |      0MiB / 11178MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   1  GeForce GTX 108...  Off  | 00000000:06:00.0 Off |                  N/A |
|  0%   38C    P8     9W / 250W |      0MiB / 11178MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+
[jcyang@k223:~]$ 

local:

(dgl110) [jcyang@x038:binding_affinity_prediction]$ nvidia-smi 
Fri Dec 18 16:09:13 2020       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 455.23.05    Driver Version: 455.23.05    CUDA Version: 11.1     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  GeForce GTX 750 Ti  Off  | 00000000:01:00.0  On |                  N/A |
| 40%   39C    P8     1W /  58W |    308MiB /  2000MiB |      2%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A      2019      G   /usr/libexec/Xorg                 229MiB |
|    0   N/A  N/A      2201      G   /usr/bin/gnome-shell               34MiB |
|    0   N/A  N/A      2983      G   ...lib/vmware/bin/vmware-vmx       37MiB |
+-----------------------------------------------------------------------------+
(dgl110) [jcyang@x038:binding_affinity_prediction]$ 

The setups are as above. On the server I force GPU 0 to be used:

CUDA_VISIBLE_DEVICES=0 python main.py -m PotentialNet -d PDBBind_refined_pocket_random -v v2015
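As a quick sanity check (not from the thread) that `CUDA_VISIBLE_DEVICES=0` takes effect, one could run:

```python
import torch

# With CUDA_VISIBLE_DEVICES=0 only the first card should be visible.
print(torch.cuda.device_count())      # expected: 1
print(torch.cuda.get_device_name(0))  # expected: the GTX 1080 Ti
```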

When running locally, because the GPU memory is small, I changed batch_size from 40 to 16; it runs with both n_spatial_conv_steps=1 and 2.

On the server, memory is definitely sufficient: n_spatial_conv_steps=1 runs fine, but n_spatial_conv_steps=2 fails.

It fails at the same place every time: https://github.com/cyhFlight/dgl-lifesci/blob/42a3734e47f92c436610340ae7f0e12826e23f4f/python/dgllife/model/model_zoo/potentialnet.py#L180

(dgl110) [jcyang@k223:binding_affinity_prediction]$ CUDA_VISIBLE_DEVICES=0 python main.py -m PotentialNet -d PDBBind_refined_pocket_random -v v2015
Using backend: pytorch
model: PotentialNet
dataset: PDBBind
version: v2015
exp: PotentialNet_PDBBind_refined_pocket_random
subset: refined
load_binding_pocket: True
random_seed: 123
frac_train: 0.8
frac_val: 0.2
frac_test: 0.0
batch_size: 40
shuffle: False
max_num_neighbors: 5
distance_bins: [1.5, 2.5, 3.5, 4.5]
f_in: 40
f_bond: 73
f_gather: 90
f_spatial: 90
n_rows_fc: [32]
n_bond_conv_steps: 2
n_spatial_conv_steps: 2
dropouts: [0.25, 0.25, 0.25]
lr: 0.001
num_epochs: 2
wd: 1e-07
metrics: ['r2', 'mae']
split: random

Loading ligands...
Loading proteins...
Finished cleaning the dataset, got 3706/3706 valid pairs
Start constructing graphs and featurizing them.
Done constructing 3706 graphs.
Loading ligands...
Loading proteins...
Finished cleaning the dataset, got 195/195 valid pairs
Start constructing graphs and featurizing them.
Done constructing 195 graphs.
epoch 1/2, training | loss 0.9281, r2 0.1032, mae 1.5604
validation results, r2 0.1958, mae 1.5277
test results, r2 0.0001, mae 2.1811

Traceback (most recent call last):
  File "main.py", line 153, in <module>
    main(args)
  File "main.py", line 109, in main
    train_scores = run_a_train_epoch(args, epoch, model, train_loader, loss_fn, optimizer)
  File "main.py", line 33, in run_a_train_epoch
    prediction = model(bigraph_canonical, knn_graph)
  File "/tmp/jincai_conda3/envs/dgl110/lib/python3.7/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/pubhome/jcyang/git/handbook/Contribute/issue64-potentialnet/dgl-lifesci/examples/binding_affinity_prediction/dgllife/model/model_zoo/potentialnet.py", line 57, in forward
    h = self.stage_2_model(graph=knn_graph, feat=h)
  File "/tmp/jincai_conda3/envs/dgl110/lib/python3.7/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/pubhome/jcyang/git/handbook/Contribute/issue64-potentialnet/dgl-lifesci/examples/binding_affinity_prediction/dgllife/model/model_zoo/potentialnet.py", line 180, in forward
    eids = graph.edata['e'][:,i].nonzero(as_tuple=False).view(-1).type(graph.idtype)
RuntimeError: CUDA error: an illegal memory access was encountered
terminate called after throwing an instance of 'dmlc::Error'
  what():  [14:29:37] /opt/dgl/src/runtime/cuda/cuda_device_api.cc:103: Check failed: e == cudaSuccess || e == cudaErrorCudartUnloading: CUDA: an illegal memory access was encountered
Stack trace:
  [bt] (0) /tmp/jincai_conda3/envs/dgl110/lib/python3.7/site-packages/dgl/libdgl.so(dmlc::LogMessageFatal::~LogMessageFatal()+0x4f) [0x7fff07179c8f]
  [bt] (1) /tmp/jincai_conda3/envs/dgl110/lib/python3.7/site-packages/dgl/libdgl.so(dgl::runtime::CUDADeviceAPI::FreeDataSpace(DLContext, void*)+0x15c) [0x7fff079957bc]
  [bt] (2) /tmp/jincai_conda3/envs/dgl110/lib/python3.7/site-packages/dgl/libdgl.so(dgl::runtime::NDArray::Internal::DefaultDeleter(dgl::runtime::NDArray::Container*)+0x1ad) [0x7fff0785750d]
  [bt] (3) /tmp/jincai_conda3/envs/dgl110/lib/python3.7/site-packages/dgl/libdgl.so(dgl::UnitGraph::COO::~COO()+0x127) [0x7fff079703e7]
  [bt] (4) /tmp/jincai_conda3/envs/dgl110/lib/python3.7/site-packages/dgl/libdgl.so(dgl::UnitGraph::~UnitGraph()+0x1ba) [0x7fff0797017a]
  [bt] (5) /tmp/jincai_conda3/envs/dgl110/lib/python3.7/site-packages/dgl/libdgl.so(dgl::HeteroGraph::~HeteroGraph()+0x119) [0x7fff0786f2a9]
  [bt] (6) /tmp/jincai_conda3/envs/dgl110/lib/python3.7/site-packages/dgl/libdgl.so(DGLObjectFree+0xb5) [0x7fff0782e545]
  [bt] (7) /tmp/jincai_conda3/envs/dgl110/lib/python3.7/site-packages/dgl/_ffi/_cy3/core.cpython-37m-x86_64-linux-gnu.so(+0x1023d) [0x7fff06c3c23d]
  [bt] (8) python(+0x16bbc6) [0x5555556bfbc6]

Aborted
(dgl110) [jcyang@k223:binding_affinity_prediction]$

Illegal memory access is really common. I just confirmed that it may be a problem with our server; I haven't looked into it carefully yet.

mufeili commented 3 years ago

Can you step in with a debugger, e.g. ipdb, and take a look? Is it the case that graph.edata['e'] does not raise an error while graph.edata['e'][:,i] does?
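A sketch of that debugging step (assumes `ipdb` is installed; these lines are meant to be dropped into PotentialNet's forward just above potentialnet.py line 180, not run standalone):

```python
# Run training with CUDA_LAUNCH_BLOCKING=1 so CUDA errors surface at the real line,
# then break here and evaluate the two expressions separately.
import ipdb; ipdb.set_trace()

e = graph.edata['e']             # does reading the edge feature already fail?
col = e[:, i]                    # or only the column indexing?
eids = col.nonzero(as_tuple=False).view(-1).type(graph.idtype)  # the original line 180
```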

0ut0fcontrol commented 3 years ago

Yes.

0ut0fcontrol commented 3 years ago

@mufeili @cyhFlight With PyTorch compiled from source it runs successfully, and training for 100 epochs raises no error. The likely cause before was that the server's CentOS 7 is too old, so the conda-built packages were incompatible and triggered the error.
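To compare the source-built and conda stacks on the old CentOS 7 server, a small diagnostic like the following could be run in each environment (just a sketch, not part of the original report):

```python
import platform
import torch
import dgl

# Build/runtime details that can differ between a source-built and a conda-built stack.
print('python     :', platform.python_version())
print('glibc      :', platform.libc_ver())
print('torch      :', torch.__version__, '| CUDA', torch.version.cuda)
print('dgl        :', dgl.__version__)
print('GPU        :', torch.cuda.get_device_name(0))
print('capability :', torch.cuda.get_device_capability(0))
```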

The command and the environment test results are below:

cd dgl-lifesci/examples/binding_affinity_prediction
# n_spatial_conv_steps: 2
CUDA_VISIBLE_DEVICES=0 python main.py -m PotentialNet -d PDBBind_refined_pocket_random -v v2015
server: CentOS7, GTX1080Ti, CUDA 11.1

| pytorch | dgl-cuda11.0 | python | success |
| --- | --- | --- | --- |
| source | pip | 3.6 | Yes |
| conda | pip | 3.7 | No |
| conda | conda | 3.7 | No |
| conda | conda | 3.6 | No |

PS: my local machine runs Fedora 33.

0ut0fcontrol commented 3 years ago

After downgrading PyTorch to 1.5, everything can be installed happily with conda. PyTorch 1.6 might also work; I haven't tried it.

conda create -n dgl92-py36-torch15 -c pytorch -c dglteam -c rdkit rdkit pytorch=1.5 torchvision cudatoolkit=9.2 dgl-cuda9.2 dgllife python=3.6
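A quick smoke test (a sketch, not from the thread) to confirm the downgraded environment before rerunning training:

```python
import torch
import dgl
import dgllife          # import check only
from rdkit import rdBase

# Expect a PyTorch 1.5.x build against CUDA 9.2 and a CUDA-enabled DGL.
print(torch.__version__, torch.version.cuda, torch.cuda.is_available())
print(dgl.__version__, rdBase.rdkitVersion)
```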

The accuracy reported by @cyhFlight can be reproduced.

Everything works on my side now. Thanks to both of you for your help. @cyhFlight @mufeili