Open 0ut0fcontrol opened 3 years ago
python examples/binding_affinity_prediction/main.py -m PotentialNet -d PDBBind_refined_pocket_random -v v2015
raises a RuntimeError:
File "/pubhome/jcyang/git/handbook/Contribute/issue64-potentialnet/dgl-lifesci/examples/binding_affinity_prediction/dgllife/model/model_zoo/potentialnet.py", line 180, in forward
eids = graph.edata['e'][:,i].nonzero(as_tuple=False).view(-1).type(graph.idtype)
RuntimeError: CUDA error: an illegal memory access was encountered
terminate called after throwing an instance of 'dmlc::Error'
what(): [16:33:13] /opt/dgl/src/runtime/cuda/cuda_device_api.cc:103: Check failed: e == cudaSuccess || e == cudaErrorCudartUnloading: CUDA: an illegal memory access was encountered
This CUDA RuntimeError disappears when n_spatial_conv_steps is changed from 2 to 1.
The CUDA, PyTorch, and DGL versions do not appear to affect this error.
I have reported this to @cyhFlight and am waiting for his response.
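I updated both the server (GTX1080Ti) and my own computer (GTX750Ti) to dgl-cuda11.0. The GTX750Ti can run the refined set, while the GTX1080Ti still raises the same error.
After changing n_spatial_conv_steps in the refined set config from 2 to 1, both machines run successfully, regardless of the conda environment.
It looks like the 1080Ti is the problem.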
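Do your two cards have the same amount of GPU memory? Also, do the server and your own computer each have only one GPU? Could it be running out of memory (OOM) when n_spatial_conv_steps is 2? Quite a few PyTorch users have run into illegal memory access errors for various reasons. Have you tried the suggestions in those threads?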
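One first step that is often suggested for this kind of error is to rerun with CUDA_LAUNCH_BLOCKING=1. CUDA reports errors asynchronously, so the Python line in the traceback is not necessarily the call that caused the fault; forcing synchronous kernel launches usually brings the traceback closer to the real culprit. Below is a minimal sketch of such a launcher. The helper names `debug_env` and `run_training` are hypothetical, and it assumes `main.py` sits in the current directory.

```python
import os
import subprocess
import sys


def debug_env(gpu_index=0, base_env=None):
    """Return a copy of the environment with CUDA debugging variables set."""
    env = dict(os.environ if base_env is None else base_env)
    # Expose only one GPU so the run cannot land on a different card.
    env["CUDA_VISIBLE_DEVICES"] = str(gpu_index)
    # Make CUDA kernel launches synchronous so that an error surfaces at the
    # call that caused it rather than at a later, unrelated line.
    env["CUDA_LAUNCH_BLOCKING"] = "1"
    return env


def run_training(extra_args, gpu_index=0):
    """Launch main.py (assumed to be in the current directory) for debugging."""
    cmd = [sys.executable, "main.py"] + list(extra_args)
    return subprocess.run(cmd, env=debug_env(gpu_index)).returncode
```

For the run in this issue, the call would be `run_training(["-m", "PotentialNet", "-d", "PDBBind_refined_pocket_random", "-v", "v2015"])`. Expect the run to be noticeably slower with synchronous launches.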
server:
[jcyang@k223:~]$ nvidia-smi
Fri Dec 18 16:08:15 2020
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 455.23.05 Driver Version: 455.23.05 CUDA Version: 11.1 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 GeForce GTX 108... Off | 00000000:01:00.0 Off | N/A |
| 0% 51C P8 18W / 250W | 0MiB / 11178MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
| 1 GeForce GTX 108... Off | 00000000:06:00.0 Off | N/A |
| 0% 38C P8 9W / 250W | 0MiB / 11178MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
| No running processes found |
+-----------------------------------------------------------------------------+
[jcyang@k223:~]$
local:
(dgl110) [jcyang@x038:binding_affinity_prediction]$ nvidia-smi
Fri Dec 18 16:09:13 2020
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 455.23.05 Driver Version: 455.23.05 CUDA Version: 11.1 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 GeForce GTX 750 Ti Off | 00000000:01:00.0 On | N/A |
| 40% 39C P8 1W / 58W | 308MiB / 2000MiB | 2% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
| 0 N/A N/A 2019 G /usr/libexec/Xorg 229MiB |
| 0 N/A N/A 2201 G /usr/bin/gnome-shell 34MiB |
| 0 N/A N/A 2983 G ...lib/vmware/bin/vmware-vmx 37MiB |
+-----------------------------------------------------------------------------+
(dgl110) [jcyang@x038:binding_affinity_prediction]$
The configuration is shown above. I force the server to use GPU 0:
CUDA_VISIBLE_DEVICES=0 python main.py -m PotentialNet -d PDBBind_refined_pocket_random -v v2015
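On the local machine, because memory is limited, I changed batch_size from 40 to 16; it runs successfully with n_spatial_conv_steps set to either 1 or 2.
On the server, memory is definitely sufficient; n_spatial_conv_steps=1 runs successfully, while n_spatial_conv_steps=2 fails.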
(dgl110) [jcyang@k223:binding_affinity_prediction]$ CUDA_VISIBLE_DEVICES=0 python main.py -m PotentialNet -d PDBBind_refined_pocket_random -v v2015
Using backend: pytorch
model: PotentialNet
dataset: PDBBind
version: v2015
exp: PotentialNet_PDBBind_refined_pocket_random
subset: refined
load_binding_pocket: True
random_seed: 123
frac_train: 0.8
frac_val: 0.2
frac_test: 0.0
batch_size: 40
shuffle: False
max_num_neighbors: 5
distance_bins: [1.5, 2.5, 3.5, 4.5]
f_in: 40
f_bond: 73
f_gather: 90
f_spatial: 90
n_rows_fc: [32]
n_bond_conv_steps: 2
n_spatial_conv_steps: 2
dropouts: [0.25, 0.25, 0.25]
lr: 0.001
num_epochs: 2
wd: 1e-07
metrics: ['r2', 'mae']
split: random
Loading ligands...
Loading proteins...
Finished cleaning the dataset, got 3706/3706 valid pairs
Start constructing graphs and featurizing them.
Done constructing 3706 graphs.
Loading ligands...
Loading proteins...
Finished cleaning the dataset, got 195/195 valid pairs
Start constructing graphs and featurizing them.
Done constructing 195 graphs.
epoch 1/2, training | loss 0.9281, r2 0.1032, mae 1.5604
validation results, r2 0.1958, mae 1.5277
test results, r2 0.0001, mae 2.1811
Traceback (most recent call last):
File "main.py", line 153, in <module>
main(args)
File "main.py", line 109, in main
train_scores = run_a_train_epoch(args, epoch, model, train_loader, loss_fn, optimizer)
File "main.py", line 33, in run_a_train_epoch
prediction = model(bigraph_canonical, knn_graph)
File "/tmp/jincai_conda3/envs/dgl110/lib/python3.7/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
result = self.forward(*input, **kwargs)
File "/pubhome/jcyang/git/handbook/Contribute/issue64-potentialnet/dgl-lifesci/examples/binding_affinity_prediction/dgllife/model/model_zoo/potentialnet.py", line 57, in forward
h = self.stage_2_model(graph=knn_graph, feat=h)
File "/tmp/jincai_conda3/envs/dgl110/lib/python3.7/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
result = self.forward(*input, **kwargs)
File "/pubhome/jcyang/git/handbook/Contribute/issue64-potentialnet/dgl-lifesci/examples/binding_affinity_prediction/dgllife/model/model_zoo/potentialnet.py", line 180, in forward
eids = graph.edata['e'][:,i].nonzero(as_tuple=False).view(-1).type(graph.idtype)
RuntimeError: CUDA error: an illegal memory access was encountered
terminate called after throwing an instance of 'dmlc::Error'
what(): [14:29:37] /opt/dgl/src/runtime/cuda/cuda_device_api.cc:103: Check failed: e == cudaSuccess || e == cudaErrorCudartUnloading: CUDA: an illegal memory access was encountered
Stack trace:
[bt] (0) /tmp/jincai_conda3/envs/dgl110/lib/python3.7/site-packages/dgl/libdgl.so(dmlc::LogMessageFatal::~LogMessageFatal()+0x4f) [0x7fff07179c8f]
[bt] (1) /tmp/jincai_conda3/envs/dgl110/lib/python3.7/site-packages/dgl/libdgl.so(dgl::runtime::CUDADeviceAPI::FreeDataSpace(DLContext, void*)+0x15c) [0x7fff079957bc]
[bt] (2) /tmp/jincai_conda3/envs/dgl110/lib/python3.7/site-packages/dgl/libdgl.so(dgl::runtime::NDArray::Internal::DefaultDeleter(dgl::runtime::NDArray::Container*)+0x1ad) [0x7fff0785750d]
[bt] (3) /tmp/jincai_conda3/envs/dgl110/lib/python3.7/site-packages/dgl/libdgl.so(dgl::UnitGraph::COO::~COO()+0x127) [0x7fff079703e7]
[bt] (4) /tmp/jincai_conda3/envs/dgl110/lib/python3.7/site-packages/dgl/libdgl.so(dgl::UnitGraph::~UnitGraph()+0x1ba) [0x7fff0797017a]
[bt] (5) /tmp/jincai_conda3/envs/dgl110/lib/python3.7/site-packages/dgl/libdgl.so(dgl::HeteroGraph::~HeteroGraph()+0x119) [0x7fff0786f2a9]
[bt] (6) /tmp/jincai_conda3/envs/dgl110/lib/python3.7/site-packages/dgl/libdgl.so(DGLObjectFree+0xb5) [0x7fff0782e545]
[bt] (7) /tmp/jincai_conda3/envs/dgl110/lib/python3.7/site-packages/dgl/_ffi/_cy3/core.cpython-37m-x86_64-linux-gnu.so(+0x1023d) [0x7fff06c3c23d]
[bt] (8) python(+0x16bbc6) [0x5555556bfbc6]
Aborted
(dgl110) [jcyang@k223:binding_affinity_prediction]$
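Illegal memory access errors are very common. I have just confirmed that this is probably a problem with our server, and I have not yet tried those suggestions carefully.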
Could you step in with a debugger such as ipdb and take a look? Is it the case that graph.edata['e'] does not raise an error, while graph.edata['e'][:,i] does?
Yes.
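For readers following the traceback, here is a plain-Python analogue of what the failing line computes. For each column i of the edge feature matrix, it collects the IDs (row indices) of the edges whose entry in that column is nonzero. This sketch uses neither torch nor DGL. The function name and the data are made up, and it assumes each row of graph.edata['e'] marks the edge's type.

```python
def edge_ids_per_type(edge_types):
    """For each column i of an edge-type matrix, list the edge IDs
    (row indices) whose entry in column i is nonzero."""
    if not edge_types:
        return []
    n_types = len(edge_types[0])
    return [
        [eid for eid, row in enumerate(edge_types) if row[i] != 0]
        for i in range(n_types)
    ]


# Four edges, three edge types; hypothetical one-hot rows.
edge_types = [
    [1, 0, 0],
    [0, 1, 0],
    [1, 0, 0],
    [0, 0, 1],
]
ids = edge_ids_per_type(edge_types)
print(ids)  # [[0, 2], [1], [3]]
```

The computation itself is simple, so the indexing may not be the real fault. On the GPU, nonzero() has to wait for earlier kernels to finish, which is one plausible reason why an asynchronous CUDA error from an earlier operation would first surface at this line.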
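@mufeili @cyhFlight With PyTorch compiled from source, the run succeeds, and training for 100 epochs raises no error. The earlier failures were probably caused by the server's CentOS 7 being too old, so that the conda-built packages are incompatible with it.
Below are the command and the environment test results: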
cd dgl-lifesci/examples/binding_affinity_prediction
# n_spatial_conv_steps: 2
CUDA_VISIBLE_DEVICES=0 python main.py -m PotentialNet -d PDBBind_refined_pocket_random -v v2015
server: CentOS7, GTX1080Ti, CUDA 11.1

pytorch | dgl-cuda11.0 | python | success |
---|---|---|---|
source | pip | 3.6 | Yes |
conda | pip | 3.7 | No |
conda | conda | 3.7 | No |
conda | conda | 3.6 | No |
PS: My local machine runs Fedora 33.
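When comparing environments like the ones in the table above, it can help to record the installed package versions in a uniform way. The following is a minimal sketch, not part of dgl-lifesci. The function name is hypothetical, and the distribution names are assumptions that may differ by build (CUDA-specific DGL builds, for example, or conda packages that register no metadata). It uses importlib.metadata, which requires Python 3.8 or newer; the 3.6 and 3.7 environments in this thread would need the importlib_metadata backport instead.

```python
from importlib import metadata


def installed_versions(packages=("torch", "dgl", "dgllife", "rdkit")):
    """Map each distribution name to its installed version string,
    or to "not installed" when no metadata is found for it."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = "not installed"
    return versions


if __name__ == "__main__":
    for name, version in installed_versions().items():
        print(f"{name}: {version}")
```

Running this in each conda environment gives one line per package, which can be pasted next to the success or failure result for that environment.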
After downgrading PyTorch to 1.5, everything installs happily with conda. PyTorch 1.6 may also work, but I have not tried it.
conda create -n dgl92-py36-torch15 -c pytorch -c dglteam -c rdkit rdkit pytorch=1.5 torchvision cudatoolkit=9.2 dgl-cuda9.2 dgllife python=3.6
I can reproduce the accuracy provided by @cyhFlight.
Everything works on my side now. Thanks to both of you for your help. @cyhFlight @mufeili
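By comparing the differences between yuhang's and jincai's environments, I set up three conda environments (torch170, default, and rdkit2018), but none of them worked. 🤣
Could you install my environment and see whether it runs? That might help locate the problem.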
envs.zip