NVIDIA / caffe

Caffe: a fast open framework for deep learning.
http://caffe.berkeleyvision.org/

[blob.cpp:289] Check failed: error == cudaSuccess (700 vs. 0) an illegal memory access was encountered - Training with 4 of 8 GPUs fails #574

Open · beanliao opened this issue 5 years ago

beanliao commented 5 years ago

I found that if I use 4 of the 8 GPUs, training fails.

caffe train --solver=models/bvlc_googlenet/solver_fp16_4.prototxt -gpu=4,5,6,7

Error message: F0717 00:12:37.478988 604 blob.cpp:289] Check failed: error == cudaSuccess (700 vs. 0) an illegal memory access was encountered (2)

A workaround is to add "CUDA_VISIBLE_DEVICES=4,5,6,7" before "caffe train ...", as in the sketch below. Note: I have checked that this is not an out-of-memory issue, because training works fine if I choose "-gpu=0,1,2,3".
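For reference, a minimal sketch of both invocations. The "-gpu=0,1,2,3" in the workaround is an assumption on my part: once CUDA_VISIBLE_DEVICES masks the device list, CUDA renumbers the visible GPUs starting from 0.

# Failing case: selecting physical GPUs 4-7 directly
caffe train --solver=models/bvlc_googlenet/solver_fp16_4.prototxt -gpu=4,5,6,7

# Workaround: hide GPUs 0-3 from the process. The four visible devices are
# renumbered 0-3, so -gpu refers to those masked indices (assumed here).
CUDA_VISIBLE_DEVICES=4,5,6,7 caffe train --solver=models/bvlc_googlenet/solver_fp16_4.prototxt -gpu=0,1,2,3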

I hope someone could check this issue. Thanks in advance.

Info:
NVIDIA Docker image: Caffe:19.06
NVCaffe: 0.17.3
cuDNN: 7.6.0
NCCL: 2.4.7
Model: bvlc_googlenet
Batch size: 256

More logs:

I0717 00:12:36.297857 545 data_layer.cpp:107] [n0.d4.r0] Transformer threads: 4 (auto)
I0717 00:12:36.389331 609 internal_thread.cpp:78] Started internal thread 609 on device 4, rank 0
I0717 00:12:36.389572 609 db_lmdb.cpp:36] Opened lmdb examples/imagenet/ilsvrc12_train_lmdb
I0717 00:12:36.399473 600 internal_thread.cpp:78] Started internal thread 600 on device 4, rank 0
I0717 00:12:36.405875 599 internal_thread.cpp:78] Started internal thread 599 on device 4, rank 0
I0717 00:12:36.408145 598 internal_thread.cpp:78] Started internal thread 598 on device 4, rank 0
I0717 00:12:36.409735 601 internal_thread.cpp:78] Started internal thread 601 on device 4, rank 0
F0717 00:12:37.478988 604 blob.cpp:289] Check failed: error == cudaSuccess (700 vs. 0) an illegal memory access was encountered (2)
Check failure stack trace:
I0717 00:12:37.488199 597 blocking_queue.cpp:40] Waiting for datum
F0717 00:12:37.490514 589 syncedmem.cpp:18] Check failed: error == cudaSuccess (700 vs. 0) an illegal memory access was encountered
Check failure stack trace:
    @ 0x7fa8cf9345cd google::LogMessage::Fail()
    @ 0x7fa8cf9345cd google::LogMessage::Fail()
    @ 0x7fa8cf936433 google::LogMessage::SendToLog()
    @ 0x7fa8cf936433 google::LogMessage::SendToLog()
    @ 0x7fa8cf93415b google::LogMessage::Flush()
    @ 0x7fa8cf93415b google::LogMessage::Flush()
    @ 0x7fa8cf936e1e google::LogMessageFatal::~LogMessageFatal()
    @ 0x7fa8cf936e1e google::LogMessageFatal::~LogMessageFatal()
F0717 00:12:37.490514 589 syncedmem.cpp:18] Check failed: error == cudaSuccess (700 vs. 0) an illegal memory access was encountered
F0717 00:12:37.506527 593 syncedmem.cpp:18] Check failed: error == cudaSuccess (700 vs. 0) an illegal memory access was encountered
Check failure stack trace:
    @ 0x7fa8cf9345cd google::LogMessage::Fail()
    @ 0x7fa8d0359052 caffe::Blob::CopyFrom()
    @ 0x7fa8cf936433 google::LogMessage::SendToLog()
    @ 0x7fa8cf93415b google::LogMessage::Flush()
    @ 0x7fa8d07d50d2 caffe::SyncedMemory::MallocHost()
    @ 0x7fa8cf936e1e google::LogMessageFatal::~LogMessageFatal()
    @ 0x7fa8d07dbbcb caffe::BatchTransformer<>::InternalThreadEntry()
    @ 0x7fa8d07d50d2 caffe::SyncedMemory::MallocHost()
    @ 0x7fa8d07d5140 caffe::SyncedMemory::to_cpu()
    @ 0x7fa8d07d5140 caffe::SyncedMemory::to_cpu()
    @ 0x7fa8d02bdbb2 caffe::InternalThread::entry()
    @ 0x7fa8d07d64fd caffe::SyncedMemory::mutable_cpu_data()
    @ 0x7fa8d07d64fd caffe::SyncedMemory::mutable_cpu_data()
    @ 0x7fa8d02bfc2f boost::detail::thread_data<>::run()
    @ 0x7fa8cdcaf5d5 (unknown)
    @ 0x7fa8d071ff23 caffe::DataLayer<>::load_batch()
    @ 0x7fa8d071ff23 caffe::DataLayer<>::load_batch()
    @ 0x7fa8cd5686ba start_thread
    @ 0x7fa8cdfcb41d clone
    @ (nil) (unknown)

drnikolaev commented 5 years ago

@beanliao please attach your prototxt files if possible.

beanliao commented 5 years ago

> @beanliao please attach your prototxt files if possible.

Please refer to the files below. Thanks.

solver_fp16_4.prototxt.txt
train_val_fp16_4.prototxt.txt

drnikolaev commented 5 years ago

@beanliao thank you. Could you please run "nvidia-smi topo -m" and "nvidia-smi topo -p2p n" and paste the outputs here?

beanliao commented 5 years ago

@drnikolaev Thanks for checking this. Here are the outputs:

nvidia-smi topo -m:
  GPU0 GPU1 GPU2 GPU3 GPU4 GPU5 GPU6 GPU7
GPU0 X PIX PXB PXB SYS SYS SYS SYS
GPU1 PIX X PXB PXB SYS SYS SYS SYS
GPU2 PXB PXB X PIX SYS SYS SYS SYS
GPU3 PXB PXB PIX X SYS SYS SYS SYS
GPU4 SYS SYS SYS SYS X PIX PXB PXB
GPU5 SYS SYS SYS SYS PIX X PXB PXB
GPU6 SYS SYS SYS SYS PXB PXB X PIX
GPU7 SYS SYS SYS SYS PXB PXB PIX X

nvidia-smi topo -p2p n:
  GPU0 GPU1 GPU2 GPU3 GPU4 GPU5 GPU6 GPU7
GPU0 X NS NS NS NS NS NS NS
GPU1 NS X NS NS NS NS NS NS
GPU2 NS NS X NS NS NS NS NS
GPU3 NS NS NS X NS NS NS NS
GPU4 NS NS NS NS X NS NS NS
GPU5 NS NS NS NS NS X NS NS
GPU6 NS NS NS NS NS NS X NS
GPU7 NS NS NS NS NS NS NS X