Open beanliao opened 5 years ago
@beanliao please attach your prototxt files if possible.
Please refer to the files below. Thanks.
@beanliao Thank you. Could you please run
nvidia-smi topo -m
and
nvidia-smi topo -p2p n
and paste outputs here?
@drnikolaev Thanks for checking this. Here are the outputs:
nvidia-smi topo -m

| | GPU0 | GPU1 | GPU2 | GPU3 | GPU4 | GPU5 | GPU6 | GPU7 |
|---|---|---|---|---|---|---|---|---|
| GPU0 | X | PIX | PXB | PXB | SYS | SYS | SYS | SYS |
| GPU1 | PIX | X | PXB | PXB | SYS | SYS | SYS | SYS |
| GPU2 | PXB | PXB | X | PIX | SYS | SYS | SYS | SYS |
| GPU3 | PXB | PXB | PIX | X | SYS | SYS | SYS | SYS |
| GPU4 | SYS | SYS | SYS | SYS | X | PIX | PXB | PXB |
| GPU5 | SYS | SYS | SYS | SYS | PIX | X | PXB | PXB |
| GPU6 | SYS | SYS | SYS | SYS | PXB | PXB | X | PIX |
| GPU7 | SYS | SYS | SYS | SYS | PXB | PXB | PIX | X |

nvidia-smi topo -p2p n

| | GPU0 | GPU1 | GPU2 | GPU3 | GPU4 | GPU5 | GPU6 | GPU7 |
|---|---|---|---|---|---|---|---|---|
| GPU0 | X | NS | NS | NS | NS | NS | NS | NS |
| GPU1 | NS | X | NS | NS | NS | NS | NS | NS |
| GPU2 | NS | NS | X | NS | NS | NS | NS | NS |
| GPU3 | NS | NS | NS | X | NS | NS | NS | NS |
| GPU4 | NS | NS | NS | NS | X | NS | NS | NS |
| GPU5 | NS | NS | NS | NS | NS | X | NS | NS |
| GPU6 | NS | NS | NS | NS | NS | NS | X | NS |
| GPU7 | NS | NS | NS | NS | NS | NS | NS | X |
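
To make the topology matrix easier to read, here is a minimal sketch that lists, for each GPU, the peers it reaches without a SYS hop (in the nvidia-smi legend, SYS means the path crosses the interconnect between NUMA nodes, while PIX and PXB stay on PCIe bridges). The variable `topo_rows` and the function `local_peers` are hypothetical names introduced for illustration; the rows are copied from the `nvidia-smi topo -m` output above, with the header row omitted.

```shell
# Rows copied from the `nvidia-smi topo -m` matrix above (header row omitted).
topo_rows='GPU0 | X | PIX | PXB | PXB | SYS | SYS | SYS | SYS |
GPU1 | PIX | X | PXB | PXB | SYS | SYS | SYS | SYS |
GPU2 | PXB | PXB | X | PIX | SYS | SYS | SYS | SYS |
GPU3 | PXB | PXB | PIX | X | SYS | SYS | SYS | SYS |
GPU4 | SYS | SYS | SYS | SYS | X | PIX | PXB | PXB |
GPU5 | SYS | SYS | SYS | SYS | PIX | X | PXB | PXB |
GPU6 | SYS | SYS | SYS | SYS | PXB | PXB | X | PIX |
GPU7 | SYS | SYS | SYS | SYS | PXB | PXB | PIX | X |'

# Hypothetical helper: for each row, print the GPUs whose entry is
# neither X (self), SYS (cross-socket), nor empty (trailing cell).
local_peers() {
  printf '%s\n' "$1" | awk -F'|' '
    {
      gsub(/[ \t]/, "")            # strip padding; awk re-splits the fields
      line = $1 ":"
      for (i = 2; i <= NF; i++)    # column i holds the entry for GPU(i-2)
        if ($i != "X" && $i != "SYS" && $i != "")
          line = line " GPU" (i - 2)
      print line
    }'
}

local_peers "$topo_rows"
```

On this matrix the helper splits the devices into two groups, GPU0 to GPU3 and GPU4 to GPU7, which matches the two `-gpu` lists discussed in this thread.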
I found that training fails if I use 4 of the 8 GPUs, as in the following command:
caffe train --solver=models/bvlc_googlenet/solver_fp16_4.prototxt -gpu=4,5,6,7
Error message: F0717 00:12:37.478988 604 blob.cpp:289] Check failed: error == cudaSuccess (700 vs. 0) an illegal memory access was encountered (2)
A workaround is to add "CUDA_VISIBLE_DEVICES=4,5,6,7" before "caffe train ...". Note: I have checked that this is not an out-of-memory problem, because training works fine if I choose "-gpu=0,1,2,3".
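
A minimal sketch of how the workaround command could be assembled. It relies on standard CUDA behavior: a process started with `CUDA_VISIBLE_DEVICES` set sees only the listed devices, renumbered from 0. The thread does not say which `-gpu` list was used together with the workaround, so passing the renumbered list is an assumption here, and `remap_gpu_list` is a hypothetical helper. The script only prints the command; it does not run it.

```shell
# Physical GPU IDs to expose to the process (from the report above).
physical_ids="4,5,6,7"

# Hypothetical helper: given a comma-separated list of N physical IDs,
# print the logical IDs 0..N-1 that CUDA assigns to the visible devices.
remap_gpu_list() {
  count=$(printf '%s\n' "$1" | tr ',' '\n' | grep -c .)
  i=0
  out=""
  while [ "$i" -lt "$count" ]; do
    out="${out:+$out,}$i"          # append with a comma after the first ID
    i=$((i + 1))
  done
  printf '%s\n' "$out"
}

logical_ids=$(remap_gpu_list "$physical_ids")

# Assumption: -gpu takes the renumbered IDs once CUDA_VISIBLE_DEVICES is set.
echo "CUDA_VISIBLE_DEVICES=$physical_ids caffe train --solver=models/bvlc_googlenet/solver_fp16_4.prototxt -gpu=$logical_ids"
```

With the renumbering, physical GPUs 4 to 7 appear to the process as devices 0 to 3.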
I hope someone could check this issue. Thanks in advance.
Info:
NVIDIA Docker: Caffe:19.06
NVCaffe: 0.17.3
CuDNN: 7.6.0
NCCL: 2.4.7
Model: bvlc_googlenet
Batch size: 256
More logs:
I0717 00:12:36.297857 545 data_layer.cpp:107] [n0.d4.r0] Transformer threads: 4 (auto)
I0717 00:12:36.389331 609 internal_thread.cpp:78] Started internal thread 609 on device 4, rank 0
I0717 00:12:36.389572 609 db_lmdb.cpp:36] Opened lmdb examples/imagenet/ilsvrc12_train_lmdb
I0717 00:12:36.399473 600 internal_thread.cpp:78] Started internal thread 600 on device 4, rank 0
I0717 00:12:36.405875 599 internal_thread.cpp:78] Started internal thread 599 on device 4, rank 0
I0717 00:12:36.408145 598 internal_thread.cpp:78] Started internal thread 598 on device 4, rank 0
I0717 00:12:36.409735 601 internal_thread.cpp:78] Started internal thread 601 on device 4, rank 0
F0717 00:12:37.478988 604 blob.cpp:289] Check failed: error == cudaSuccess (700 vs. 0) an illegal memory access was encountered (2)
Check failure stack trace:
I0717 00:12:37.488199 597 blocking_queue.cpp:40] Waiting for datum
F0717 00:12:37.490514 589 syncedmem.cpp:18] Check failed: error == cudaSuccess (700 vs. 0) an illegal memory access was encountered
Check failure stack trace:
@ 0x7fa8cf9345cd google::LogMessage::Fail()
@ 0x7fa8cf9345cd google::LogMessage::Fail()
@ 0x7fa8cf936433 google::LogMessage::SendToLog()
@ 0x7fa8cf936433 google::LogMessage::SendToLog()
@ 0x7fa8cf93415b google::LogMessage::Flush()
@ 0x7fa8cf93415b google::LogMessage::Flush()
@ 0x7fa8cf936e1e google::LogMessageFatal::~LogMessageFatal()
@ 0x7fa8cf936e1e google::LogMessageFatal::~LogMessageFatal()
F0717 00:12:37.490514 589 syncedmem.cpp:18] Check failed: error == cudaSuccess (700 vs. 0) an illegal memory access was encountered
F0717 00:12:37.506527 593 syncedmem.cpp:18] Check failed: error == cudaSuccess (700 vs. 0) an illegal memory access was encountered
Check failure stack trace:
@ 0x7fa8cf9345cd google::LogMessage::Fail()
@ 0x7fa8d0359052 caffe::Blob::CopyFrom()
@ 0x7fa8cf936433 google::LogMessage::SendToLog()
@ 0x7fa8cf93415b google::LogMessage::Flush()
@ 0x7fa8d07d50d2 caffe::SyncedMemory::MallocHost()
@ 0x7fa8cf936e1e google::LogMessageFatal::~LogMessageFatal()
@ 0x7fa8d07dbbcb caffe::BatchTransformer<>::InternalThreadEntry()
@ 0x7fa8d07d50d2 caffe::SyncedMemory::MallocHost()
@ 0x7fa8d07d5140 caffe::SyncedMemory::to_cpu()
@ 0x7fa8d07d5140 caffe::SyncedMemory::to_cpu()
@ 0x7fa8d02bdbb2 caffe::InternalThread::entry()
@ 0x7fa8d07d64fd caffe::SyncedMemory::mutable_cpu_data()
@ 0x7fa8d07d64fd caffe::SyncedMemory::mutable_cpu_data()
@ 0x7fa8d02bfc2f boost::detail::thread_data<>::run()
@ 0x7fa8cdcaf5d5 (unknown)
@ 0x7fa8d071ff23 caffe::DataLayer<>::load_batch()
@ 0x7fa8d071ff23 caffe::DataLayer<>::load_batch()
@ 0x7fa8cd5686ba start_thread
@ 0x7fa8cdfcb41d clone
@ (nil) (unknown)
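
Because several threads write to the log at once, the fatal entries are interleaved with stack frames from other threads. As a rough aid, the sketch below pulls out only the fatal (F) glog lines and prints the thread ID and source location for each. The glog prefix is severity letter plus date, time, thread ID, then file:line. The names `log_sample` and `fatal_sites` are hypothetical, and the sample lines are copied from the log above.

```shell
# Sample lines copied from the log above.
log_sample='I0717 00:12:36.409735 601 internal_thread.cpp:78] Started internal thread 601 on device 4, rank 0
F0717 00:12:37.478988 604 blob.cpp:289] Check failed: error == cudaSuccess (700 vs. 0) an illegal memory access was encountered (2)
I0717 00:12:37.488199 597 blocking_queue.cpp:40] Waiting for datum
F0717 00:12:37.490514 589 syncedmem.cpp:18] Check failed: error == cudaSuccess (700 vs. 0) an illegal memory access was encountered
F0717 00:12:37.506527 593 syncedmem.cpp:18] Check failed: error == cudaSuccess (700 vs. 0) an illegal memory access was encountered'

# Hypothetical helper: keep lines that start with F followed by the
# four-digit date, then print field 3 (thread ID) and field 4 (file:line).
fatal_sites() {
  printf '%s\n' "$1" | awk '
    /^F[0-9][0-9][0-9][0-9] / {
      sub(/\]$/, "", $4)           # drop the closing bracket after file:line
      print "thread " $3 ": " $4
    }'
}

fatal_sites "$log_sample"
```

On the sample, this reports fatal checks from three different threads: one in blob.cpp:289 and two in syncedmem.cpp:18.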