NVIDIA / caffe

Caffe: a fast open framework for deep learning.
http://caffe.berkeleyvision.org/

[blob.cpp:289] Check failed: error == cudaSuccess (700 vs. 0) an illegal memory access was encountered - Training with 4 of 8 GPUs fails #574

Open · beanliao opened this issue 5 years ago

beanliao commented 5 years ago

I found that if I use 4 of the 8 GPUs, training fails.

caffe train --solver=models/bvlc_googlenet/solver_fp16_4.prototxt -gpu=4,5,6,7

Error message: F0717 00:12:37.478988 604 blob.cpp:289] Check failed: error == cudaSuccess (700 vs. 0) an illegal memory access was encountered (2)

A workaround is to add "CUDA_VISIBLE_DEVICES=4,5,6,7" before "caffe train ...", as in the sketch below. Note: I have checked that this is not an out-of-memory issue, because training works fine if I choose "-gpu=0,1,2,3".
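For reference, a minimal sketch of both invocations. The "-gpu=0,1,2,3" in the workaround is an assumption on my part: once CUDA_VISIBLE_DEVICES masks the device list, CUDA renumbers the visible GPUs starting from 0.

# Failing case: selecting physical GPUs 4-7 directly
caffe train --solver=models/bvlc_googlenet/solver_fp16_4.prototxt -gpu=4,5,6,7

# Workaround: hide GPUs 0-3 from the process. The four visible devices are
# renumbered 0-3, so -gpu refers to those masked indices (assumed here).
CUDA_VISIBLE_DEVICES=4,5,6,7 caffe train --solver=models/bvlc_googlenet/solver_fp16_4.prototxt -gpu=0,1,2,3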

I hope someone could check this issue. Thanks in advance.

Info:
NVIDIA Docker image: Caffe:19.06
NVCaffe: 0.17.3
cuDNN: 7.6.0
NCCL: 2.4.7
Model: bvlc_googlenet
Batch size: 256

More logs:

I0717 00:12:36.297857 545 data_layer.cpp:107] [n0.d4.r0] Transformer threads: 4 (auto)
I0717 00:12:36.389331 609 internal_thread.cpp:78] Started internal thread 609 on device 4, rank 0
I0717 00:12:36.389572 609 db_lmdb.cpp:36] Opened lmdb examples/imagenet/ilsvrc12_train_lmdb
I0717 00:12:36.399473 600 internal_thread.cpp:78] Started internal thread 600 on device 4, rank 0
I0717 00:12:36.405875 599 internal_thread.cpp:78] Started internal thread 599 on device 4, rank 0
I0717 00:12:36.408145 598 internal_thread.cpp:78] Started internal thread 598 on device 4, rank 0
I0717 00:12:36.409735 601 internal_thread.cpp:78] Started internal thread 601 on device 4, rank 0
F0717 00:12:37.478988 604 blob.cpp:289] Check failed: error == cudaSuccess (700 vs. 0) an illegal memory access was encountered (2)
Check failure stack trace:
I0717 00:12:37.488199 597 blocking_queue.cpp:40] Waiting for datum
F0717 00:12:37.490514 589 syncedmem.cpp:18] Check failed: error == cudaSuccess (700 vs. 0) an illegal memory access was encountered
Check failure stack trace:
    @ 0x7fa8cf9345cd google::LogMessage::Fail()
    @ 0x7fa8cf9345cd google::LogMessage::Fail()
    @ 0x7fa8cf936433 google::LogMessage::SendToLog()
    @ 0x7fa8cf936433 google::LogMessage::SendToLog()
    @ 0x7fa8cf93415b google::LogMessage::Flush()
    @ 0x7fa8cf93415b google::LogMessage::Flush()
    @ 0x7fa8cf936e1e google::LogMessageFatal::~LogMessageFatal()
    @ 0x7fa8cf936e1e google::LogMessageFatal::~LogMessageFatal()
F0717 00:12:37.490514 589 syncedmem.cpp:18] Check failed: error == cudaSuccess (700 vs. 0) an illegal memory access was encountered
F0717 00:12:37.506527 593 syncedmem.cpp:18] Check failed: error == cudaSuccess (700 vs. 0) an illegal memory access was encountered
Check failure stack trace:
    @ 0x7fa8cf9345cd google::LogMessage::Fail()
    @ 0x7fa8d0359052 caffe::Blob::CopyFrom()
    @ 0x7fa8cf936433 google::LogMessage::SendToLog()
    @ 0x7fa8cf93415b google::LogMessage::Flush()
    @ 0x7fa8d07d50d2 caffe::SyncedMemory::MallocHost()
    @ 0x7fa8cf936e1e google::LogMessageFatal::~LogMessageFatal()
    @ 0x7fa8d07dbbcb caffe::BatchTransformer<>::InternalThreadEntry()
    @ 0x7fa8d07d50d2 caffe::SyncedMemory::MallocHost()
    @ 0x7fa8d07d5140 caffe::SyncedMemory::to_cpu()
    @ 0x7fa8d07d5140 caffe::SyncedMemory::to_cpu()
    @ 0x7fa8d02bdbb2 caffe::InternalThread::entry()
    @ 0x7fa8d07d64fd caffe::SyncedMemory::mutable_cpu_data()
    @ 0x7fa8d07d64fd caffe::SyncedMemory::mutable_cpu_data()
    @ 0x7fa8d02bfc2f boost::detail::thread_data<>::run()
    @ 0x7fa8cdcaf5d5 (unknown)
    @ 0x7fa8d071ff23 caffe::DataLayer<>::load_batch()
    @ 0x7fa8d071ff23 caffe::DataLayer<>::load_batch()
    @ 0x7fa8cd5686ba start_thread
    @ 0x7fa8cdfcb41d clone
    @ (nil) (unknown)

drnikolaev commented 5 years ago

@beanliao please attach your prototxt files if possible.

beanliao commented 5 years ago

> @beanliao please attach your prototxt files if possible.

Please refer to the files below. Thanks.

solver_fp16_4.prototxt.txt
train_val_fp16_4.prototxt.txt

drnikolaev commented 5 years ago

@beanliao thank you. Could you please run "nvidia-smi topo -m" and "nvidia-smi topo -p2p n" and paste the outputs here?

beanliao commented 5 years ago

@drnikolaev Thanks for checking this. Here are the outputs:

nvidia-smi topo -m:
  GPU0 GPU1 GPU2 GPU3 GPU4 GPU5 GPU6 GPU7
GPU0 X PIX PXB PXB SYS SYS SYS SYS
GPU1 PIX X PXB PXB SYS SYS SYS SYS
GPU2 PXB PXB X PIX SYS SYS SYS SYS
GPU3 PXB PXB PIX X SYS SYS SYS SYS
GPU4 SYS SYS SYS SYS X PIX PXB PXB
GPU5 SYS SYS SYS SYS PIX X PXB PXB
GPU6 SYS SYS SYS SYS PXB PXB X PIX
GPU7 SYS SYS SYS SYS PXB PXB PIX X

nvidia-smi topo -p2p n:
  GPU0 GPU1 GPU2 GPU3 GPU4 GPU5 GPU6 GPU7
GPU0 X NS NS NS NS NS NS NS
GPU1 NS X NS NS NS NS NS NS
GPU2 NS NS X NS NS NS NS NS
GPU3 NS NS NS X NS NS NS NS
GPU4 NS NS NS NS X NS NS NS
GPU5 NS NS NS NS NS X NS NS
GPU6 NS NS NS NS NS NS X NS
GPU7 NS NS NS NS NS NS NS X