BVLC / caffe

Caffe: a fast open framework for deep learning.
http://caffe.berkeleyvision.org/

NCCL & LevelDB #6204

Open farhan333 opened 6 years ago

farhan333 commented 6 years ago

Is this a bug? I am training with NCCL on 2 GPUs (a 1080 and a 1060) with a LevelDB data layer, and training aborts at startup with the error below.

With a single GPU this does not happen.

It looks to me as if the LevelDB is being opened twice.

I0129 13:13:42.833976 25110 net.cpp:213] pool1 needs backward computation.
I0129 13:13:42.833979 25110 net.cpp:213] relu1 needs backward computation.
I0129 13:13:42.833981 25110 net.cpp:213] conv1 needs backward computation.
I0129 13:13:42.833984 25110 net.cpp:215] data does not need backward computation.
I0129 13:13:42.833986 25110 net.cpp:257] This network produces output loss
I0129 13:13:42.833997 25110 net.cpp:270] Network initialization done.
I0129 13:13:42.834034 25110 solver.cpp:56] Solver scaffolding done.
I0129 13:13:42.834445 25110 caffe.cpp:248] Starting Optimization
F0129 13:13:43.096998 25119 db_leveldb.cpp:16] Check failed: status.ok() Failed to open leveldb /home/farhan/intl910-200a/Training_1
IO error: lock /home/farhan/intl910-200a/Training_1/LOCK: already held by process
*** Check failure stack trace: ***
    @ 0x7f4419d835cd google::LogMessage::Fail()
    @ 0x7f4419d85433 google::LogMessage::SendToLog()
    @ 0x7f4419d8315b google::LogMessage::Flush()
    @ 0x7f4419d85e1e google::LogMessageFatal::~LogMessageFatal()
    @ 0x7f441a46fd8b caffe::db::LevelDB::Open()
    @ 0x7f441a3e27ff caffe::DataLayer<>::DataLayer()
    @ 0x7f441a3e29c2 caffe::Creator_DataLayer<>()
    @ 0x7f441a48d3e0 caffe::Net<>::Init()
    @ 0x7f441a4903fe caffe::Net<>::Net()
    @ 0x7f441a2d9405 caffe::Solver<>::InitTrainNet()
    @ 0x7f441a2da875 caffe::Solver<>::Init()
    @ 0x7f441a2dab8f caffe::Solver<>::Solver()
    @ 0x7f441a49c941 caffe::Creator_SGDSolver<>()
    @ 0x416e0c caffe::SolverRegistry<>::CreateSolver()
    @ 0x7f441a4c5ecb caffe::Worker<>::InternalThreadEntry()
    @ 0x7f441a4afba5 caffe::InternalThread::entry()
    @ 0x7f441a4b0ace boost::detail::thread_data<>::run()
    @ 0x7f4418a545d5 (unknown)
    @ 0x7f441882d6ba start_thread
    @ 0x7f4418d703dd clone
    @ (nil) (unknown)
Aborted (core dumped)

lmy418lmy commented 5 years ago

I have run into the same problem. How did you solve it?

ShawKai666 commented 5 years ago

> I have encountered the same problem and I would like to ask you how to solve it?

Convert the data to LMDB!
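
For anyone landing here, a rough sketch of one way to do that conversion without regenerating the dataset. The stack trace above shows the failing `LevelDB::Open()` being reached from a `Worker` thread that builds its own solver and net, so a second data layer tries to take a LOCK file that is already held; LMDB allows multiple concurrent readers. The script below is an assumption-laden illustration, not part of Caffe: it relies on the third-party `plyvel` and `lmdb` Python packages, the function names `copy_records` and `leveldb_to_lmdb` are made up for this example, and it assumes the records are the serialized `Datum` byte strings that Caffe's tools store under either backend, so keys and values can be copied unchanged.

```python
"""Sketch: copy every record from a Caffe LevelDB into a new LMDB."""
import sys


def copy_records(records, write_batch, batch_size=1000):
    """Pass (key, value) pairs to write_batch in chunks; return the number copied."""
    batch, total = [], 0
    for key, value in records:
        batch.append((key, value))
        if len(batch) >= batch_size:
            write_batch(batch)
            total += len(batch)
            batch = []
    if batch:  # flush the final, possibly short, chunk
        write_batch(batch)
        total += len(batch)
    return total


def leveldb_to_lmdb(src_path, dst_path, map_size=1 << 40):
    # Third-party packages (assumed installed), imported lazily so the
    # batching helper above can be used and tested without them.
    import lmdb
    import plyvel

    src = plyvel.DB(src_path, create_if_missing=False)
    # map_size is an upper bound on the LMDB size in bytes (here 1 TiB).
    env = lmdb.open(dst_path, map_size=map_size)

    def write_batch(batch):
        # One write transaction per chunk keeps memory use bounded.
        with env.begin(write=True) as txn:
            for key, value in batch:
                txn.put(key, value)

    try:
        return copy_records(src.iterator(), write_batch)
    finally:
        src.close()
        env.close()


if __name__ == "__main__" and len(sys.argv) == 3:
    # Usage: python leveldb_to_lmdb.py <leveldb_dir> <new_lmdb_dir>
    print("copied %d records" % leveldb_to_lmdb(sys.argv[1], sys.argv[2]))
```

Run it while no training job has the LevelDB open, since the conversion itself needs the lock. Afterwards the data layer's `data_param` should point `source` at the new directory and set `backend: LMDB`. If the original images are still available, re-running Caffe's own `convert_imageset` with the LMDB backend is the more conventional route.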