NVIDIA / caffe

Caffe: a fast open framework for deep learning.
http://caffe.berkeleyvision.org/

Program exits with sigsegv when training with only one gpu in 0.17 #582

Closed lxlee closed 3 years ago

lxlee commented 4 years ago

Hi, I downloaded the source code and compiled it with the CUDNN, NCCL, and MPI flags on. When I terminated the program while training with only one GPU, it exited with SIGSEGV. I debugged the program and found that the callback pointer used in the Solver::Step function is nullptr. Should I add an "if" check before callback->cancel_all() to fix the bug, or did something go wrong during compilation? Many thanks.

drnikolaev commented 4 years ago

@lxlee may I see the code please?

lxlee commented 4 years ago

@drnikolaev Thanks for the response. The original code is in solver.cpp:

if (SolverAction::STOP == request) {
  callback_->cancel_all();
  total_lapse_ += iteration_timer_->Seconds();
  // Break out of training loop.
  break;
}

The command I ran was: caffe train -gpu 0 -solver solver.prototxt

The error I got was:

^C*** Aborted at 1578452364 (unix time) try "date -d @1578452364" if you are using GNU date ***
PC: @     0x7fcb6fc31761 caffe::Solver::Step()
*** SIGSEGV (@0x0) received by PID 24155 (TID 0x7fcb709e8ec0) from PID 0; stack trace: ***
    @     0x7fcb6ce27f20 (unknown)
    @     0x7fcb6fc31761 caffe::Solver::Step()
    @     0x7fcb6fc31e69 caffe::Solver::Solve()
    @     0x564ac8441e03 train()
    @     0x564ac8446db5 main
    @     0x7fcb6ce0ab97 __libc_start_main
    @     0x564ac844004a _start

So I want to add a null check like this:

if (SolverAction::STOP == request) {
  if (callback_ != nullptr)
    callback_->cancel_all();
  total_lapse_ += iteration_timer_->Seconds();
  // Break out of training loop.
  break;
}

because callback_ is nullptr in single-GPU mode. Is that OK?
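For anyone who wants to try the null-check idea outside of Caffe, here is a minimal standalone sketch. Everything in it is a hypothetical stand-in: MiniSolver, Callback, and the stop_at parameter are invented for illustration and are not Caffe's real classes or signatures. It only assumes what the report above describes, namely that the callback pointer can be null when a single GPU is used and that a stop request triggers cancel_all().

```cpp
// Hypothetical stand-ins for illustration only; these are NOT Caffe's real types.
enum class SolverAction { NONE, STOP };

// Minimal callback that records how often cancel_all() was invoked.
struct Callback {
  int cancel_calls = 0;
  void cancel_all() { ++cancel_calls; }
};

class MiniSolver {
 public:
  // callback may be nullptr, mirroring the single-GPU case reported above.
  explicit MiniSolver(Callback* callback) : callback_(callback) {}

  // Runs up to max_iters iterations and simulates a stop request arriving
  // at iteration stop_at (an assumed parameter, used here in place of a
  // real signal handler). Returns the number of completed iterations.
  int Step(int max_iters, int stop_at) {
    int done = 0;
    for (int iter = 0; iter < max_iters; ++iter) {
      SolverAction request =
          (iter == stop_at) ? SolverAction::STOP : SolverAction::NONE;
      if (SolverAction::STOP == request) {
        // Guard the call so a null callback no longer dereferences nullptr.
        if (callback_ != nullptr) {
          callback_->cancel_all();
        }
        // Break out of training loop.
        break;
      }
      ++done;
    }
    return done;
  }

 private:
  Callback* callback_;  // Not owned; may be nullptr.
};
```

Constructing MiniSolver with nullptr and calling Step(10, 3) stops cleanly after three iterations, while passing a real Callback also records exactly one cancel_all() call. Whether the guard alone is the right fix in the real solver depends on why callback_ is left unset in single-GPU mode, which this sketch does not address.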

drnikolaev commented 3 years ago

v0.17.4