Open · tereka114 opened this issue 6 years ago
Hi @tereka114, it seems like there are two issues here: (1) the number of GPUs being used and (2) the double free or corruption error. Regarding (1), are you sure that Detectron is actually using 4 GPUs rather than docker? Could you try limiting docker to 1 GPU (e.g. by setting CUDA_VISIBLE_DEVICES; see also this page)?
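For reference, a minimal sketch of one way to do this from inside the container: set CUDA_VISIBLE_DEVICES in the environment of the training process before any CUDA context is created. The config path here is a placeholder, and tools/train_net.py is assumed to be the usual Detectron entry point; adjust both to your setup.

```python
# Minimal sketch: launch Detectron training with only GPU 0 visible.
# "configs/my_config.yaml" is a placeholder path for illustration.
import os
import subprocess

env = dict(os.environ)
env["CUDA_VISIBLE_DEVICES"] = "0"  # hide the other GPUs from the child process

subprocess.check_call(
    ["python", "tools/train_net.py", "--cfg", "configs/my_config.yaml"],
    env=env,
)
```

Exporting the variable in the shell before launching the script has the same effect; the key point is that it must be set before CUDA is initialized.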
Hi @ir413, thank you for your comment. I already tried setting CUDA_VISIBLE_DEVICES after reading that post (the excluded GPUs are no longer used), but the same error still happens sometimes.
Maybe try changing the batch size to a low value, for example 1, and then increase it.
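In case it helps, a small sketch of lowering the batch size by editing the YAML config with PyYAML. The key name TRAIN.IMS_PER_BATCH and the config path are assumptions here, so adjust them to whatever your config actually defines.

```python
# Sketch: reduce the per-GPU images-per-batch value in a Detectron YAML config.
# Assumes PyYAML is installed; key name and path are placeholders.
import yaml

cfg_path = "configs/my_config.yaml"  # placeholder path

with open(cfg_path) as f:
    cfg = yaml.safe_load(f)

cfg.setdefault("TRAIN", {})["IMS_PER_BATCH"] = 1  # start small, then raise it

with open(cfg_path, "w") as f:
    yaml.safe_dump(cfg, f, default_flow_style=False)
```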
Try removing docker-compose.
I am trying to train a custom model with Detectron, but an error sometimes occurs during training.
Questions
1. I use the official Docker image. Inside the container I run the training program, and that process uses 4 GPUs on the host (one GPU is used heavily, the others only a little). Why does Detectron (Caffe2) use all 4 GPUs? (See the sketch after this list for a way to check.)
2. I want the training to run to completion over a long run. Please advise on how to solve this.
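As a quick check for question 1, a sketch that restricts the visible devices and then asks Caffe2 how many GPUs it can see; the NumCudaDevices call is assumed from caffe2.python.workspace (newer builds may expose NumGpuDevices instead).

```python
# Sketch: verify how many GPUs Caffe2 itself can see inside the container.
import os
os.environ.setdefault("CUDA_VISIBLE_DEVICES", "0")  # restrict before CUDA init

from caffe2.python import workspace

print("CUDA_VISIBLE_DEVICES =", os.environ["CUDA_VISIBLE_DEVICES"])
# NumCudaDevices is assumed from the Caffe2 Python workspace module.
print("GPUs visible to Caffe2:", workspace.NumCudaDevices())
```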
The details are as follows.
Expected results
Training runs to completion.
Actual results
Training stops partway through.
Error Message
Result of nvidia-smi (host)
Result of nvidia-smi (container)
Reproduce Command
Command
Settings
System information