lswzjuer opened this issue 4 years ago (status: Open)
The error message is: /pytorch/aten/src/THCUNN/BCECriterion.cu:57: void bce_updateOutput_no_reduce_functor<Dtype, Acctype>::operator()(const Dtype *, const Dtype *, Dtype *) [with Dtype = float, Acctype = float]: block: [108,0,0], thread: [255,0,0] Assertion `*input >= 0. && *input <= 1.` failed.
Training command: sudo CUDA_VISIBLE_DEVICES=0,1 python train.py --config=yolact_im300_config --batch_size=16
Try the latest master branch. I just added a fix for the inf / nan loss explosion issue (see #222).
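One detail that may help when reading the assertion above: every ordered comparison against NaN is false, so a NaN prediction fails the `>= 0 && <= 1` check just like an out-of-range value does. The sketch below illustrates this in plain Python. `bce_input_ok` and `first_bad_index` are hypothetical helpers written for illustration; they are not part of PyTorch or YOLACT.

```python
def bce_input_ok(p):
    """Mirror the kernel's range check: True only if 0 <= p <= 1.

    NaN fails because every ordered comparison with NaN is False.
    """
    return p >= 0.0 and p <= 1.0


def first_bad_index(probs):
    """Return the index of the first value that would trip the check, or None."""
    for i, p in enumerate(probs):
        if not bce_input_ok(p):
            return i
    return None


if __name__ == "__main__":
    # Hypothetical predictions: the last one has become NaN.
    print(first_bad_index([0.1, 0.9, float("nan")]))
```

A check like this, run on the predictions just before the loss is computed, can show whether the assertion comes from NaN or inf values or from values that are simply outside [0, 1].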
As for what modifications to make, what FPS do you need? Is the performance of ResNet-50 on 550x550 images acceptable? I ask because the architecture itself is not very good at handling very small objects, so it might be very beneficial to upscale the images to 550x550 and then classify them.
If you still want to use 300x300, I'd halve all of the anchor sizes in "pred_scales" (in yolact_base). Also, if you want to train the model on 300x300 images, set max_size to 300.
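As a rough sketch of the halving suggested above: the values in `example_scales` are assumed for illustration and should be replaced with whatever "pred_scales" actually contains in your copy of yolact_base. `halve_scales` is a hypothetical helper, not a function from the repository.

```python
def halve_scales(pred_scales):
    """Return a new nested list of anchor sizes with every value halved."""
    return [[s / 2 for s in layer] for layer in pred_scales]


# Assumed example values, one inner list per prediction layer.
example_scales = [[24], [48], [96], [192], [384]]

if __name__ == "__main__":
    print(halve_scales(example_scales))
```

The result would then be used as "pred_scales" in the config you train with, alongside max_size set to 300.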
Hi,
For a batch size of 32, with 4x 32GB Tesla V100 GPUs, I am getting this error:
RuntimeError: DataLoader worker (pid 1851) is killed by signal: Bus error.
It seems that only one GPU was utilized despite specifying all 4 GPUs for training.
I am also getting this warning:
Per-GPU batch size is less than the recommended limit for batch norm. Disabling batch norm.
The entire log:
ERROR: Unexpected bus error encountered in worker. This might be caused by insufficient shared memory (shm).
Traceback (most recent call last):
  File "/opt/conda/lib/python3.6/multiprocessing/queues.py", line 240, in _feed
    send_bytes(obj)
  File "/opt/conda/lib/python3.6/multiprocessing/connection.py", line 200, in send_bytes
    self._send_bytes(m[offset:offset + size])
  File "/opt/conda/lib/python3.6/multiprocessing/connection.py", line 404, in _send_bytes
    self._send(header + buf)
  File "/opt/conda/lib/python3.6/multiprocessing/connection.py", line 368, in _send
    n = write(self._handle, buf)
BrokenPipeError: [Errno 32] Broken pipe
Traceback (most recent call last):
  File "train.py", line 504, in
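Since the log points at insufficient shared memory, checking how much space the shared-memory mount has free may help narrow things down. The sketch below assumes a Linux system where shared memory is mounted at `/dev/shm` (an assumption; the path can be passed in). `shm_free_gib` is a hypothetical helper name.

```python
import os
import shutil


def shm_free_gib(path="/dev/shm"):
    """Return free space at `path` in GiB, or None if the path does not exist."""
    if not os.path.exists(path):
        return None
    usage = shutil.disk_usage(path)  # named tuple: total, used, free (in bytes)
    return usage.free / 2**30


if __name__ == "__main__":
    free = shm_free_gib()
    if free is None:
        print("/dev/shm not found on this system")
    else:
        print(f"/dev/shm free: {free:.2f} GiB")
```

If the reported value is small, which is common inside containers with a default shared-memory size, the usual things to try are enlarging shared memory or lowering the number of DataLoader workers.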
Hello, your work is very good! Thanks for your code.
ISSUE:
Thanks