Closed: Pampamkun27 closed this issue 3 years ago.
It looks like your run fails during the evaluation step, not during training itself. The batch size used for evaluation during training is currently hardcoded to 8, which is not ideal and should be fixed at some point. You can try changing it at line 160 in train.py.
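A minimal sketch of one way to avoid hardcoding that value: expose it as a command-line flag and pass it through to the evaluation call. The flag name `--eval_batch_size` and the commented `evaluate(...)` call are illustrative assumptions, not options that exist in the repo.

```python
import argparse

# Sketch: make the evaluation batch size configurable instead of hardcoded.
# The flag name below is hypothetical, not an existing option of train.py.
parser = argparse.ArgumentParser()
parser.add_argument("--eval_batch_size", type=int, default=2,
                    help="batch size used for the periodic evaluation pass")
opt = parser.parse_args()

# ...then, where train.py currently hardcodes batch_size=8, pass the flag instead:
# evaluate(model, path=valid_path, ..., batch_size=opt.eval_batch_size)
```

Lowering this value only affects the periodic evaluation pass, not training itself.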
Hi, may I ask you a question? I uploaded 5500 pictures to Colab for training, but it is very slow. Colab's GPU is a Tesla T4 and my PC has a GTX 1050, yet my PC trains faster than Colab. I don't know why. What is your training speed per batch? Thanks.
You could be IO- or CPU-bound when running on Colab. I get about 6 batches per second (batch size 32) on my 2080 Ti with a Threadripper CPU and NVMe storage.
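As a side note, one quick way to tell whether a Colab run is data-bound is to time the DataLoader on its own, with no forward pass at all; if that loop is already slow, the bottleneck is reading images (e.g. through the Google Drive mount), not the GPU. A minimal standalone sketch, with a made-up helper name:

```python
import time

from torch.utils.data import DataLoader, Dataset


def measure_loader_throughput(dataset: Dataset, batch_size: int = 8,
                              num_workers: int = 2, max_batches: int = 50) -> float:
    """Iterate a DataLoader without any model forward pass and return batches/s.

    If this number is already low, the run is IO/CPU bound and a faster GPU
    (T4 vs. GTX 1050) will not change the throughput you observe.
    """
    loader = DataLoader(dataset, batch_size=batch_size, num_workers=num_workers)
    start = time.time()
    n = 0
    for _ in loader:
        n += 1
        if n >= max_batches:
            break
    return n / (time.time() - start)
```

If this comes out low on Colab, copying the dataset from Google Drive to the VM's local disk before training (instead of reading each image through the Drive mount) is the usual fix.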
I closed this issue due to inactivity. Feel free to reopen for further discussion.
I tried to train in Colab, but Colab always runs out of memory shortly after I start training. After that, I get the error below and training stops.
```
---- Evaluating Model ----
Detecting objects:   0% 0/1 [00:00<?, ?it/s]
Traceback (most recent call last):
  File "train.py", line 160, in <module>
Traceback (most recent call last):
  File "/usr/lib/python3.6/multiprocessing/queues.py", line 240, in _feed
    send_bytes(obj)
  File "/usr/lib/python3.6/multiprocessing/connection.py", line 200, in send_bytes
    self._send_bytes(m[offset:offset + size])
  File "/usr/lib/python3.6/multiprocessing/connection.py", line 404, in _send_bytes
    self._send(header + buf)
  File "/usr/lib/python3.6/multiprocessing/connection.py", line 368, in _send
    n = write(self._handle, buf)
BrokenPipeError: [Errno 32] Broken pipe
    batch_size=8,
  File "/content/gdrive/My Drive/darknet/MyFinalYolo/test.py", line 48, in evaluate
    outputs = non_max_suppression(outputs, conf_thres=conf_thres, nms_thres=nms_thres)
  File "/content/gdrive/My Drive/darknet/MyFinalYolo/utils/utils.py", line 252, in non_max_suppression
    large_overlap = bbox_iou(detections[0, :4].unsqueeze(0), detections[:, :4]) > nms_thres
  File "/content/gdrive/My Drive/darknet/MyFinalYolo/utils/utils.py", line 205, in bbox_iou
    b1_x1, b1_y1, b1_x2, b1_y2 = box1[:, 0], box1[:, 1], box1[:, 2], box1[:, 3]
  File "/usr/local/lib/python3.6/dist-packages/torch/utils/data/_utils/signal_handling.py", line 66, in handler
    _error_if_any_worker_fails()
RuntimeError: DataLoader worker (pid 906) is killed by signal: Killed.
Detecting objects:   0% 0/1 [06:54<?, ?it/s]
```
I used 16 images to train detection of 2 objects. Here is my net config. I also tried changing batch and subdivisions to smaller and larger values, but it does not seem to change anything.

```
[net]
# Testing
batch=24
subdivisions=8
# Training
batch=8
subdivisions=8
width=416
height=416
channels=3
momentum=0.9
decay=0.0005
angle=0
saturation = 1.5
exposure = 1.5
hue=.1
```
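For what it's worth, in a darknet-style .cfg only uncommented `key=value` lines take effect, so with two `batch`/`subdivisions` pairs it is worth confirming which one is actually being read. A small standalone sketch of reading the `[net]` section (this is not the repository's own parser):

```python
def read_net_options(cfg_path: str) -> dict:
    """Return the key/value pairs of the [net] section of a darknet-style .cfg,
    skipping blank and commented lines. In this sketch, later duplicates
    overwrite earlier ones."""
    options = {}
    in_net = False
    with open(cfg_path) as f:
        for raw in f:
            line = raw.strip()
            if not line or line.startswith("#"):
                continue
            if line.startswith("["):
                in_net = (line == "[net]")
                continue
            if in_net and "=" in line:
                key, value = (part.strip() for part in line.split("=", 1))
                options[key] = value
    return options


if __name__ == "__main__":
    print(read_net_options("yolov3.cfg"))  # path is illustrative
```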
Is this a bug or is there any workaround for this problem?
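The "DataLoader worker (pid ...) is killed by signal: Killed" at the end of the traceback is typically Colab's out-of-memory killer terminating a loader worker process, so the usual workaround is to make the evaluation pass cheaper: lower the hardcoded batch_size in train.py (see the snippet earlier in the thread) and/or build the evaluation DataLoader with `num_workers=0`. A hedged sketch of the second option; the function and variable names are illustrative, not the ones used verbatim in test.py:

```python
from torch.utils.data import DataLoader, Dataset


def make_eval_loader(dataset: Dataset, batch_size: int = 2) -> DataLoader:
    """Build a memory-friendly evaluation loader.

    Smaller batches reduce peak memory per step, and num_workers=0 keeps image
    loading in the main process, so there are no worker subprocesses for the
    OOM killer to terminate.
    """
    return DataLoader(
        dataset,
        batch_size=batch_size,
        shuffle=False,
        num_workers=0,
        collate_fn=getattr(dataset, "collate_fn", None),
    )
```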