Putting the `time.sleep(10)` somewhere around here:

```python
# time.sleep(10)
dataset = ListDataset(train_path, augment=True, multiscale=opt.multiscale_training)
# time.sleep(10)
dataloader = torch.utils.data.DataLoader(
```
gives:

```
Traceback (most recent call last):
  File "/usr/local/lib/python3.6/dist-packages/torch/utils/data/dataloader.py", line 724, in _try_get_data
    data = self.data_queue.get(timeout=timeout)
  File "/usr/lib/python3.6/queue.py", line 173, in get
    self.not_empty.wait(remaining)
  File "/usr/lib/python3.6/threading.py", line 299, in wait
    gotit = waiter.acquire(True, timeout)
  File "/usr/local/lib/python3.6/dist-packages/torch/utils/data/_utils/signal_handling.py", line 66, in handler
    _error_if_any_worker_fails()
RuntimeError: DataLoader worker (pid 6164) is killed by signal: Killed.
```
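A quick way to narrow down a `DataLoader worker ... killed by signal: Killed` error is to run without worker subprocesses, so that any exception (or out-of-memory condition) surfaces directly in the main process instead of silently killing a worker; "Killed" generally means the OS terminated the process, which more often points at host RAM than at GPU memory. A minimal debugging sketch, assuming the same `dataset` and `opt` names as in `train.py`:

```python
# Debugging sketch (not the repo's exact call): load batches in the main process.
# With num_workers=0 there are no worker subprocesses to be killed, so the real
# error, if any, shows up as an ordinary traceback.
dataloader = torch.utils.data.DataLoader(
    dataset,
    batch_size=opt.batch_size,      # assumed: same opt namespace as train.py
    shuffle=True,
    num_workers=0,                  # no multiprocessing while debugging
    collate_fn=dataset.collate_fn,  # assumed: ListDataset exposes a collate_fn
)
```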
It seems more as if it runs out of GPU memory, although there is about a GB to spare (6541 MiB in use).
```
Traceback (most recent call last):
  File "/usr/local/lib/python3.6/dist-packages/torch/utils/data/dataloader.py", line 926, in __del__
    self._shutdown_workers()
  File "/usr/local/lib/python3.6/dist-packages/torch/utils/data/dataloader.py", line 906, in _shutdown_workers
    w.join()
  File "/usr/lib/python3.6/multiprocessing/process.py", line 124, in join
    res = self._popen.wait(timeout)
  File "/usr/lib/python3.6/multiprocessing/popen_fork.py", line 50, in wait
    return self.poll(os.WNOHANG if timeout == 0.0 else 0)
  File "/usr/lib/python3.6/multiprocessing/popen_fork.py", line 28, in poll
    pid, sts = os.waitpid(self.pid, flag)
  File "/usr/local/lib/python3.6/dist-packages/torch/utils/data/_utils/signal_handling.py", line 66, in handler
    _error_if_any_worker_fails()
RuntimeError: DataLoader worker (pid 19535) is killed by signal: Killed.
```
```
Traceback (most recent call last):
  File "train.py", line 107, in <module>
    loss, outputs = model(imgs, targets)
  File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py", line 547, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/x/PyTorch-YOLOv3/models.py", line 259, in forward
    x, layer_loss = module[0](x, targets, img_dim)
  File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py", line 547, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/x/PyTorch-YOLOv3/models.py", line 188, in forward
    ignore_thres=self.ignore_thres,
  File "/home/x/PyTorch-YOLOv3/utils/utils.py", line 317, in build_targets
    class_mask[b, best_n, gj, gi] = (pred_cls[b, best_n, gj, gi].argmax(-1) == target_labels).float()
RuntimeError: cuda runtime error (59) : device-side assert triggered at /pytorch/aten/src/THC/THCTensorMathCompareT.cuh:69
```
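A `device-side assert triggered` on an indexing line like the one in `build_targets` is usually an out-of-range index rather than a memory problem; a common cause with this kind of setup is a label file whose class id is greater than or equal to the number of classes in the config, and running with `CUDA_LAUNCH_BLOCKING=1` (or on CPU) gives a more precise error location. A hedged sanity check, assuming darknet-style label files (`class x_center y_center width height`, one object per line, coordinates normalized to [0, 1]) and a hypothetical `LABEL_DIR`:

```python
# Sanity-check label files before training: flag class ids outside
# [0, NUM_CLASSES) and coordinates outside [0, 1]. LABEL_DIR and NUM_CLASSES
# are assumptions -- adjust them to your dataset.
import glob
import os

LABEL_DIR = "data/custom/labels"  # hypothetical path to the .txt label files
NUM_CLASSES = 80                  # set to the class count in your .data/.cfg

for path in sorted(glob.glob(os.path.join(LABEL_DIR, "*.txt"))):
    with open(path) as f:
        for line_no, line in enumerate(f, start=1):
            fields = line.split()
            if not fields:
                continue
            cls = int(float(fields[0]))
            coords = [float(v) for v in fields[1:5]]
            if not 0 <= cls < NUM_CLASSES:
                print(f"{path}:{line_no}: class id {cls} out of range")
            if any(v < 0.0 or v > 1.0 for v in coords):
                print(f"{path}:{line_no}: coordinates outside [0, 1]: {coords}")
```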
Have the same problem.
Have the same problem.
Is this issue still relevant/occurring?
I am symlinking the images because of their large file sizes. I am new to this repo. Maybe a sleep() call will help, but I have to figure out where (is there a client/server here?).
cf. https://stackoverflow.com/questions/13921669/python-multiprocessing-socket-error-errno-111-connection-refused
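As far as I can tell there is no network client/server here: DataLoader workers are ordinary subprocesses that return batches over multiprocessing queues, so the "connection refused" fix from that StackOverflow answer may not carry over directly. If you still want to try the pause, the natural place is between building the dataset and building the DataLoader, roughly what the snippet at the top of this issue shows. A minimal sketch, assuming the same `train.py` names (whether the pause actually helps is unclear; a worker "Killed" by the OS more often points at memory pressure):

```python
import time

# Hypothetical placement of the workaround from the linked answer: pause after
# the dataset is built and before the DataLoader spawns its worker processes.
dataset = ListDataset(train_path, augment=True, multiscale=opt.multiscale_training)
time.sleep(10)
dataloader = torch.utils.data.DataLoader(
    dataset,
    batch_size=opt.batch_size,
    shuffle=True,
    num_workers=opt.n_cpu,
    collate_fn=dataset.collate_fn,
)
```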