Hello authors,
I'm running the training script (main.py) on 2 A40 GPUs, each with 46 GB of memory. I have reduced the batch size all the way down to 1. When I set num_workers to 0, the code just stops abruptly at epoch 0 with no traceback. If I set it to any non-zero value, it throws: "RuntimeError: DataLoader worker (pid(s) 32823, 32919) exited unexpectedly".
The common solution I found online is to set num_workers to 0, but as described above that doesn't solve the problem either. Could you please tell me what the issue is?
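To clarify what I mean by isolating the loader: here is a minimal sketch of running the DataLoader single-process, with a hypothetical DummyDataset standing in for the actual dataset used in main.py. With num_workers=0, loading runs in the main process, so any exception inside __getitem__ surfaces as a real traceback instead of the generic "worker exited unexpectedly" error; in my case the script dies silently even here, which is why I suspect something outside Python (e.g. an OOM kill).

```python
import torch
from torch.utils.data import DataLoader, Dataset

# Hypothetical placeholder dataset -- swap in the real dataset from main.py.
class DummyDataset(Dataset):
    def __len__(self):
        return 8

    def __getitem__(self, idx):
        # Return a (sample, label) pair; the real __getitem__ goes here.
        return torch.zeros(3), 0

# num_workers=0 keeps data loading in the main process, so a crash in
# __getitem__ raises directly instead of killing a worker subprocess.
loader = DataLoader(DummyDataset(), batch_size=1, num_workers=0)

for batch, label in loader:
    # If the process dies silently inside this loop with no Python
    # traceback, that usually points to the OS killing it (check
    # `dmesg` for OOM-killer messages) rather than a code bug.
    pass
```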
Edit: In my case, accumulate_grad_batches = 1. Should I change it to 4?