Hello authors,
I'm running the training script (main.py) on 2 A40 GPUs, each with 46 GB of memory. I have reduced the batch size all the way down to 1. When I set num_workers to 0, the code just stops abruptly at epoch 0 with no traceback. If I set it to any non-zero value, it throws: "RuntimeError: DataLoader worker (pid(s) 32823, 32919) exited unexpectedly".
The common solution I found online is to set num_workers to 0, but as described above that doesn't solve the problem either. Could you please tell me what the issue is?
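To clarify what I mean by isolating the loader: here is a minimal sketch of running the DataLoader single-process, with a hypothetical DummyDataset standing in for the actual dataset used in main.py. With num_workers=0, loading runs in the main process, so any exception inside __getitem__ surfaces as a real traceback instead of the generic "worker exited unexpectedly" error; in my case the script dies silently even here, which is why I suspect something outside Python (e.g. an OOM kill).

```python
import torch
from torch.utils.data import DataLoader, Dataset

# Hypothetical placeholder dataset -- swap in the real dataset from main.py.
class DummyDataset(Dataset):
    def __len__(self):
        return 8

    def __getitem__(self, idx):
        # Return a (sample, label) pair; the real __getitem__ goes here.
        return torch.zeros(3), 0

# num_workers=0 keeps data loading in the main process, so a crash in
# __getitem__ raises directly instead of killing a worker subprocess.
loader = DataLoader(DummyDataset(), batch_size=1, num_workers=0)

for batch, label in loader:
    # If the process dies silently inside this loop with no Python
    # traceback, that usually points to the OS killing it (check
    # `dmesg` for OOM-killer messages) rather than a code bug.
    pass
```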
Edit: In my case, accumulate_grad_batches = 1. Should I change it to 4?