yisuanwang closed 5 months ago
Same problem.
Hi,
I guess num_workers is set too high. Try reducing it.
I trained on a single GPU. I set num_workers to 1 and still get this error. Another question: does the code automatically download the dataset when I run main.py? (I get the error "No space left on device" when I run it.)
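On the "No space left on device" error: this thread does not say whether main.py downloads anything, but it is easy to check how much space is left before training. A minimal sketch using Python's standard `shutil.disk_usage`; the paths checked are assumptions, so point them at wherever your data and checkpoints are written. `/dev/shm` is included because on Linux PyTorch DataLoader workers typically pass tensors through shared memory, which is often mounted there and can fill up independently of the main disk.

```python
import shutil


def free_gib(path: str = ".") -> float:
    """Return free space, in GiB, on the filesystem that contains `path`."""
    usage = shutil.disk_usage(path)  # named tuple: total, used, free (in bytes)
    return usage.free / 2**30


def has_room(path: str, needed_gib: float) -> bool:
    """True if at least `needed_gib` GiB is free on the filesystem at `path`."""
    return free_gib(path) >= needed_gib


if __name__ == "__main__":
    # Hypothetical locations: the current working directory and the usual Linux
    # shared-memory mount. Adjust to where your dataset and outputs actually live.
    for p in (".", "/dev/shm"):
        try:
            print(f"{p}: {free_gib(p):.1f} GiB free")
        except FileNotFoundError:
            print(f"{p}: not present on this system")
```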
The problem doesn't seem to be related to num_workers; it was caused by the training data I generated myself. I have now trained successfully. Thanks.
Has anyone had this problem? It occurs during test training runs on the original partial Objaverse dataset. I'm using 8× A100 80 GB GPUs with the stock config, and a truncated version of valid_path.json that contains about 4,000 objects.
File "/data/xxx/.conda/zero123/lib/python3.9/site-packages/torch/utils/data/dataloader.py", line 1133, in _try_get_data
    raise RuntimeError('DataLoader worker (pid(s) {}) exited unexpectedly'.format(pids_str)) from e
RuntimeError: DataLoader worker (pid(s) 3360242, 3361527, 3362405) exited unexpectedly
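The "exited unexpectedly" error typically means a worker process died outright (for example a crash in native code, or being killed for using too much memory) rather than raising a Python exception, which DataLoader would otherwise re-raise with a traceback. Since an earlier reply traced a similar failure to self-generated training data, one way to narrow it down is to load every sample in the main process, which is also what setting `num_workers=0` on `torch.utils.data.DataLoader` does. The sketch below assumes only a map-style dataset (`__len__` and `__getitem__`); `find_bad_samples` and `ToyDataset` are hypothetical helpers for illustration, not part of the zero123 code.

```python
import sys
import traceback
from typing import Any, List, Protocol


class MapStyleDataset(Protocol):
    """Anything with __len__ and __getitem__, e.g. a torch.utils.data.Dataset."""

    def __len__(self) -> int: ...

    def __getitem__(self, idx: int) -> Any: ...


def find_bad_samples(dataset: MapStyleDataset, verbose: bool = False) -> List[int]:
    """Load every sample in the main process and return the indices that raise.

    With verbose=True the index is printed and flushed *before* each load, so if
    the whole process is killed (e.g. a segfault), the last line names the sample.
    """
    bad: List[int] = []
    for idx in range(len(dataset)):
        if verbose:
            print(f"loading sample {idx}", file=sys.stderr, flush=True)
        try:
            dataset[idx]
        except Exception:
            bad.append(idx)
            print(f"sample {idx} failed:", file=sys.stderr)
            traceback.print_exc()  # full traceback of the failing sample
    return bad


class ToyDataset:
    """Hypothetical stand-in dataset: sample 2 simulates a corrupt file."""

    def __init__(self) -> None:
        self.items = ["ok", "ok", None, "ok"]

    def __len__(self) -> int:
        return len(self.items)

    def __getitem__(self, idx: int) -> str:
        item = self.items[idx]
        if item is None:
            raise ValueError("corrupt sample")
        return item


if __name__ == "__main__":
    print(find_bad_samples(ToyDataset()))  # traceback for sample 2 on stderr, then prints [2]
```

If the scan finishes cleanly, the data itself is less likely to be the culprit, and memory or shared-memory limits become the next thing worth checking.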