firmanhadi21 opened 1 week ago
Hi @firmanhadi21 -- I sometimes get this issue on GPU servers that are running JupyterHub, and the workaround I use is running `sudo mount -o size=300000000000 -o nr_inodes=1000000 -o noatime,nodiratime -o remount /dev/shm`. I don't think this is a Fields of the World specific problem though. You could also check out https://discuss.pytorch.org/t/training-crashes-due-to-insufficient-shared-memory-shm-nn-dataparallel/26396/47 for some suggestions.
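If you want to confirm that shared memory really is the bottleneck, here is a minimal sketch (nothing Fields of the World specific) that reports the current size of the shared-memory tmpfs from Python:

```python
import shutil

# Report how much space the shared-memory tmpfs currently has.
total, used, free = shutil.disk_usage("/dev/shm")
print(f"/dev/shm: {free / 2**30:.1f} GiB free of {total / 2**30:.1f} GiB")
```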
@calebrob6 Thank you very much for your prompt reply. I tried your suggestion, but it failed; I think that is because I don't have administrator (sudo) access on the server.
As an immediate workaround, you can try manually setting `num_workers` in the DataLoader to 0, so that data loading happens in the main process (or to 1 to use a single worker). Are you able to use dataloaders in other PyTorch code with `num_workers > 1` on your server? (If so, then we likely need to fix something here.)
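For example, a minimal sketch with a dummy `TensorDataset` standing in for the actual dataset:

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Dummy dataset standing in for the actual dataset used in the example.
dataset = TensorDataset(torch.randn(16, 3, 256, 256))

# num_workers=0 loads data in the main process, so the DataLoader
# never uses /dev/shm to pass tensors between worker processes.
dataloader = DataLoader(dataset, batch_size=4, num_workers=0)

for (batch,) in dataloader:
    print(batch.shape)
```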
I tried to follow the Austria example but failed. The error message is:
RuntimeError: DataLoader worker (pid 4071827) is killed by signal: Bus error. It is possible that dataloader's workers are out of shared memory. Please try to raise your shared memory limit.
FYI, I have access to a server with 8 GPUs. Is it possible to run the inference using more than one of the available GPUs?