fieldsoftheworld / ftw-baselines

Code for running baseline models/experiments with the Fields of The World dataset
https://fieldsofthe.world/
MIT License

Out of memory #61

Open firmanhadi21 opened 1 week ago

firmanhadi21 commented 1 week ago

I tried to follow the Austria example but failed. The error message is RuntimeError: DataLoader worker (pid 4071827) is killed by signal: Bus error. It is possible that dataloader's workers are out of shared memory. Please try to raise your shared memory limit. FYI, I have access to a server with 8 GPUs. Is it possible to do the inference using more than one of the available GPUs?

calebrob6 commented 1 week ago

Hi @firmanhadi21 -- I sometimes get this issue on GPU servers that are running JupyterHub, and the workaround I use is running sudo mount -o size=300000000000 -o nr_inodes=1000000 -o noatime -o nodiratime -o remount /dev/shm. I don't think this is a Fields of the World-specific problem, though. You could also check out https://discuss.pytorch.org/t/training-crashes-due-to-insufficient-shared-memory-shm-nn-dataparallel/26396/47 for some suggestions.
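Before remounting, it can help to confirm how much shared memory is actually available. A minimal stdlib-only check (assuming a Linux system where /dev/shm exists, as on the servers discussed here):

```python
import shutil

# /dev/shm backs the shared-memory segments PyTorch DataLoader workers use
# to pass tensors to the main process; if it is small (e.g. Docker's 64 MB
# default), workers die with the "Bus error" seen above.
usage = shutil.disk_usage("/dev/shm")
print(f"shm total: {usage.total / 1e9:.1f} GB, free: {usage.free / 1e9:.1f} GB")
```

If this is running inside Docker, passing `--shm-size=16g` (or similar) to `docker run` avoids needing the remount at all.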

firmanhadi21 commented 1 week ago

@calebrob6 Thank you very much for your prompt reply. I tried your suggestion, but it failed -- I think because I don't have administrative (sudo) access on the server.

calebrob6 commented 1 week ago

As an immediate workaround, you can try manually setting num_workers in the dataloader to 0 (or 1) to make dataloading happen in the main process. Are you able to use dataloaders in other PyTorch code with num_workers>1 on your server (if so, then we likely need to fix something here)?
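The workaround above can be sketched like this; the synthetic dataset here is just an illustrative stand-in for the FTW dataloader, not the project's actual dataset class:

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Synthetic stand-in: 32 fake 4-band image chips.
dataset = TensorDataset(torch.randn(32, 4, 256, 256))

# num_workers=0 loads batches in the main process, so no /dev/shm space is
# needed to pass tensors between worker processes -- at the cost of slower,
# serialized data loading.
loader = DataLoader(dataset, batch_size=8, num_workers=0)

for (batch,) in loader:
    print(batch.shape)  # torch.Size([8, 4, 256, 256])
```

Raising num_workers back up once the shared-memory limit is fixed restores parallel loading.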