TorchDSP / torchsig

TorchSig is an open-source signal processing machine learning toolkit based on the PyTorch data handling pipeline.
MIT License

MDB_MAP_FULL trying to generate_sig53 #221

Closed anarkiwi closed 1 month ago

anarkiwi commented 1 year ago

What sort of resources are required to generate the dataset (or is there something else going on)? The torchsig repo and its "examples" directory are the only things on /local.

Per the README, I tried this on main (host is Ubuntu 22.04, 384GB RAM, 16T disk - ext4):

josh@worker01:/local$ python3 torchsig/scripts/generate_sig53.py --root=torchsig/examples --all=True

100%|██████████| 66250/66250 [59:50<00:00, 18.45it/s]
100%|██████████| 6625/6625 [04:08<00:00, 26.69it/s]
 72%|███████▉  | 237169/331250 [2:44:20<1:05:11, 24.05it/s]
Traceback (most recent call last):
  File "/local/torchsig/scripts/generate_sig53.py", line 73, in <module>
    main()
  File "/home/josh/.local/lib/python3.10/site-packages/click/core.py", line 1157, in __call__
    return self.main(*args, **kwargs)
  File "/home/josh/.local/lib/python3.10/site-packages/click/core.py", line 1078, in main
    rv = self.invoke(ctx)
  File "/home/josh/.local/lib/python3.10/site-packages/click/core.py", line 1434, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/home/josh/.local/lib/python3.10/site-packages/click/core.py", line 783, in invoke
    return __callback(*args, **kwargs)
  File "/local/torchsig/scripts/generate_sig53.py", line 62, in main
    generate(root, configs[:4])
  File "/local/torchsig/scripts/generate_sig53.py", line 31, in generate
    creator.create()
  File "/home/josh/.local/lib/python3.10/site-packages/torchsig/utils/writer.py", line 158, in create
    self.writer.write(batch)
  File "/home/josh/.local/lib/python3.10/site-packages/torchsig/utils/writer.py", line 118, in write
    txn.put(
lmdb.MapFullError: mdb_put: MDB_MAP_FULL: Environment mapsize limit reached
josh@worker01:/local$
josh@worker01:/local$ df -h .
Filesystem      Size  Used Avail Use% Mounted on
/dev/sda1        18T  4.8T   12T  29% /local
josh@worker01:/local$ free
               total        used        free      shared  buff/cache   available
Mem:       396137076     2059208    31594060       22688   362483808   391419648
Swap:              0           0           0
josh@worker01:/local$ uname -a
Linux worker01 6.2.0-33-generic #33~22.04.1-Ubuntu SMP PREEMPT_DYNAMIC Thu Sep 7 10:33:52 UTC 2 x86_64 x86_64 x86_64 GNU/Linux

nsbruce commented 9 months ago

Same error. I'll add that re-running the generation command doesn't resume from where it failed: the script skips a dataset entirely if its folder already exists.

Stborden-NIWC commented 9 months ago

I think this error is related to the number of workers in the generation process. It looks like OP is using 16 workers, and the run failed 72% of the way through generating the impaired dataset. I was originally using 24 workers and mine failed twice at 48% through the same dataset. I tried again using 1 worker and it made it all the way through (but took forever). I would guess it probably works with 8 or fewer workers? I can't say why using more workers causes the error, though. I'm also not sure why it only fails on this specific dataset generation.

Also, yeah, it's frustrating that it skips generation if the folder exists. It would be nice if it verified that the data is complete when the folder already exists.
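One way to get the behavior asked for above is to treat a dataset folder as finished only when a completion marker is present, not merely when the folder exists. The following is a minimal sketch of that idea using only the standard library. The marker name `.complete` and the helpers `needs_generation`, `mark_complete`, and `generate_if_needed` are hypothetical; this is not how torchsig's generation script works.

```python
from pathlib import Path
from typing import Callable

# Hypothetical sentinel file name; not something torchsig writes.
MARKER = ".complete"


def needs_generation(dataset_dir: Path) -> bool:
    """Return True unless the folder carries the completion marker."""
    return not (dataset_dir / MARKER).is_file()


def mark_complete(dataset_dir: Path) -> None:
    """Record that generation finished for this folder."""
    dataset_dir.mkdir(parents=True, exist_ok=True)
    (dataset_dir / MARKER).touch()


def generate_if_needed(dataset_dir: Path, generate: Callable[[Path], None]) -> bool:
    """Run generate(dataset_dir) unless a previous run finished.

    Returns True if generation ran. If generate() raises, no marker is
    written, so the next invocation tries again instead of skipping.
    """
    if not needs_generation(dataset_dir):
        return False
    generate(dataset_dir)
    mark_complete(dataset_dir)
    return True
```

With this pattern a run that dies partway through (for example on `MDB_MAP_FULL`) leaves the folder without a marker, so a re-run does not silently skip it. The sketch restarts the dataset from scratch; it does not resume mid-dataset, and the caller would still need to clear any partial output first.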

nsbruce commented 9 months ago

Changing the number of workers didn't change anything for me (for the record, I have 40 workers and it always fails at 57%).

Based on the error and some Stack Overflow posts, I tried changing the map_size value on line 89 of the writer (torchsig/utils/writer.py) from int(4e12) to int(8e12), and that worked. I tried 8e13, but it errored out saying it couldn't allocate enough space for the map. So ideally this should be a number that is at least as big as the dataset, but not larger than the system can map. I'm not sure how 4e12 worked for anyone.
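As a rough sanity check on why 8e12 was enough, the progress at the point of failure can be extrapolated to a full-run size. The sketch below assumes that the map fills roughly linearly with the number of records written and that the progress bar reflects records actually committed; neither is confirmed in this thread, and the differing failure points reported above (48%, 57%, 72%) suggest the relationship varies between setups. The function name `extrapolate_map_size` and the `headroom` parameter are illustrative only.

```python
def extrapolate_map_size(
    map_size_bytes: int,
    records_done: int,
    records_total: int,
    headroom: float = 1.25,
) -> int:
    """Estimate a map_size for the full run from a run that filled the map early.

    Assumes bytes written grow linearly with records written (an assumption,
    not something verified against torchsig or LMDB internals).
    """
    if not 0 < records_done <= records_total:
        raise ValueError("records_done must be in (0, records_total]")
    if headroom < 1.0:
        raise ValueError("headroom must be >= 1.0")
    projected = map_size_bytes * records_total / records_done
    return int(projected * headroom)


# Numbers from the original report: map_size 4e12, failure at 237169 of 331250.
original_map_size = int(4e12)
projected = extrapolate_map_size(original_map_size, 237169, 331250, headroom=1.0)
suggested = extrapolate_map_size(original_map_size, 237169, 331250)

print(f"projected full size: {projected:.3e} bytes")
print(f"with 25% headroom:   {suggested:.3e} bytes")
# A value like this would be passed as map_size, e.g. lmdb.open(path, map_size=suggested)
```

For the original report, both figures land above 4e12 and below 8e12, which is consistent with the doubled value working. The estimate is only as good as the linearity assumption, so it is better treated as a lower bound to round up from than as an exact requirement.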

MattCarrickPL commented 1 month ago

This issue is a year old. There have been significant changes to the codebase since then, including to dataset generation, so it's unclear whether this is still a problem. Closing for now.