ashawkey / torch-ngp

A pytorch CUDA extension implementation of instant-ngp (sdf and nerf), with a GUI.
MIT License

^C quit after training data load and tcmalloc large alloc #113

Closed: tcwhalen closed this issue 2 years ago

tcwhalen commented 2 years ago

I successfully trained on the fox dataset, and now I'm trying to train on the firekeeper dataset. I'm running the following command:

python main_nerf.py data/firekeeper --workspace trial_nerf --fp16 --cuda_ray

The data all loads successfully, but after the usual "large alloc" message, the script just quits with an automated ^C:

Loading train data: 100% 96/96 [00:17<00:00, 5.45it/s]
tcmalloc: large alloc 11498029056 bytes == 0x7f52590bc000 @ 0x7f568599a1e7 0x7f56152510ce 0x7f56152a7cf5 0x7f561535086d 0x7f561535117f 0x7f56153512d0 0x4ba22b 0x7f5615292944 0x58ebef 0x51ae13 0x5b41c5 0x58f49e 0x51837f 0x5b4a3e 0x4ba80a 0x7f5615292944 0x58ebef 0x51ae13 0x5b41c5 0x58f49e 0x51837f 0x5b4a3e 0x4ba80a 0x537e46 0x58ff66 0x51bbc5 0x5b41c5 0x604133 0x606e06 0x606ecc 0x609aa6
^C

I imagine this is occurring because the firekeeper dataset has more images than the fox dataset, but I'm not sure why it quits so abruptly or how to resolve it.
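For what it's worth, the allocation in the log is roughly what you'd expect if all 96 training images are held in CPU memory at full resolution. A quick back-of-the-envelope check (the float32 RGBA layout is my assumption about what the loader keeps in RAM):

# Rough sanity check; assumes float32 RGBA image buffers held in CPU RAM.
num_images = 96
alloc_bytes = 11_498_029_056                  # from the tcmalloc message
bytes_per_image = alloc_bytes / num_images    # ~114 MiB per image
pixels_per_image = bytes_per_image / (4 * 4)  # 4 channels * 4 bytes per float32
print(f"{bytes_per_image / 2**20:.0f} MiB/image, ~{pixels_per_image / 1e6:.1f} MP/image")

That works out to about 114 MiB and roughly 7.5 megapixels per image, so more (or larger) images than the fox dataset would push total CPU memory use well past what the fox run needed.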

A few notes:

Any insights are appreciated! Thanks for the cool implementation.

ashawkey commented 2 years ago

@tcwhalen Hi, this seems to be a CPU OOM. You could try manually modifying the downscale value in the dataset loader to downscale the images, or preprocessing the dataset to a lower resolution.
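If it helps, here is a minimal sketch of an offline preprocessing pass (not part of torch-ngp; the images/ plus transforms.json layout, the 0.5 factor, and the output paths are assumptions) that halves the image resolution and rescales the intrinsics to match:

# Hypothetical offline downscaling pass; not part of the repo.
# Assumes an instant-ngp style dataset: <root>/images/* and <root>/transforms.json.
import glob
import json
import os

import cv2

root = "data/firekeeper"   # dataset root (assumption)
scale = 0.5                # downscale factor (assumption)
out_dir = os.path.join(root, "images_half")
os.makedirs(out_dir, exist_ok=True)

for path in glob.glob(os.path.join(root, "images", "*")):
    img = cv2.imread(path, cv2.IMREAD_UNCHANGED)
    if img is None:
        continue
    img = cv2.resize(img, None, fx=scale, fy=scale, interpolation=cv2.INTER_AREA)
    cv2.imwrite(os.path.join(out_dir, os.path.basename(path)), img)

# The camera intrinsics must match the new resolution: scale fl_x, fl_y, cx, cy, w, h too.
with open(os.path.join(root, "transforms.json")) as f:
    meta = json.load(f)
for key in ("fl_x", "fl_y", "cx", "cy", "w", "h"):
    if key in meta:
        meta[key] *= scale
# (Point the per-frame "file_path" entries at the downscaled images as well.)
with open(os.path.join(root, "transforms_half.json"), "w") as f:
    json.dump(meta, f, indent=2)

Halving the resolution cuts per-image memory by roughly 4x, which should be enough to keep the data loading within CPU RAM here.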

tcwhalen commented 2 years ago

Thank you! This makes sense, and it's consistent with a similar quit I experienced while trying to export an unreasonably high-resolution mesh. I'll close this.