headless server session and multiple gpus

ttsesm commented 2 years ago

Hi guys,

Thanks for sharing your work. I've tried to run the code in a headless server session with multiple GPUs, but for some reason I am getting an out-of-memory error:

python scripts/run.py --scene lego
15:58:13 INFO     Loading NeRF dataset from
15:58:13 INFO       /home/ttsesm/dev/instant-ngp/data/nerf/nerf_synthetic/lego/transforms_val.json
15:58:13 INFO       /home/ttsesm/dev/instant-ngp/data/nerf/nerf_synthetic/lego/transforms_train.json
15:58:13 INFO       /home/ttsesm/dev/instant-ngp/data/nerf/nerf_synthetic/lego/transforms_test.json
15:58:13 SUCCESS  Loaded 400 images of size 800x800 after 0s
15:58:13 INFO       cam_aabb=[min=[0.5,0.5,0.5], max=[0.5,0.5,0.5]]
15:58:13 INFO     Loading network config from: /home/ttsesm/dev/instant-ngp/configs/nerf/base.json
15:58:13 INFO     GridEncoding:  Nmin=16 b=1.38191 F=2 T=2^19 L=16
Warning: FullyFusedMLP is not supported for the selected architecture 61. Falling back to CutlassMLP. For maximum performance, raise the target GPU architecture to 75+.
Warning: FullyFusedMLP is not supported for the selected architecture 61. Falling back to CutlassMLP. For maximum performance, raise the target GPU architecture to 75+.
15:58:13 INFO     Density model: 3--[HashGrid]-->32--[FullyFusedMLP(neurons=64,layers=3)]-->1
15:58:13 INFO     Color model:   3--[SphericalHarmonics]-->16+16--[FullyFusedMLP(neurons=64,layers=4)]-->3
15:58:13 INFO       total_encoding_params=12196240 total_network_params=9728
Training:   0%|                                                                                                                                                                                             | 0/100000 [00:00<?, ?step/s]
Traceback (most recent call last):
  File "/home/ttsesm/dev/instant-ngp/scripts/run.py", line 172, in <module>
    while testbed.frame():
RuntimeError: Could not allocate memory: CUDA Error: cudaMalloc(&rawptr, n_bytes+DEBUG_GUARD_SIZE*2) failed with error out of memory

Any idea what could be wrong? I've compiled the project with the GUI flag off (-DNGP_BUILD_WITH_GUI=OFF), and as you can see from the screenshot above, all my GPUs are fully available.

Thanks.

Tom94 commented 2 years ago

This program needs a lot of memory. What will probably be sufficient to fit into 11 GB is loading just the training images (lego/transforms_train.json) rather than all images/transforms in the folder.
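To make the saving concrete, here is a rough back-of-the-envelope sketch of the raw image footprint. The 400 images of size 800x800 come from the log above; the 100/100/200 train/val/test split is the usual NeRF synthetic layout, and the RGBA channel count and bytes-per-channel values are assumptions, since the thread does not say how the images are stored on the GPU. The helper name `image_bytes` is made up for this illustration.

```python
GIB = 1024 ** 3

# Assumed split sizes for the lego scene; the log only confirms the total of 400.
SPLITS = {"train": 100, "val": 100, "test": 200}


def image_bytes(n_images, width=800, height=800, channels=4, bytes_per_channel=2):
    """Raw size in bytes of n_images uncompressed images (hypothetical helper)."""
    return n_images * width * height * channels * bytes_per_channel


if __name__ == "__main__":
    # Compare three possible storage formats (all assumptions, not measured values).
    for label, bpc in (("8-bit", 1), ("half", 2), ("float", 4)):
        everything = image_bytes(sum(SPLITS.values()), bytes_per_channel=bpc)
        train_only = image_bytes(SPLITS["train"], bytes_per_channel=bpc)
        print(f"{label:>5}: all splits {everything / GIB:.2f} GiB, "
              f"train only {train_only / GIB:.2f} GiB")
```

Whatever the storage format, loading only the training split cuts the image footprint to a quarter under these assumptions. The estimate ignores the network, the hash grid, and training buffers.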

ttsesm commented 2 years ago

> This program needs a lot of memory. What will probably be sufficient to fit into 11 GB is loading just the training images (lego/transforms_train.json) rather than all images/transforms in the folder.

But I was able to load the same setup on another computer with even less memory, though with a newer GPU (2080) :thinking:

Can I somehow take advantage of the multi-gpu setup?

Tom94 commented 2 years ago

The reason you could load it on the newer GPU is that it supports efficient half-precision arithmetic on TensorCores. Older GPUs need to run in full precision to be efficient, which unfortunately increases memory usage by quite a bit.

As for multi-GPU support: there is none at this point in time, sorry.

ttsesm commented 2 years ago

Ok, I see. I've managed to have it running only with the training images as you suggested. Thanks.

Do you know approximately how much memory would be sufficient to extract a mesh at a resolution of 1024x1024x1024? In general, if I want to get the best result, which GPU would you suggest?

Would a Tesla V100 (32 GB) be a good start?
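For a sense of scale, a rough lower bound can be worked out by assuming the extraction evaluates one scalar density per cell of a dense 1024x1024x1024 grid and holds the whole grid in memory at once. Both points are assumptions, not something confirmed in this thread, and the helper name `dense_grid_bytes` is made up for the sketch.

```python
GIB = 1024 ** 3


def dense_grid_bytes(resolution, bytes_per_sample=4):
    """Bytes needed for one scalar per cell of a resolution^3 grid (hypothetical helper)."""
    return resolution ** 3 * bytes_per_sample


if __name__ == "__main__":
    # Full precision (4 bytes) versus half precision (2 bytes) per sample.
    for label, size in (("float32", 4), ("float16", 2)):
        print(f"1024^3 grid, {label}: {dense_grid_bytes(1024, size) / GIB:.0f} GiB")
```

Under these assumptions the grid alone takes 4 GiB in full precision, before counting the model, the training data, or the mesh itself.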

ttsesm commented 2 years ago

@Tom94 in a multi-GPU system, can I somehow choose which GPU is used to run the application?

pwais commented 2 years ago

@ttsesm use CUDA_VISIBLE_DEVICES=0 mycommand to let mycommand see only GPU 0. Note that the GPU numbering sometimes doesn't match the numbering in nvidia-smi. Usually it does, but not for me.
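A minimal sketch of the same idea from Python, in case the job is launched from a script. It also sets CUDA_DEVICE_ORDER=PCI_BUS_ID, a standard CUDA environment variable that makes device indices follow PCI bus order, which is the order nvidia-smi reports. The helper names `single_gpu_env` and `run_on_gpu` are made up for this example.

```python
import os
import subprocess


def single_gpu_env(gpu_index, base_env=None):
    """Return a copy of the environment that exposes only one GPU (hypothetical helper)."""
    env = dict(os.environ if base_env is None else base_env)
    # Enumerate devices in PCI bus order so the index matches nvidia-smi.
    env["CUDA_DEVICE_ORDER"] = "PCI_BUS_ID"
    # Hide every other GPU; the chosen one appears as device 0 to the child process.
    env["CUDA_VISIBLE_DEVICES"] = str(gpu_index)
    return env


def run_on_gpu(gpu_index, command):
    """Run command (a list of arguments) so that it sees only the chosen GPU."""
    return subprocess.run(command, env=single_gpu_env(gpu_index), check=True)
```

For example, calling run_on_gpu(1, ["python", "scripts/run.py", "--scene", "lego"]) from the instant-ngp checkout would run the command from the first post on the second GPU listed by nvidia-smi.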

Tom94 commented 2 years ago

Closing due to the (now) much lower memory usage & multi-gpu info in the FAQ

amughrabi commented 6 months ago

For the record, I encountered the same issue when I re-plugged the laptop charger. The only fix I found is to restart the laptop. For instance, the error appears when I run instant-ngp on a scene, unplug the charger, and plug it back in. It seems the OS stops or kills some services to save battery.

NVlabs / instant-ngp

headless server session and multiple gpus #85