ayaanzhaque / instruct-nerf2nerf

Instruct-NeRF2NeRF: Editing 3D Scenes with Instructions (ICCV 2023)
https://instruct-nerf2nerf.github.io/
MIT License

about multi-gpu acceleration #11

Closed flyingshan closed 1 year ago

flyingshan commented 1 year ago

Hello. Following the documentation, I set the flag "--pipeline.ip2p-device cuda:1" to use two GPUs when training the in2n model, but the training speed did not improve. Is this normal? Hoping for your advice!

ayaanzhaque commented 1 year ago

Hi!

I'm not sure why training speed did not improve. Could you share some screenshots of your nvidia-smi on the two GPUs as well as the training logs from nerfstudio? That'll help us figure it out.

flyingshan commented 1 year ago

I noticed that during training, the GPU-Util of the two GPUs alternates between 0% and 100%, which suggests that only one GPU is working at any given time. For example:

Time 1:

|===============================+======================+======================|
|   0  NVIDIA A100-PCI...  Off  | 00000000:36:00.0 Off |                    0 |
| N/A   33C    P0    61W / 250W |   9267MiB / 40960MiB |      0%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   1  NVIDIA A100-PCI...  Off  | 00000000:37:00.0 Off |                    0 |
| N/A   57C    P0   247W / 250W |  12997MiB / 40960MiB |    100%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+

Time 2:

|===============================+======================+======================|
|   0  NVIDIA A100-PCI...  Off  | 00000000:36:00.0 Off |                    0 |
| N/A   38C    P0   242W / 250W |  13231MiB / 40960MiB |     99%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   1  NVIDIA A100-PCI...  Off  | 00000000:37:00.0 Off |                    0 |
| N/A   48C    P0    63W / 250W |  12997MiB / 40960MiB |      0%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+

And here are some logs from nerfstudio:

Step (% Done)       Train Iter (time)    ETA (time)           Train Rays / Sec
-----------------------------------------------------------------------------------
30020 (100.07%)     1 s, 16.704 ms       23 h, 59 m, 40 s     277.02 K
30030 (100.10%)     961.287 ms           23 h, 59 m, 32 s     286.20 K
30040 (100.13%)     957.287 ms           23 h, 59 m, 22 s     296.96 K
30050 (100.17%)     953.794 ms           23 h, 59 m, 13 s     292.95 K
30060 (100.20%)     957.893 ms           23 h, 59 m, 3 s      281.41 K
30070 (100.23%)     953.860 ms           23 h, 58 m, 54 s     287.69 K
30080 (100.27%)     952.990 ms           23 h, 58 m, 44 s     285.33 K
30090 (100.30%)     957.117 ms           23 h, 58 m, 34 s     282.15 K
30100 (100.33%)     957.164 ms           23 h, 58 m, 25 s     289.42 K
30110 (100.37%)     960.392 ms           23 h, 58 m, 15 s     286.27 K

I installed in2n inside the official docker image of nerfstudio.
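For anyone who wants to check the same alternating pattern without reading nvidia-smi by eye, here is a minimal Python sketch that samples per-GPU utilization through nvidia-smi's CSV query mode. The helper names (parse_gpu_csv, sample_gpus, busy_gpus) and the 50% threshold are my own assumptions for illustration; they are not part of in2n or nerfstudio.

```python
import subprocess

# nvidia-smi's CSV query mode prints one line per GPU, e.g. "0, 99, 13231".
QUERY = [
    "nvidia-smi",
    "--query-gpu=index,utilization.gpu,memory.used",
    "--format=csv,noheader,nounits",
]


def parse_gpu_csv(text):
    """Turn the CSV text from nvidia-smi into a list of dicts, one per GPU."""
    rows = []
    for line in text.strip().splitlines():
        index, util, mem = (field.strip() for field in line.split(","))
        rows.append({"index": int(index), "util_pct": int(util), "mem_mib": int(mem)})
    return rows


def sample_gpus():
    """Run nvidia-smi once and return the parsed per-GPU readings."""
    out = subprocess.run(QUERY, capture_output=True, text=True, check=True).stdout
    return parse_gpu_csv(out)


def busy_gpus(rows, threshold_pct=50):
    """Return the indices of GPUs at or above the (assumed) busy threshold."""
    return [row["index"] for row in rows if row["util_pct"] >= threshold_pct]
```

Calling busy_gpus(sample_gpus()) in a loop with a short sleep while training runs should show which devices are active in each sample. If the result is almost always a single index that flips between 0 and 1, the two GPUs are taking turns rather than working simultaneously.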

ayaanzhaque commented 1 year ago

Ah yes, sorry: the two GPUs do only run alternately, so using a second GPU won't speed up training by running the two processes in parallel. The reason we say it can increase speed is that when a single GPU's utilization is near its maximum, overall training slows down, and spreading the work across two GPUs avoids that. Hopefully that makes sense.
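To make the reasoning above concrete, here is a toy timing model in Python. All numbers are hypothetical rather than measured, and the "contention factor" is only an assumed stand-in for the slowdown a single saturated GPU might see. It shows that alternating stages cost the sum of their times (not the maximum), so a second GPU helps only by removing that slowdown.

```python
def step_time_sequential(stage_a_ms, stage_b_ms):
    """Stages run one after another, so the per-step time is their sum."""
    return stage_a_ms + stage_b_ms


def step_time_overlapped(stage_a_ms, stage_b_ms):
    """Hypothetical fully parallel case: bounded by the slower stage."""
    return max(stage_a_ms, stage_b_ms)


# Hypothetical per-step costs in milliseconds (illustrative, not measured).
stage_a_ms = 600.0  # work done on the first device
stage_b_ms = 350.0  # work done on the second device (e.g. the ip2p device)

# Assumed slowdown when everything is packed onto one near-saturated GPU.
contention_factor = 1.25

one_gpu = step_time_sequential(stage_a_ms, stage_b_ms) * contention_factor
two_gpu_alternating = step_time_sequential(stage_a_ms, stage_b_ms)
two_gpu_overlapped = step_time_overlapped(stage_a_ms, stage_b_ms)

print(f"one GPU (with contention): {one_gpu} ms")
print(f"two GPUs, alternating:     {two_gpu_alternating} ms")
print(f"two GPUs, overlapped:      {two_gpu_overlapped} ms (not what happens here)")
```

Under these assumptions the alternating two-GPU case is faster than the contended single-GPU case, but it never reaches the overlapped figure. If the single GPU is not near its limit, the contention factor is close to 1 and the second GPU makes little difference, which would match the unchanged training speed reported in this issue.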