Closed flyingshan closed 1 year ago
Hi!
I'm not sure why training speed did not improve. Could you share some screenshots of your nvidia-smi on the two GPUs as well as the training logs from nerfstudio? That'll help us figure it out.
I noticed that during the training process, the GPU-Util of the two GPUs alternates between 0% and 100%, which suggests that only one GPU is working at any given time. For example:

Time 1:
```
|===============================+======================+======================|
|   0  NVIDIA A100-PCI...  Off  | 00000000:36:00.0 Off |                    0 |
| N/A   33C    P0    61W / 250W |   9267MiB / 40960MiB |      0%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   1  NVIDIA A100-PCI...  Off  | 00000000:37:00.0 Off |                    0 |
| N/A   57C    P0   247W / 250W |  12997MiB / 40960MiB |    100%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
```

Time 2:
```
|===============================+======================+======================|
|   0  NVIDIA A100-PCI...  Off  | 00000000:36:00.0 Off |                    0 |
| N/A   38C    P0   242W / 250W |  13231MiB / 40960MiB |     99%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   1  NVIDIA A100-PCI...  Off  | 00000000:37:00.0 Off |                    0 |
| N/A   48C    P0    63W / 250W |  12997MiB / 40960MiB |      0%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
```
And here are some logs from nerfstudio:
```
Step (% Done)      Train Iter (time)    ETA (time)         Train Rays / Sec
-----------------------------------------------------------------------------------
30020 (100.07%)    1 s, 16.704 ms       23 h, 59 m, 40 s   277.02 K
30030 (100.10%)    961.287 ms           23 h, 59 m, 32 s   286.20 K
30040 (100.13%)    957.287 ms           23 h, 59 m, 22 s   296.96 K
30050 (100.17%)    953.794 ms           23 h, 59 m, 13 s   292.95 K
30060 (100.20%)    957.893 ms           23 h, 59 m, 3 s    281.41 K
30070 (100.23%)    953.860 ms           23 h, 58 m, 54 s   287.69 K
30080 (100.27%)    952.990 ms           23 h, 58 m, 44 s   285.33 K
30090 (100.30%)    957.117 ms           23 h, 58 m, 34 s   282.15 K
30100 (100.33%)    957.164 ms           23 h, 58 m, 25 s   289.42 K
30110 (100.37%)    960.392 ms           23 h, 58 m, 15 s   286.27 K
```
I installed in2n inside the official docker image of nerfstudio.
Ah yes, sorry: the two processes do only run alternately, so putting them on separate GPUs won't speed up training through parallelism. The reason we say it can increase speed is that when a single GPU is near maximum utilization, sharing it between InstructPix2Pix and NeRF training slows overall training down; moving InstructPix2Pix to a second GPU avoids that contention. Hopefully that makes sense.
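To illustrate the point, here is a toy sketch (not the actual in2n code; all function names and the loop structure are simplified assumptions) of why a second GPU does not parallelize anything: the InstructPix2Pix edit and the NeRF training step run back-to-back in the same loop, so only one device is busy at any moment.

```python
# Toy sketch of the alternating in2n-style training loop.
# `edit_image` and `nerf_train_step` are stand-ins, not real in2n functions.

def edit_image(image, device):
    # Placeholder for an InstructPix2Pix edit running on `device`.
    return image + 1

def nerf_train_step(image, device):
    # Placeholder for one NeRF optimization step on `device`.
    return image * 2

def train(num_iters, edit_every, ip2p_device="cuda:1", train_device="cuda:0"):
    image = 0
    busy_log = []  # records which device is active at each sub-step
    for i in range(num_iters):
        if i % edit_every == 0:
            image = edit_image(image, ip2p_device)
            busy_log.append(ip2p_device)  # ip2p GPU busy, training GPU idle
        image = nerf_train_step(image, train_device)
        busy_log.append(train_device)     # training GPU busy, ip2p GPU idle
    return busy_log

log = train(4, 2)
# The log is strictly sequential: the two devices never work at the same
# time, matching the alternating 0%/100% GPU-Util seen in nvidia-smi.
```

The benefit of `--pipeline.ip2p-device cuda:1` is therefore about relieving memory and compute pressure on the primary GPU, not about concurrency.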
Hello! Following the documentation, I set the flag `--pipeline.ip2p-device cuda:1` to use two GPUs to train the in2n model, but I found that the training speed did not improve. Is this normal? Hoping for your advice!
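For reference, the invocation presumably looks something like the following (the data and checkpoint paths are placeholders, not taken from this thread):

```shell
# Hypothetical example; {DATA_DIR} and {LOAD_DIR} are placeholders for
# your processed dataset and the pretrained nerfacto checkpoint directory.
ns-train in2n \
  --data {DATA_DIR} \
  --load-dir {LOAD_DIR} \
  --pipeline.ip2p-device cuda:1  # run InstructPix2Pix on the second GPU
```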