Closed benjaminklein closed 3 years ago
only 4 out of 8 GPUs are used
The default launch command is python -m torch.distributed.launch --nproc_per_node=4. Changing --nproc_per_node=4 to --nproc_per_node=8 will make the program run on 8 GPUs.
NaN loss
Can you paste your screen log here? Also, can you check whether some videos are missing? If so, the log will contain the message: get raw video error, skip it.
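For the first point, here is a minimal sketch of building the launch command with the process count as a parameter instead of a hard-coded 4. The helper name build_launch_command and the script name train.py are placeholders for illustration, not names from the repository; only the -m torch.distributed.launch module and its --nproc_per_node flag come from the reply above.

```python
import sys


def build_launch_command(num_gpus, script, script_args=()):
    """Return the argv list for a single-node torch.distributed.launch run.

    num_gpus is supplied by the caller; on a machine with PyTorch installed,
    torch.cuda.device_count() is one way to obtain it.
    """
    if num_gpus < 1:
        raise ValueError("num_gpus must be at least 1")
    return [
        sys.executable, "-m", "torch.distributed.launch",
        f"--nproc_per_node={num_gpus}",  # one worker process per GPU
        script,
        *script_args,
    ]


# "train.py" is a hypothetical script name used only for this example.
cmd = build_launch_command(8, "train.py")
print(" ".join(cmd[1:]))
```

The returned list can be passed to subprocess.run(cmd) as is, which avoids shell quoting issues when extra script arguments are added.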
Thank you for the quick reply. Indeed, nproc_per_node was what determined n_gpu.
Here is the log:
05/01/2021 01:10:46 - INFO - ***** Running test *****
05/01/2021 01:10:46 - INFO - Num examples = 1000
05/01/2021 01:10:46 - INFO - Batch size = 16
05/01/2021 01:10:46 - INFO - Num steps = 63
05/01/2021 01:10:46 - INFO - ***** Running val *****
05/01/2021 01:10:46 - INFO - Num examples = 1000
05/01/2021 01:11:01 - INFO - ***** Running training *****
05/01/2021 01:11:01 - INFO - Num examples = 180000
05/01/2021 01:11:01 - INFO - Batch size = 128
05/01/2021 01:11:01 - INFO - Num steps = 14060
get raw video error, skip it.
get raw video error, skip it.
05/01/2021 01:17:49 - INFO - Epoch: 1/5, Step: 50/2812, Lr: 0.000000004, Loss: nan, Time/step: 8.152810
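When the log is long, the two symptoms in it (skipped videos and a NaN loss) can be counted automatically. The following is a rough sketch that assumes the log has been saved as text lines. The function name scan_log is made up for this example; the message text and the "Loss: nan" format are taken from the log above.

```python
import re

SKIP_MESSAGE = "get raw video error, skip it."
# Captures the value that follows "Loss:" up to the next comma or whitespace.
LOSS_PATTERN = re.compile(r"Loss:\s*([^\s,]+)")


def scan_log(lines):
    """Count skipped-video messages and collect lines that report a NaN loss."""
    skipped = 0
    nan_lines = []
    for line in lines:
        if SKIP_MESSAGE in line:
            skipped += 1
        match = LOSS_PATTERN.search(line)
        if match and match.group(1).lower() == "nan":
            nan_lines.append(line.strip())
    return skipped, nan_lines


sample = [
    "get raw video error, skip it.",
    "get raw video error, skip it.",
    "05/01/2021 01:17:49 - INFO - Epoch: 1/5, Step: 50/2812, "
    "Lr: 0.000000004, Loss: nan, Time/step: 8.152810",
]
skipped, nan_lines = scan_log(sample)
print(f"skipped videos: {skipped}, NaN loss lines: {len(nan_lines)}")
```

For a real run, the same function can be fed an open file object, for example scan_log(open("train.log")), where train.log is whatever file the screen output was redirected to.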
@benjaminklein The log contains the message: get raw video error, skip it.
This may be what caused the NaN loss, and filtering out the affected videos could solve the issue. I have updated the print (https://github.com/ArrowLuo/CLIP4Clip/blob/master/dataloaders/dataloader_msrvtt_retrieval.py#L286). Update your code to the latest version and rerun to find out which videos are missing, then filter them out, either by hard-coding the filter in __init__() or by modifying the CSV file.
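The CSV route could look roughly like the sketch below. It rests on several assumptions that should be checked against the actual data: that the CSV has a header row with a video_id column, that each video is stored as <video_id>.mp4 in a single directory, and that a missing or empty file is the cause of the error. A file that exists but cannot be decoded would not be caught by this check. The function names are made up for this example.

```python
import csv
import os


def split_rows_by_video(csv_path, video_dir, id_column="video_id", ext=".mp4"):
    """Split CSV rows into those whose video file exists and those without one.

    id_column and ext are assumptions about the data layout.
    """
    kept, missing = [], []
    with open(csv_path, newline="") as f:
        reader = csv.DictReader(f)
        fieldnames = reader.fieldnames
        for row in reader:
            path = os.path.join(video_dir, row[id_column] + ext)
            # Treat absent and zero-byte files as missing videos.
            if os.path.isfile(path) and os.path.getsize(path) > 0:
                kept.append(row)
            else:
                missing.append(row[id_column])
    return fieldnames, kept, missing


def write_rows(csv_path, fieldnames, rows):
    """Write the surviving rows to a new CSV file with the same header."""
    with open(csv_path, "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=fieldnames)
        writer.writeheader()
        writer.writerows(rows)
```

Writing the filtered rows to a new file rather than overwriting the original keeps the full annotation list available in case the missing videos are recovered later.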
Solved! Thank you @ArrowLuo
Hi,
Thanks for sharing the code. I'm getting a NaN loss in the first epoch, and although I have 8 GPUs, only 4 seem to be used.