NaN Loss and Only 4 GPUs out of 8 Are Used

benjaminklein commented 3 years ago

Hi,

Thanks for sharing the code. I'm getting a NaN loss in the first epoch, and although I have 8 GPUs, only 4 seem to be used.

ArrowLuo commented 3 years ago

1. Only 4 out of 8 GPUs are used: the default launch command is python -m torch.distributed.launch --nproc_per_node=4. Changing --nproc_per_node=4 to --nproc_per_node=8 will make the program run on 8 GPUs.
2. NaN loss: can you paste your screen log here? Also, can you check whether some videos are missing? If so, the log will contain the message "get raw video error, skip it."
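To make the first point concrete, here is a minimal sketch that builds the launch command with one process per GPU. Only python -m torch.distributed.launch and --nproc_per_node come from this thread; the helper name build_launch_command and the script name train.py are placeholders, not part of the repository.

```python
import sys


def build_launch_command(n_gpus, script, script_args=()):
    """Return the argv list that starts one training process per GPU.

    `script` and `script_args` are placeholders for the actual training
    script and its options; they are not taken from the repository.
    """
    if n_gpus < 1:
        raise ValueError("n_gpus must be at least 1")
    return [
        sys.executable, "-m", "torch.distributed.launch",
        f"--nproc_per_node={n_gpus}",  # one process per GPU
        script, *script_args,
    ]


if __name__ == "__main__":
    # 8 matches the machine described in this thread. On another machine,
    # torch.cuda.device_count() can supply the number instead.
    print(" ".join(build_launch_command(8, "train.py")))
```

The resulting list can be passed to subprocess.run, or printed and pasted into a shell.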

benjaminklein commented 3 years ago

Thank you for the quick reply. Indeed, the nproc_per_node setting was what limited n_gpu to 4.

Here is the log:


05/01/2021 01:10:46 - INFO -   ***** Running test *****

05/01/2021 01:10:46 - INFO -     Num examples = 1000

05/01/2021 01:10:46 - INFO -     Batch size = 16

05/01/2021 01:10:46 - INFO -     Num steps = 63

05/01/2021 01:10:46 - INFO -   ***** Running val *****

05/01/2021 01:10:46 - INFO -     Num examples = 1000

05/01/2021 01:11:01 - INFO -   ***** Running training *****

05/01/2021 01:11:01 - INFO -     Num examples = 180000

05/01/2021 01:11:01 - INFO -     Batch size = 128

05/01/2021 01:11:01 - INFO -     Num steps = 14060

get raw video error, skip it.

get raw video error, skip it.

05/01/2021 01:17:49 - INFO -   Epoch: 1/5, Step: 50/2812, Lr: 0.000000004, Loss: nan, Time/step: 8.152810


ArrowLuo commented 3 years ago

@benjaminklein The log contains "get raw video error, skip it." messages. These skipped videos may be what caused the NaN loss, and filtering them out could solve the issue. I have updated the print statement (https://github.com/ArrowLuo/CLIP4Clip/blob/master/dataloaders/dataloader_msrvtt_retrieval.py#L286). Update your code and rerun to find out which videos are lost, then filter them out, either by hard-coding the exclusion in __init__() or by modifying the CSV file.
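For the CSV route, a rough sketch of the filtering step could look like the following. Everything specific here is an assumption rather than the repository's actual layout: the column name video_id, the .mp4 extension, the function names, and the readability check, which only tests that the file exists and is non-empty. A stricter check that actually decodes a frame can be passed in through is_readable.

```python
import csv
import os


def default_is_readable(path):
    # Cheap stand-in check: the file exists and is not empty.
    # It does not prove that the video can be decoded.
    return os.path.isfile(path) and os.path.getsize(path) > 0


def filter_csv(csv_in, csv_out, video_dir, id_column="video_id",
               ext=".mp4", is_readable=default_is_readable):
    """Copy csv_in to csv_out, dropping rows whose video is unreadable.

    Returns (number of rows kept, list of dropped video ids).
    `id_column` and `ext` are assumed names, adjust them to your data.
    """
    kept, dropped = 0, []
    with open(csv_in, newline="") as fin, \
            open(csv_out, "w", newline="") as fout:
        reader = csv.DictReader(fin)
        writer = csv.DictWriter(fout, fieldnames=reader.fieldnames)
        writer.writeheader()
        for row in reader:
            path = os.path.join(video_dir, row[id_column] + ext)
            if is_readable(path):
                writer.writerow(row)
                kept += 1
            else:
                dropped.append(row[id_column])
    return kept, dropped
```

Pointing the training run at the filtered CSV then avoids the skipped videos without touching the dataloader code.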

benjaminklein commented 3 years ago

Solved! Thank you @ArrowLuo

ArrowLuo / CLIP4Clip, issue #4