ashleve / lightning-hydra-template

PyTorch Lightning + Hydra. A very user-friendly template for ML experimentation. ⚡🔥⚡

wandb log contains duplicated logs #505

Open chanwkimlab opened 1 year ago

chanwkimlab commented 1 year ago

I am using this repo with the wandb logger. However, when checking the logs on the wandb website, I noticed that there are many duplicated lines. Can you help me track down the cause of this issue?

```
Epoch 0/9  ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 0/467 0:00:00 • -:--:-- 0.00it/s loss: nan v_num: uhkz val/loss: 4.883 val/criteria: -4.883 val/criteria_best: -4.883
Epoch 0/9  ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 0/467 0:00:00 • -:--:-- 0.00it/s loss: nan v_num: uhkz val/loss: 4.883 val/criteria: -4.883 val/criteria_best: -4.883
Epoch 0/9  ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 0/467 0:00:00 • -:--:-- 0.00it/s loss: nan v_num: uhkz val/loss: 4.883 val/criteria: -4.883 val/criteria_best: -4.883
Epoch 0/9  ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 0/467 0:00:00 • -:--:-- 0.00it/s loss: nan v_num: uhkz val/loss: 4.883 val/criteria: -4.883 val/criteria_best: -4.883
Epoch 0/9  ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 0/467 0:00:00 • -:--:-- 0.00it/s loss: nan v_num: uhkz val/loss: 4.883 val/criteria: -4.883 val/criteria_best: -4.883
Epoch 9, global step 3730: 'val/criteria' was not in top 1
Epoch 0/9  ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 0/467 0:00:00 • -:--:-- 0.00it/s loss: nan v_num: uhkz val/loss: 4.883 val/criteria: -4.883 val/criteria_best: -4.883
`Trainer.fit` stopped: `max_epochs=10` reached.
Restoring states from the checkpoint path at /projects/leelab2/chanwkim/dermatology_datasets/logs/train/runs/2023-01-14_21-21-44/checkpoints/epoch_004.ckpt
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0,1,2,3]
Loaded model weights from checkpoint at /projects/leelab2/chanwkim/dermatology_datasets/logs/train/runs/2023-01-14_21-21-44/checkpoints/epoch_004.ckpt
Epoch 0/9  ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 0/467 0:00:00 • -:--:-- 0.00it/s loss: nan v_num: uhkz val/loss: 4.883 val/criteria: -4.883 val/criteria_best: -4.883
Epoch 0/9  ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 0/467 0:00:00 • -:--:-- 0.00it/s loss: nan v_num: uhkz val/loss: 4.883 val/criteria: -4.883 val/criteria_best: -4.883
Epoch 0/9  ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 0/467 0:00:00 • -:--:-- 0.00it/s loss: nan v_num: uhkz val/loss: 4.883 val/criteria: -4.883 val/criteria_best: -4.883
Epoch 0/9  ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 0/467 0:00:00 • -:--:-- 0.00it/s loss: nan v_num: uhkz val/loss: 4.883 val/criteria: -4.883 val/criteria_best: -4.883
Epoch 0/9  ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 0/467 0:00:00 • -:--:-- 0.00it/s loss: nan v_num: uhkz val/loss: 4.883 val/criteria: -4.883 val/criteria_best: -4.883
Epoch 0/9  ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 0/467 0:00:00 • -:--:-- 0.00it/s loss: nan v_num: uhkz val/loss: 4.883 val/criteria: -4.883 val/criteria_best: -4.883
Epoch 0/9  ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 0/467 0:00:00 • -:--:-- 0.00it/s loss: nan v_num: uhkz val/loss: 4.883 val/criteria: -4.883 val/criteria_best: -4.883
Epoch 0/9  ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 0/467 0:00:00 • -:--:-- 0.00it/s loss: nan v_num: uhkz val/loss: 4.883 val/criteria: -4.883 val/criteria_best: -4.883
Epoch 0/9  ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 0/467 0:00:00 • -:--:-- 0.00it/s loss: nan v_num: uhkz val/loss: 4.883 val/criteria: -4.883 val/criteria_best: -4.883
Epoch 0/9  ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 0/467 0:00:00 • -:--:-- 0.00it/s loss: nan v_num: uhkz val/loss: 4.883 val/criteria: -4.883 val/criteria_best: -4.883
Epoch 0/9  ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 0/467 0:00:00 • -:--:-- 0.00it/s loss: nan v_num: uhkz val/loss: 4.883 val/criteria: -4.883 val/criteria_best: -4.883
Epoch 0/9  ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 0/467 0:00:00 • -:--:-- 0.00it/s loss: nan v_num: uhkz val/loss: 4.883 val/criteria: -4.883 val/criteria_best: -4.883
Epoch 0/9  ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 0/467 0:00:00 • -:--:-- 0.00it/s loss: nan v_num: uhkz val/loss: 4.883 val/criteria: -4.883 val/criteria_best: -4.883
Epoch 0/9  ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 0/467 0:00:00 • -:--:-- 0.00it/s loss: nan v_num: uhkz val/loss: 4.883 val/criteria: -4.883 val/criteria_best: -4.883
Epoch 0/9  ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 0/467 0:00:00 • -:--:-- 0.00it/s loss: nan v_num: uhkz val/loss: 4.883 val/criteria: -4.883 val/criteria_best: -4.883
```
ashleve commented 1 year ago

The cause might be a separate training process running on each of your GPUs. I suspect that for 4 devices you will get 4 times as many logs.

I'm not knowledgeable about how wandb handles logging in a multi-GPU setup, so I can't really help.
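
If it is one process per GPU, a common mitigation is to restrict console output to the rank-zero process. Below is a minimal sketch, assuming `pytorch_lightning` is installed (newer Lightning releases expose the same helpers under `lightning.pytorch.utilities`); the messages and the `print_summary` function are illustrative, not code from this template:

```python
from pytorch_lightning.utilities import rank_zero_info, rank_zero_only

# Emitted once, on global rank 0 only, instead of once per DDP process.
rank_zero_info("Best checkpoint restored, starting test loop ...")


@rank_zero_only
def print_summary(metrics: dict) -> None:
    # Body runs on rank 0 only; on every other rank the call is a no-op.
    for name, value in metrics.items():
        print(f"{name}: {value}")
```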

tlwzzy commented 1 year ago

> The cause might be a separate training process running on each of your GPUs. I suspect that for 4 devices you will get 4 times as many logs.
>
> I'm not knowledgeable about how wandb handles logging in a multi-GPU setup, so I can't really help.

I'm using the wandb logger on a single GTX 1080 Ti and still see the same issue. I'm trying to track down the cause.
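
One way to narrow it down is to tag every console record with the process id and local rank, so the wandb console log shows whether the duplicates come from several processes or from a single one. A minimal debugging sketch, assuming Python >= 3.8; the tag format and the reconfiguration call are my own, not part of the template:

```python
import logging
import os

# Prefix every console record with the process id and (if set) the DDP
# local rank, to see whether duplicated lines come from different
# processes or from one process emitting each record more than once.
tag = f"[pid {os.getpid()} rank {os.environ.get('LOCAL_RANK', '?')}]"
logging.basicConfig(
    level=logging.INFO,
    format=f"{tag} %(levelname)s %(name)s: %(message)s",
    force=True,  # replace any handlers configured earlier (Python >= 3.8)
)
logging.getLogger(__name__).info("logging reconfigured for duplicate-log debugging")
```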