val loss in distribute training

LiuSiQi-TJ commented 1 year ago

I am using the LibriMix dataset to train DCCRN on 8 GPUs, with early stopping enabled in the conf. The model always stops at a very early stage, around 10 or 20 epochs. In the log, I see that the val loss is calculated separately by the different GPUs, while early stopping is applied only by GPU 0, which I think is the reason for the very early stop. The log is as follows:

[rank: 5] Metric val_loss improved by 0.433 >= min_delta = 0.0. New best score: -11.178
[rank: 0] Metric val_loss improved by 0.333 >= min_delta = 0.0. New best score: -11.104
[rank: 7] Metric val_loss improved by 0.530 >= min_delta = 0.0. New best score: -10.551
[rank: 4] Metric val_loss improved by 0.408 >= min_delta = 0.0. New best score: -10.931
[rank: 1] Metric val_loss improved by 0.287 >= min_delta = 0.0. New best score: -10.971
[rank: 3] Metric val_loss improved by 0.415 >= min_delta = 0.0. New best score: -11.321
[rank: 2] Metric val_loss improved by 0.418 >= min_delta = 0.0. New best score: -10.858
[rank: 6] Metric val_loss improved by 0.504 >= min_delta = 0.0. New best score: -11.375
Epoch 2, global step 1587: 'val_loss' reached -11.10351 (best -11.10351),
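
To make the suspicion concrete, here is a minimal, library-free sketch using the per-rank scores from the log above. It compares the value rank 0 sees with the mean over all ranks, which is what a cross-process reduction would give. The helpers `all_reduce_mean` and `should_stop` are hypothetical stand-ins written for illustration; they are not Asteroid or Lightning code, and the stopping rule is a simplified patience check rather than the real callback.

```python
from statistics import mean

# "New best score" reported by each rank at the same epoch (copied from the log).
per_rank_val_loss = {
    5: -11.178, 0: -11.104, 7: -10.551, 4: -10.931,
    1: -10.971, 3: -11.321, 2: -10.858, 6: -11.375,
}


def all_reduce_mean(values):
    """Hypothetical stand-in for a cross-process mean reduction."""
    return mean(list(values))


def should_stop(history, patience, min_delta=0.0):
    """Simplified patience-based early stopping for a loss that should decrease."""
    best = float("inf")
    wait = 0
    for loss in history:
        if best - loss > min_delta:
            # Improvement: remember the new best and reset the counter.
            best, wait = loss, 0
        else:
            wait += 1
            if wait >= patience:
                return True
    return False


rank0_val_loss = per_rank_val_loss[0]
synced_val_loss = all_reduce_mean(per_rank_val_loss.values())
spread = max(per_rank_val_loss.values()) - min(per_rank_val_loss.values())

print(f"rank 0 only : {rank0_val_loss:.3f}")
print(f"mean of all : {synced_val_loss:.3f}")
print(f"rank spread : {spread:.3f}")
```

The spread between ranks is larger than the per-epoch improvements in the log, so a decision based on a single rank's shard could plausibly be noisier than one based on the reduced value. In PyTorch Lightning, the usual way to log a reduced metric is `self.log("val_loss", loss, sync_dist=True)`; whether the installed versions already do this is not established in this thread.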

LiuSiQi-TJ commented 1 year ago

I set CUDA_VISIBLE_DEVICES = 0,1,2,3,4,5,6,7 in run.sh. Did I do something wrong?

mpariente commented 1 year ago

Hello,

I would say you did not do anything wrong. Which version of Lightning are you using?

asteroid-team / asteroid

val loss in distribute training #674