I am training DCCRN on the LibriMix dataset with 8 GPUs.
Early stopping is enabled in the conf.
The model always stops at a very early stage, after around 10 or 20 epochs.
In the log I see that val_loss is calculated separately on each GPU and early stopping is applied only on GPU 0, which I think is the reason for the very early stop. The log is as follows:
[rank: 5] Metric val_loss improved by 0.433 >= min_delta = 0.0. New best score: -11.178
[rank: 0] Metric val_loss improved by 0.333 >= min_delta = 0.0. New best score: -11.104
[rank: 7] Metric val_loss improved by 0.530 >= min_delta = 0.0. New best score: -10.551
[rank: 4] Metric val_loss improved by 0.408 >= min_delta = 0.0. New best score: -10.931
[rank: 1] Metric val_loss improved by 0.287 >= min_delta = 0.0. New best score: -10.971
[rank: 3] Metric val_loss improved by 0.415 >= min_delta = 0.0. New best score: -11.321
[rank: 2] Metric val_loss improved by 0.418 >= min_delta = 0.0. New best score: -10.858
[rank: 6] Metric val_loss improved by 0.504 >= min_delta = 0.0. New best score: -11.375
Epoch 2, global step 1587: 'val_loss' reached -11.10351 (best -11.10351),
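
To make the mismatch concrete, here is a minimal sketch in plain Python (no training framework needed) that averages the per-rank values from the log above, the way a mean all-reduce across GPUs would. The helper name `reduce_mean` is hypothetical and only for illustration. As far as I understand, passing `sync_dist=True` to `self.log("val_loss", ...)` in PyTorch Lightning performs this kind of reduction, but whether that fits this recipe is an assumption on my side.

```python
from statistics import mean

# Per-rank val_loss values copied from the log above (epoch 2).
per_rank_val_loss = {
    5: -11.178, 0: -11.104, 7: -10.551, 4: -10.931,
    1: -10.971, 3: -11.321, 2: -10.858, 6: -11.375,
}


def reduce_mean(values_by_rank):
    """Average one scalar per rank, as a mean all-reduce would (hypothetical helper)."""
    return mean(values_by_rank.values())


# Value every rank would monitor if the metric were synced.
synced = reduce_mean(per_rank_val_loss)

# How far apart the unsynced per-rank values are.
spread = max(per_rank_val_loss.values()) - min(per_rank_val_loss.values())

print(f"rank 0 only:         {per_rank_val_loss[0]:.3f}")
print(f"mean over ranks:     {synced:.3f}")
print(f"spread across ranks: {spread:.3f}")
```

The checkpoint line reports -11.10351, which matches the rank 0 value rather than the mean over ranks. Note that averaging per-rank means only equals the true validation mean when every rank sees the same number of samples.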