ML4GW / aframev2

Detecting binary black hole mergers in LIGO with neural networks

Validation score inconsistency when using DDP training #124

Closed: EthanMarx closed this issue 2 months ago

EthanMarx commented 6 months ago

When training and validating with multiple GPUs, I have been unable to reproduce the validation scores produced by single-GPU training runs.

I suspect this is in part due to how we split validation amongst GPUs.

Early on in adopting Lightning, I ran into the training process hanging at the end of the first validation step because the validation dataloaders had different lengths on different GPUs. This was due in part to the (relatively) complex way we go about splitting validation segments. I resolved it in an extremely hacky way by truncating every GPU's validation dataloader to the minimum length across GPUs (sketched below). This is certainly a possible culprit for the disparities in validation scores when training with multiple GPUs.

I propose that we simplify how validation data is split amongst GPUs, even at the cost of a decrease in data entropy / quantity.
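For reference, a minimal sketch of the kind of length-capping workaround described above, assuming a `torch.distributed` process group is already initialized (the function name and the NCCL/CUDA assumption are mine, not the repo's actual code):

```python
import torch
import torch.distributed as dist


def global_min_batches(num_batches: int) -> int:
    """Cap each rank's validation loop at the shortest dataloader.

    Without this, ranks with longer dataloaders keep issuing DDP
    collectives that the shorter ranks never join, and validation hangs.
    """
    if not (dist.is_available() and dist.is_initialized()):
        return num_batches
    # NCCL requires a CUDA tensor; use a CPU tensor for the gloo backend instead
    length = torch.tensor(num_batches, device="cuda")
    dist.all_reduce(length, op=dist.ReduceOp.MIN)
    return int(length.item())
```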

EthanMarx commented 5 months ago

One way to verify / debug this would be to call trainer.validate() with the same model weights while specifying different numbers of GPUs via CUDA_VISIBLE_DEVICES, as in the sketch below.
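Hypothetically, such a check could look like the following, where `AframeModel`, `AframeDataModule`, and the checkpoint path are placeholders standing in for the project's actual LightningModule, LightningDataModule, and saved weights:

```python
# Run once with CUDA_VISIBLE_DEVICES=0 and again with CUDA_VISIBLE_DEVICES=0,1,
# then compare the reported validation metrics between the two runs.
import os

from lightning.pytorch import Trainer

ckpt = "path/to/checkpoint.ckpt"  # identical weights for every run
devices = len(os.environ.get("CUDA_VISIBLE_DEVICES", "0").split(","))

model = AframeModel()            # placeholder LightningModule
datamodule = AframeDataModule()  # placeholder LightningDataModule

trainer = Trainer(accelerator="gpu", devices=devices, strategy="ddp")
# ckpt_path restores the saved weights before validation runs
metrics = trainer.validate(model, datamodule=datamodule, ckpt_path=ckpt)
print(metrics)
```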

EthanMarx commented 2 months ago

Closing, since I'm realizing there will always be some sort of discrepancy with the way we do timeslides.

There's no (easy) way to distribute the exact same timeslides across an arbitrary number of GPUs, as the toy example below illustrates.
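A hypothetical illustration of that point: if a fixed pool of time shifts is split round-robin across ranks and each rank is truncated to the shortest dataloader, the set of shifts that actually get evaluated changes with the world size (the partitioning scheme here is illustrative, not the repo's actual one):

```python
# Placeholder pool of candidate time shifts
shifts = list(range(10))


def evaluated_shifts(world_size: int) -> list[int]:
    # Round-robin assignment of shifts to ranks
    per_rank = [shifts[rank::world_size] for rank in range(world_size)]
    # Truncate every rank to the shortest assignment, as in the workaround above
    n = min(len(assigned) for assigned in per_rank)
    return sorted(s for assigned in per_rank for s in assigned[:n])


print(evaluated_shifts(1))  # all 10 shifts are evaluated
print(evaluated_shifts(3))  # only 9 shifts survive the truncation
```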