Closed EthanMarx closed 2 months ago
One way to verify / debug this would be to call trainer.validate() with the same model weights while specifying different numbers of GPUs via CUDA_VISIBLE_DEVICES.
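Concretely, that check might look something like this (hypothetical entrypoint and flags, shown only to illustrate the idea; a real run would use the project's actual CLI):

```shell
# Validate the same checkpoint under different device counts.
# If validation splitting were deterministic across world sizes,
# the reported scores should match exactly.
CUDA_VISIBLE_DEVICES=0 python validate.py --ckpt last.ckpt --devices 1
CUDA_VISIBLE_DEVICES=0,1 python validate.py --ckpt last.ckpt --devices 2
```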
Closing, since I'm realizing there will always be some discrepancy in the way we do timeslides: there's no (easy) way to distribute the exact same timeslides across an arbitrary number of GPUs.
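A toy illustration of the problem (not the actual aframe code): if each rank builds timeslides only from the background segments it owns, the set of analyzed (segment, segment, shift) combinations depends on the world size, so single-GPU and multi-GPU runs estimate the background from different data.

```python
def local_timeslides(segments, world_size, max_shift=2):
    """Round-robin segments across ranks, then pair each rank's
    segments with every shift of its *local* neighbors only."""
    pairs = set()
    for rank in range(world_size):
        shard = segments[rank::world_size]
        for a in shard:
            for b in shard:
                if a != b:
                    for shift in range(1, max_shift + 1):
                        pairs.add((a, b, shift))
    return pairs

segments = list(range(6))
one_gpu = local_timeslides(segments, world_size=1)
two_gpu = local_timeslides(segments, world_size=2)
# Cross-shard pairs are never analyzed with 2 GPUs, so the
# 2-GPU set is a strict subset of the 1-GPU set.
print(len(one_gpu), len(two_gpu))  # → 60 24
```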
When training and validating with multiple GPUs, I have been unable to reproduce the validation scores produced by single-GPU training runs.
I suspect this is in part due to how we split validation amongst GPUs.
Early in our adoption of Lightning, I encountered the training process hanging at the end of the first validation step because the validation dataloaders had different lengths on different GPUs. That was partly due to the (relatively) complex way we go about splitting validation segments. I resolved it in an extremely hacky way by truncating every GPU's validation dataloader to the minimum length across GPUs. This is certainly a possible culprit for the disparities in validation scores when training with multi-GPU.
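The length-equalization hack amounts to something like the following sketch (hypothetical helper names; in real distributed code the per-rank lengths would be gathered with a collective such as torch.distributed.all_reduce with ReduceOp.MIN):

```python
from itertools import islice

def equalize(loader_lengths):
    """Return the number of batches every rank should run:
    the global minimum, so no rank hangs waiting on collectives."""
    return min(loader_lengths)

def truncated(loader, n_steps):
    """Yield only the first n_steps batches from a rank's loader;
    any batches beyond that are silently dropped."""
    return islice(loader, n_steps)

# e.g. three ranks whose loaders have 10, 8 and 9 batches:
steps = equalize([10, 8, 9])
print(steps)  # → 8; ranks 0 and 2 silently skip their extra batches
```

The hang goes away, but the dropped batches mean the validated data depends on how the segments happened to be split, which is exactly the kind of nondeterminism that could shift the scores.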
I propose that we simplify how validation data is split amongst GPUs, even at the cost of a decrease in data entropy / quantity.
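One possible simplification along these lines (a sketch, not the project's actual API): split the segments contiguously and drop the remainder, so every rank gets exactly the same number of segments and the loaders have identical lengths by construction.

```python
def split_segments(segments, world_size, rank):
    """Contiguous even split; the remainder is discarded so that
    all ranks see the same number of segments."""
    per_rank = len(segments) // world_size
    start = rank * per_rank
    return segments[start : start + per_rank]

segments = list(range(10))
shards = [split_segments(segments, world_size=3, rank=r) for r in range(3)]
print(shards)  # → [[0, 1, 2], [3, 4, 5], [6, 7, 8]]; segment 9 is dropped
```

The cost is the discarded remainder (less data), but the split is deterministic and no length-equalization hack is needed.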