Closed jataylo closed 1 year ago
NaNs are produced in the CvT model at large device counts (>64 GPUs) due to the scaled LR. This PR disables LR scaling by default for benchmarking purposes.
Previously we reduced the overall LR to alleviate this issue (https://github.com/facebookresearch/FAMBench/commit/58f68431660fd84d1b7fdcaa4b6925dc12fb3bd3), but the scaled learning rate still becomes too large when the world size is large.
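For context, the usual linear LR scaling rule multiplies the base learning rate by the world size, so at 64+ GPUs the effective LR can grow large enough to destabilize training. A minimal sketch of the rule and an opt-in flag (function and flag names here are illustrative, not the actual FAMBench code):

```python
import argparse

def effective_lr(base_lr: float, world_size: int, scale_lr: bool) -> float:
    """Linear LR scaling rule: lr = base_lr * world_size.
    At world_size > 64 this inflates the LR enough to produce NaNs,
    hence scaling is disabled unless explicitly requested."""
    return base_lr * world_size if scale_lr else base_lr

# Hypothetical CLI mirroring the PR's change: scaling is now opt-in.
parser = argparse.ArgumentParser()
parser.add_argument("--scale-lr", action="store_true",
                    help="re-enable linear LR scaling (off by default)")
```

With scaling off, a base LR of 0.5 stays 0.5 regardless of world size; with scaling on at 64 GPUs it would become 32.0.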
LGTM