facebookresearch / FAMBench

Benchmarks to capture important workloads.
Apache License 2.0

Disable SCALE_LR in CvT #110

Closed by jataylo 1 year ago

jataylo commented 1 year ago

NaNs are produced in the CvT model at large device counts (>64 GPUs) due to LR scaling. This PR disables LR scaling by default for benchmarking purposes.

We previously reduced the overall LR to alleviate this issue (https://github.com/facebookresearch/FAMBench/commit/58f68431660fd84d1b7fdcaa4b6925dc12fb3bd3), but the scaled learning rate still grows too large when the world size is large.
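For context, a minimal sketch of the behaviour being disabled, assuming linear LR scaling by world size and a `scale_lr` flag standing in for the repo's SCALE_LR option (function name and signature here are illustrative, not the actual FAMBench/CvT code):

```python
# Sketch only: illustrates linear LR scaling by world size and a flag to
# disable it. Names are hypothetical; the real benchmark wires this through
# its own config (SCALE_LR).
import torch.distributed as dist

def effective_lr(base_lr: float, scale_lr: bool = False) -> float:
    """Return the learning rate used for training.

    With linear scaling, lr = base_lr * world_size, which can become
    large enough to destabilize training (NaNs) at >64 GPUs.
    """
    world_size = dist.get_world_size() if dist.is_initialized() else 1
    if scale_lr:
        return base_lr * world_size  # previous default behaviour
    return base_lr  # scaling disabled: base LR regardless of world size
```

For example, with a base LR of 2.5e-4 and 128 GPUs, linear scaling yields 0.032, whereas disabling scaling keeps the LR at 2.5e-4 for any world size (the exact base LR in the benchmark may differ; this is only to show the magnitude of the effect).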

amathews-amd commented 1 year ago

LGTM