Mohinta2892 / catena

Neuron Segmentation, Synaptic Partner Detection and Microtubule tracking for vEM with EM-2-EM translation. Codebase built upon Funke lab's algorithms.

Model convergence across Old and New GPU architectures #15

Open Mohinta2892 opened 1 month ago

Mohinta2892 commented 1 month ago

We have seen a difference in model convergence across old and new GPU architectures. For example,

With recent Nvidia GPUs such as the V100, A100 and RTX 4090 (cards that support mixed precision), the same models converge faster even when trained in single/full precision (fp32) with batch size 1: loss ~0.007 after 16 hours of training on a 40GB A100, at 120000 epochs (of 300000 total). However, when trained on Titan XP cards, the same models converge much more slowly: loss ~0.05 after 120000 epochs, so they require more training time.
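One contributor worth ruling out (an assumption on our side, not yet verified for this issue): on Ampere-class and newer GPUs such as the A100 and RTX 4090, PyTorch runs fp32 matmuls and convolutions through TF32 by default, so "full precision" training there is not numerically identical to fp32 on a Titan XP. A minimal sketch for checking this, assuming the training scripts use PyTorch:

```python
import torch

# On Ampere+ GPUs (e.g. A100, RTX 4090) PyTorch uses TF32 for fp32
# matmuls/convolutions by default; older cards (Titan XP) cannot.
# Disabling TF32 forces true fp32 arithmetic, making runs comparable:
torch.backends.cuda.matmul.allow_tf32 = False
torch.backends.cudnn.allow_tf32 = False

# Print what the current device actually supports, for the record.
if torch.cuda.is_available():
    name = torch.cuda.get_device_name(0)
    major, minor = torch.cuda.get_device_capability(0)
    print(f"{name}: compute capability {major}.{minor} "
          f"(TF32 available on >= 8.0)")
```

If the A100 loss curves line up with the Titan XP ones once TF32 is disabled, the difference is down to precision rather than anything deeper in the training setup.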

We need to investigate this further, but it is something to be careful of: we have seen that fast-converging models may not actually learn the task!

Please report anything like this until we get a chance to look into it further.
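To make such reports comparable, it may help to attach the exact hardware/software stack alongside each loss curve. A small helper (hypothetical, not part of catena) that collects the relevant details, again assuming PyTorch:

```python
import torch

def training_env_report() -> dict:
    """Collect GPU/software details worth attaching to a convergence report."""
    info = {
        "torch": torch.__version__,
        "cuda": torch.version.cuda,
        "cudnn": torch.backends.cudnn.version(),
        "tf32_matmul": torch.backends.cuda.matmul.allow_tf32,
    }
    if torch.cuda.is_available():
        info["gpu"] = torch.cuda.get_device_name(0)
        info["compute_capability"] = ".".join(
            map(str, torch.cuda.get_device_capability(0)))
    return info

print(training_env_report())
```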

Mohinta2892 commented 1 month ago

Using AdamW in place of Adam causes further convergence issues. For example, models can converge rapidly when AdamW is used with learning rates between 1e-2 and 1e-4, with or without BatchNorm, on a single GPU.
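One plausible factor (our assumption, not confirmed here): Adam and AdamW handle weight decay differently, so swapping them at the same learning rate changes the effective regularisation. AdamW decouples the decay from the gradient update, and PyTorch's `AdamW` also defaults to `weight_decay=1e-2` where `Adam` defaults to 0. A minimal sketch of the swap:

```python
import torch

model = torch.nn.Linear(10, 10)  # placeholder for the actual network

# Adam: weight decay (default 0) is added to the gradient (L2 penalty),
# so it gets rescaled by the adaptive per-parameter step sizes.
opt_adam = torch.optim.Adam(model.parameters(), lr=1e-4, weight_decay=0.0)

# AdamW: weight decay (default 1e-2!) is applied directly to the weights,
# decoupled from the gradient. Swapping Adam -> AdamW without pinning
# weight_decay therefore silently changes the training dynamics.
opt_adamw = torch.optim.AdamW(model.parameters(), lr=1e-4, weight_decay=1e-2)
```

If the rapid convergence disappears when `AdamW` is run with `weight_decay=0.0`, the decoupled decay, rather than the optimizer update itself, is the likely cause.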