Open · 2015aroras opened 1 month ago
I'm in favor of adding trainer tests, but I have a couple of concerns about this approach, the foremost being that I think comparing activations is far too brittle and strict. It would be enough to just check that the loss decreases over a few steps.
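Rough sketch of what I have in mind (names like `build_tiny_model`, `build_optimizer`, `get_test_batch`, and `compute_loss` are placeholders for whatever the test harness already provides, not existing helpers):

```python
import torch

def test_loss_decreases():
    torch.manual_seed(0)
    # Placeholders for the existing test fixtures / trainer setup.
    model = build_tiny_model()
    optimizer = build_optimizer(model)
    batch = get_test_batch()

    losses = []
    for _ in range(5):
        optimizer.zero_grad()
        loss = compute_loss(model, batch)  # placeholder for however the loss is computed
        loss.backward()
        optimizer.step()
        losses.append(loss.item())

    # No activation comparison; just require the loss to go down over a few steps.
    assert losses[-1] < losses[0]
```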
> If we start running jobs on a new platform we don't have an easy way to tell if the training will be identical to existing platforms.
I wouldn't assume training would or even should be identical across platforms.
> The saved model activations are about 26 MB total, which is a reasonable increase in repo size...
I think this is actually a lot, especially considering that new versions of these may have to be committed over time.
Issue: If we make a backward-incompatible change or introduce a regression, we don't have a mechanism to catch it. Also, if we start running jobs on a new platform we don't have an easy way to tell if the training will be identical to existing platforms.
Fix: Add training regression tests that run 2 steps of training and compare the model activations against an already-prepared set of model activations from beaker. The saved model activations can be updated by changing the flags passed to the tests.
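Roughly, the comparison works like this (an illustrative sketch, not the exact test code; `run_two_training_steps`, the fixture path, the tolerances, and the `update_reference` switch are stand-ins for whatever the real flags and helpers are):

```python
import torch

REFERENCE_PATH = "test_fixtures/train_regression/model_activations.pt"  # illustrative path

def test_activations_match_reference(update_reference: bool = False):
    # Stand-in for running 2 training steps with forward hooks that record
    # per-layer activations into a dict of {layer_name: tensor}.
    activations = run_two_training_steps()

    if update_reference:
        # Regenerate the saved reference activations (e.g. on beaker)
        # when the tests are run with the update flag.
        torch.save(activations, REFERENCE_PATH)
        return

    reference = torch.load(REFERENCE_PATH)
    assert activations.keys() == reference.keys()
    for name, tensor in activations.items():
        torch.testing.assert_close(tensor, reference[name], rtol=1e-4, atol=1e-5)
```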
The added tests also run on CPU, but this required making minor changes to the OLMo training code. Autocast works differently on CPU, so the model activations on CPU differ from those on GPU.
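For context, one concrete difference is that `torch.autocast` uses different default dtypes per device (float16 on CUDA, bfloat16 on CPU). The per-device reference file below is just an illustration of how this could be handled, not the actual scheme:

```python
import torch

# torch.autocast uses different default dtypes per device, which is one reason
# activations recorded under autocast differ between CPU and GPU.
device_type = "cuda" if torch.cuda.is_available() else "cpu"
autocast_dtype = torch.float16 if device_type == "cuda" else torch.bfloat16

with torch.autocast(device_type=device_type, dtype=autocast_dtype):
    ...  # forward pass / training steps go here

# Illustrative: one way to handle the mismatch is to keep a separate set of
# saved activations per device type.
reference_path = f"test_fixtures/model_activations_{device_type}.pt"
```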
The saved model activations are about 26 MB total, which is a reasonable increase in repo size...