This PR implements one main feature and then does a few small cleanups.
The main feature is checkpointing every N hours (N can be fractional). It defaults to None/zero, establishes a precedence order among the settings, and gives the user an explicit warning if they set more than one checkpointing criterion (e.g., number of train steps).
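As a rough illustration of the precedence-and-warning behavior described above (function and parameter names here are hypothetical, not the repo's actual code):

```python
import warnings


def resolve_checkpoint_interval(every_n_hours=None, every_n_train_steps=None):
    """Hypothetical sketch: pick one checkpointing criterion.

    Time-based checkpointing takes precedence; if the user sets both
    criteria, warn them and use the time-based interval. Defaults (None
    or zero) mean no periodic checkpointing.
    """
    if every_n_hours and every_n_train_steps:
        warnings.warn(
            "Both time-based and step-based checkpointing are set; "
            "using the time-based interval."
        )
        return ("hours", every_n_hours)
    if every_n_hours:  # can be fractional, e.g. 0.5 for every 30 minutes
        return ("hours", every_n_hours)
    if every_n_train_steps:
        return ("steps", every_n_train_steps)
    return None  # default: no periodic checkpointing
```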
While I was working in the training and model configuration sections of the yaml, I got annoyed flipping between them to look at `accumulate_grad_batches` and `batch_size`. They really should sit next to each other, so I moved `accumulate_grad_batches` into the `training_configuration` section.
I also alphabetized some of the ordering in the lm_eval script for consistency, and added a previously missing attribute to the `DecoderConfig` class to get past this error:
ValueError… The config you are passing has a 'model_type' attribute that is not consistent with the model type you passed (config has and you passed custom_transformer. Fix one of those so they match!
We have it in the RetNetConfig, but not in the DecoderConfig for whatever reason. Now it's in both.
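The fix amounts to declaring `model_type` on the config class, which is how Hugging Face `PretrainedConfig` subclasses identify themselves; the consistency check compares it against the model type passed at load time. A minimal sketch (the base class and value are simplified stand-ins, not the repo's exact code):

```python
class DecoderConfig:
    # Declaring model_type as a class attribute mirrors the Hugging Face
    # PretrainedConfig convention. Without it, loading raises the
    # "'model_type' attribute ... not consistent" ValueError quoted above,
    # because the config reports no model type at all.
    model_type = "custom_transformer"
```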