DRAGNLabs / 301r_retnet


Ckpt by time #67

Closed · DrewGalbraith closed this 3 months ago

DrewGalbraith commented 3 months ago

This PR implements a feature and does some small cleanup.

The main feature is checkpointing every N hours (N can be less than 1). It defaults to None/zero, establishes a precedence among checkpointing options, and gives explicit warnings to users if they set more than one checkpointing criterion (e.g., a number of train steps).
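For context, here is a minimal sketch of what time-based checkpointing with that precedence/warning behavior could look like. All names here (save_every_n_hours, save_every_n_steps, save_checkpoint, etc.) are hypothetical illustrations, not the PR's actual API:

```python
# Minimal sketch of checkpointing every N hours, with an explicit warning
# when the user sets more than one checkpointing criterion. Hypothetical
# names throughout; not the PR's actual implementation.
import time
import warnings

def train(num_steps, train_step, save_checkpoint,
          save_every_n_hours=None, save_every_n_steps=None):
    # Precedence: if both criteria are set, warn and use the time-based one.
    if save_every_n_hours and save_every_n_steps:
        warnings.warn(
            "Both time- and step-based checkpointing were set; "
            "using save_every_n_hours and ignoring save_every_n_steps."
        )
        save_every_n_steps = None

    last_save = time.monotonic()
    for step in range(1, num_steps + 1):
        train_step(step)
        if save_every_n_hours is not None:
            hours_elapsed = (time.monotonic() - last_save) / 3600.0
            # Fractional values (less than one hour) work as well.
            if hours_elapsed >= save_every_n_hours:
                save_checkpoint(step)
                last_save = time.monotonic()
        elif save_every_n_steps and step % save_every_n_steps == 0:
            save_checkpoint(step)
```

(If the training loop is built on PyTorch Lightning, its ModelCheckpoint callback exposes train_time_interval and every_n_train_steps for the same idea.)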

While I was working in the training and model configuration sections of the YAML, I got annoyed flipping between them to look at accumulate_grad_batches and batch_size. They should really be right next to each other in the training_configuration section, so I moved them together.
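For illustration, the two keys might now sit adjacent like this (hypothetical key names and values; the repo's actual YAML may differ):

```yaml
training_configuration:
  batch_size: 32              # hypothetical value
  accumulate_grad_batches: 4  # adjacent, since together they set the effective batch size
```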

I also alphabetized some of the ordering in the lm_eval script for consistency, and added a previously missing attribute to the DecoderConfig class to get past this error:

```
ValueError… The config you are passing has a 'model_type' attribute that is not consistent with the model type you passed (config has and you passed custom_transformer. Fix one of those so they match!
```

We have it in the RetNetConfig, but not in the DecoderConfig for whatever reason. Now it's in both.
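For reference, a minimal sketch of the kind of fix described (class and field names assumed from the error message; the repo's real DecoderConfig will differ):

```python
from transformers import PretrainedConfig

class DecoderConfig(PretrainedConfig):
    # Hugging Face's consistency check compares this class attribute against
    # the model type you pass; without it, PretrainedConfig's default empty
    # model_type produces the "(config has  and you passed ...)" error above.
    model_type = "custom_transformer"

    def __init__(self, hidden_size=768, **kwargs):  # hypothetical fields
        self.hidden_size = hidden_size
        super().__init__(**kwargs)
```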