BlinkDL / RWKV-LM

RWKV is an RNN with transformer-level LLM performance. It can be trained directly like a GPT (parallelizable), so it combines the best of RNNs and transformers: great performance, fast inference, low VRAM usage, fast training, "infinite" ctx_len, and free sentence embedding.
Apache License 2.0

Fewer Checkpoint Files for train.py #138

Closed Triang-jyed-driung closed 1 year ago

Triang-jyed-driung commented 1 year ago

The train.py script generates many checkpoint files during training. This is a problem especially on Google Colab or Google Drive, where only 15 GB of disk space is available and each checkpoint file ranges from 0.9 GB (0.4B model) to 3 GB (1.5B model). I suggest optimizing the checkpointing mechanism in train.py to reduce the number of files it keeps. Suggestions:

  1. Implement a rolling checkpoint mechanism (always keep the checkpoint with the lowest loss)
  2. Add an option to limit the total number of checkpoint files
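The two suggestions above could be sketched roughly like this. Note that `prune_checkpoints` is a hypothetical helper, not part of train.py; it assumes checkpoint files end in `.pth` and uses file modification time as the save order:

```python
import os

def prune_checkpoints(ckpt_dir, keep_last=2, best_name=None):
    """Delete all but the newest `keep_last` checkpoints in ckpt_dir,
    always preserving `best_name` (the lowest-loss checkpoint) if given."""
    # List checkpoint files, oldest first by modification time.
    ckpts = sorted(
        (f for f in os.listdir(ckpt_dir) if f.endswith(".pth")),
        key=lambda f: os.path.getmtime(os.path.join(ckpt_dir, f)),
    )
    # Everything older than the newest `keep_last` files is a deletion candidate.
    for f in ckpts[:-keep_last] if keep_last > 0 else ckpts:
        if f == best_name:
            continue  # never delete the lowest-loss checkpoint
        os.remove(os.path.join(ckpt_dir, f))
```

A training loop would call this right after each save, passing the filename of the current best-loss checkpoint so it survives pruning.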
BlinkDL commented 1 year ago

Hi, you can increase --epoch_save so that checkpoints are saved less frequently.