RWKV is an RNN with transformer-level LLM performance. It can be directly trained like a GPT (parallelizable), combining the best of RNNs and transformers: great performance, fast inference, low VRAM usage, fast training, "infinite" ctx_len, and free sentence embedding.
The `train.py` script generates many checkpoint files during training. This is a problem when training on Google Colab with Google Drive, where only 15 GB of disk space is available, while each checkpoint file ranges from 0.9 GB (0.4B model) to 3 GB (1.5B model). I suggest optimizing the checkpointing mechanism in `train.py` to reduce the number of files generated.
Suggestions:
- Implement a rolling checkpoint mechanism (keep the checkpoint with the lowest loss)
- Add an option to limit the total number of checkpoint files
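To illustrate, the two suggestions above could be combined in a small helper class. This is only a sketch, not code from `train.py`: the class name, the `max_keep` option, and writing raw bytes instead of calling `torch.save` are all assumptions for illustration. It keeps a fixed-size window of recent checkpoints plus the single best-loss checkpoint, deleting everything else.

```python
import os
from collections import deque

class RollingCheckpointSaver:
    """Hypothetical helper: keep at most `max_keep` recent checkpoints
    plus the checkpoint with the lowest loss seen so far."""

    def __init__(self, out_dir, max_keep=2):
        self.out_dir = out_dir
        self.max_keep = max_keep
        self.recent = deque()          # paths of recent checkpoints, oldest first
        self.best_loss = float("inf")
        self.best_path = None
        os.makedirs(out_dir, exist_ok=True)

    def save(self, step, loss, state_bytes):
        path = os.path.join(self.out_dir, f"ckpt-{step}.pth")
        # In train.py this would be torch.save(model.state_dict(), path);
        # raw bytes are used here so the sketch runs without PyTorch.
        with open(path, "wb") as f:
            f.write(state_bytes)
        if loss < self.best_loss:
            # Drop the previous best if it has already left the recent window.
            if (self.best_path and self.best_path not in self.recent
                    and os.path.exists(self.best_path)):
                os.remove(self.best_path)
            self.best_loss = loss
            self.best_path = path
        self.recent.append(path)
        # Prune the oldest checkpoints beyond the limit, but never the best one.
        while len(self.recent) > self.max_keep:
            old = self.recent.popleft()
            if old != self.best_path and os.path.exists(old):
                os.remove(old)
        return path
```

With `max_keep=2`, disk usage stays bounded at three checkpoints in the worst case (two recent plus the best), regardless of how long training runs, which fits comfortably in Colab's 15 GB even for the 1.5B model.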