AugF opened 10 months ago
When doing incremental training from the world-chinese 1.5B checkpoint, the absolute loss value reaches 2.74, far higher than the ~2.0 loss at 332B in the RWKV Pile chart. I'd like to ask what is going on?

There are multiple factors that affect the loss value, including data format, tokenizer vocabulary, and learning rate. The RWKV World models use a different tokenizer vocabulary that is larger than the one used for RWKV Pile, which contributes to a higher per-token loss. Your training data may also differ from the data the model was originally trained on, which likewise results in a higher loss.
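To make the tokenizer point concrete, here is a minimal sketch of why absolute loss values are not comparable across tokenizers. The vocabulary sizes below are assumptions based on public configs (GPT-NeoX tokenizer for RWKV Pile, the RWKV World vocabulary for World models), and the token/character counts in the usage lines are purely illustrative, not measured values:

```python
import math

# Assumed vocabulary sizes (not taken from this thread):
# RWKV Pile models use the GPT-NeoX tokenizer; RWKV World models
# use their own larger multilingual vocabulary.
PILE_VOCAB = 50277
WORLD_VOCAB = 65536

# The worst-case (uniform-distribution) cross-entropy is ln(|V|),
# so a larger vocabulary shifts the whole loss scale upward.
print(f"ln(|V|) Pile : {math.log(PILE_VOCAB):.3f} nats")
print(f"ln(|V|) World: {math.log(WORLD_VOCAB):.3f} nats")

def bits_per_char(loss_nats_per_token: float, n_tokens: int, n_chars: int) -> float:
    """Tokenizer-independent metric: convert per-token loss in nats
    to bits per character of the underlying text."""
    return loss_nats_per_token * n_tokens / (math.log(2) * n_chars)

# Hypothetical example: the same 2000-character text tokenized by two
# tokenizers with different compression ratios. A higher per-token loss
# can still correspond to a similar (or lower) bits-per-character cost
# if each token covers more text.
print(f"{bits_per_char(2.74, n_tokens=800, n_chars=2000):.3f} bits/char")   # World-style
print(f"{bits_per_char(2.00, n_tokens=1100, n_chars=2000):.3f} bits/char")  # Pile-style
```

The takeaway is that comparing a World model's loss directly against a number from the RWKV Pile chart is an apples-to-oranges comparison; a tokenizer-independent measure such as bits per character (or bits per byte) on the same held-out text is a fairer basis.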