LzhinFdu opened this issue 6 months ago (status: Open)
From the readme: "Can train 'infinite' context -- check train.gemma.infini.noclm.1Mseq.sh with 1x H100 80G (with AdamW optimizer, No gradient checkpointing)". However, I can train it with 12 GB using 8-bit quantization and a segment size of 400.
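For reference, a hedged sketch of the kind of low-memory setup described above (8-bit loading plus a small segment length). The model id, the `BitsAndBytesConfig` usage, and the `segment_length` name are assumptions for illustration, not the repo's actual `train.gemma.infini.noclm.1Mseq.sh` script; 8-bit base weights are usually kept frozen, with only a subset of parameters trained on top.

```python
# Hedged sketch, not the repo's script: load Gemma in 8-bit and pick a small
# Infini-attention segment size to keep training memory low.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "google/gemma-2b"  # assumed checkpoint; substitute the one you train

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=BitsAndBytesConfig(load_in_8bit=True),  # 8-bit weights
    torch_dtype=torch.bfloat16,  # dtype for the non-quantized modules
    device_map="auto",
)

# Chunk length used when splitting long sequences into segments for
# Infini-attention training; 400 matches the value mentioned above.
segment_length = 400
```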
I can also get training to run. However, the current training results are not very good; I'm trying to train it further.
I have the same question. Were you able to solve it?
You can try moving the memory retrieval step to after 'apply_rotary_pos_emb' and compare the training performance. However, I have not explored this further.
Thanks for your response.
Would this preserve positional information?
I noticed that the memory retrieval and update happen before 'apply_rotary_pos_emb'. I'm wondering whether the memory's lack of positional information might confuse the model's perception of the order of historical information.
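For anyone following along, here is a minimal, self-contained PyTorch sketch of the two orderings being discussed. It is not the repo's code: `retrieve_from_memory` / `update_memory` are simplified stand-ins for Infini-attention's compressive memory, and `apply_rope` is a toy rotary embedding. Variant A retrieves/updates with pre-RoPE states (the order observed in the repo), Variant B applies RoPE first (the variant suggested above), so position-encoded queries/keys flow into the memory.

```python
# Minimal, self-contained sketch of the two orderings, NOT the repo's code.
import torch
import torch.nn.functional as F

def apply_rope(x, base=10000.0):
    # Toy rotary embedding: rotate channel pairs by a position-dependent angle.
    _, _, seq_len, dim = x.shape
    half = dim // 2
    pos = torch.arange(seq_len, dtype=x.dtype).unsqueeze(-1)        # (seq, 1)
    inv_freq = base ** (-torch.arange(half, dtype=x.dtype) / half)  # (half,)
    angles = pos * inv_freq                                         # (seq, half)
    cos, sin = angles.cos(), angles.sin()
    x1, x2 = x[..., :half], x[..., half:]
    return torch.cat([x1 * cos - x2 * sin, x1 * sin + x2 * cos], dim=-1)

def retrieve_from_memory(q, memory, norm):
    # Linear-attention style read: (sigma(q) @ M) / (sigma(q) @ z)
    sigma_q = F.elu(q) + 1.0
    return (sigma_q @ memory) / (sigma_q @ norm).clamp(min=1e-6)

def update_memory(k, v, memory, norm):
    # Accumulate key/value outer products into the compressive memory.
    sigma_k = F.elu(k) + 1.0
    memory = memory + sigma_k.transpose(-2, -1) @ v
    norm = norm + sigma_k.sum(dim=-2, keepdim=True).transpose(-2, -1)
    return memory, norm

# Toy tensors for a single segment: (batch, heads, seq, head_dim).
b, h, s, d = 1, 2, 8, 16
q, k, v = (torch.randn(b, h, s, d) for _ in range(3))
memory, norm = torch.zeros(b, h, d, d), torch.zeros(b, h, d, 1)

# Variant A (order observed in the repo): the memory sees pre-RoPE states,
# so retrieval is position-agnostic.
mem_out_a = retrieve_from_memory(q, memory, norm)
memory_a, norm_a = update_memory(k, v, memory, norm)
q_a, k_a = apply_rope(q), apply_rope(k)

# Variant B (suggestion above): apply RoPE first, then retrieve/update,
# so position-encoded queries/keys flow into the memory.
q_b, k_b = apply_rope(q), apply_rope(k)
mem_out_b = retrieve_from_memory(q_b, memory, norm)
memory_b, norm_b = update_memory(k_b, v, memory, norm)
```

Note that in Variant B the memory contents become tied to within-segment positions, so whether this actually helps is something to verify empirically, as the earlier comment says.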