Open yuanrr opened 3 days ago
Hi Bowen,
According to our experimental results, it is better to trigger the retracing procedure in early layers. For models with 32 layers, that usually means layers 5-16, although most retracing happens at layers 7, 8, and 9. Following this idea, I would suggest you apply the same settings to the Qwen2.5-7b model, and you may test whether this rule still holds for larger ones.
How does the ending layer, in particular, affect performance? We believe that retracing in deep layers gives worse results, as the language model has fewer processing steps left to deal with the retraced information. So the starting and ending layers simply serve as limits to ensure MemVR is triggered within the expected range.
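To make the role of the two limits concrete, here is a minimal sketch of the trigger logic described above: retracing fires at the first layer inside `[start_layer, end_layer]` whose uncertainty exceeds a threshold, and never outside that window. The function name, the threshold value, and the plain-list representation of per-layer uncertainty are all illustrative assumptions, not the actual MemVR implementation.

```python
from typing import List, Optional

def first_trigger_layer(uncertainty: List[float],
                        start_layer: int,
                        end_layer: int,
                        threshold: float = 0.75) -> Optional[int]:
    """Return the first layer in [start_layer, end_layer] whose
    uncertainty exceeds the threshold, or None if no layer does.
    Layers outside the window are ignored, so the start/end layers
    act purely as bounds on where retracing may fire."""
    for layer in range(start_layer, end_layer + 1):
        if uncertainty[layer] > threshold:
            return layer
    return None
```

For example, if the first five layers all show high uncertainty but `start_layer=5`, those early layers are skipped and retracing only triggers on a later spike inside the window.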
Should you have any questions please feel free to discuss!
Hello, thank you for your reply! It helped me a lot. I do have some questions and hope to get your help. I ask because Qwen is a 28-layer model, so it might behave a little differently. When I used the entropy you designed to observe the uncertainty, I found two phenomena:

1. The first 10 layers stay above 0.9.
2. Around the 20th layer, the entropy suddenly rises above the threshold, even if the entropy of the previous layer is very small.

I haven't tried this on a 32-layer model, so I don't know if there is a similar phenomenon. In this case, should I search for a start layer > 10 and an end layer < 20?
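For readers reproducing this observation: values like "above 0.9" come from normalizing the Shannon entropy of each layer's token distribution by `log(vocab_size)`, so it lies in [0, 1]. A minimal sketch of that normalization is below; in practice the per-layer logits would come from projecting intermediate hidden states through the model's output head, which this toy example does not include.

```python
import numpy as np

def normalized_entropy(logits: np.ndarray) -> float:
    """Shannon entropy of softmax(logits), divided by log(vocab_size)
    so the result lies in [0, 1]. Values near 1 mean a near-uniform
    (highly uncertain) distribution; values near 0 mean a confident one."""
    z = logits - logits.max()           # subtract max for numerical stability
    p = np.exp(z) / np.exp(z).sum()
    h = -(p * np.log(p + 1e-12)).sum()  # small epsilon guards log(0)
    return float(h / np.log(len(p)))
```

A uniform distribution over the vocabulary yields a value of 1.0, while a sharply peaked one yields a value close to 0, which is why early, undecided layers can all sit above 0.9.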
Hi! I would recommend starting with layer 5 and seeing if the result is satisfying. If not, move the starting layer slightly deeper and try again. Different models do vary in their parameter settings, so all you need to do is conduct some quick evaluations to find the optimal values.
But I think starting from layer 10 won't work well, as that would be too deep for MemVR.
Thank you for your help. I will try according to your suggestion.
Great job! I am currently trying your work on the Qwen2.5 model and would like to ask how to decide on the starting and ending layers. How does the ending layer, in particular, affect performance?