> self.ffn(y) should be self.ffn(self.ln2(y)).
What is the performance gap? For smaller models, D_model is small, and then D_head is small. We would suggest using a small number of heads, e.g., 1, to make D_head >= 64.
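Roughly, something like this when choosing the head count (just a sketch; `d_model` and `num_heads` are whatever your own config calls them, not names from our repo):

```python
# Choose the number of heads so that each head has dimension >= 64.
# For small models this often means just 1 or 2 heads.
d_model = 128                      # illustrative small-model width
num_heads = max(1, d_model // 64)  # 2 heads here; falls back to 1 head if d_model <= 64
head_dim = d_model // num_heads
assert head_dim >= 64
```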
self.ffn(y) should be self.ffn(self.ln2(y)).
I had missed that. Interestingly, this alone seems to make it competitive. I will keep investigating now.
Thank you.
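For reference, my block now looks roughly like this after the fix (a simplified sketch of my own integration with placeholder module names, not the repo's exact code):

```python
import torch.nn as nn

class Block(nn.Module):
    """Pre-norm block with two LayerNorms, Transformer-style (sketch only)."""
    def __init__(self, d_model: int, token_mixer: nn.Module, ffn: nn.Module):
        super().__init__()
        self.ln1 = nn.LayerNorm(d_model)
        self.ln2 = nn.LayerNorm(d_model)
        self.attn = token_mixer  # the GLA layer goes here
        self.ffn = ffn

    def forward(self, x):
        y = x + self.attn(self.ln1(x))
        y = y + self.ffn(self.ln2(y))  # the fix: ln2 before the FFN, not self.ffn(y)
        return y
```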
Seems that we missed something in the paper. I checked our code implementation and it has two layernorms, like Transformers. I am interested in the exact numbers on the performance gap. Our smallest experimental scale is 350M. My general sense is that for smaller models, token mixing is more important, so you might want to try out the parameter allocation used in RetNet, i.e., D_k = D_model, D_v = 2*D_model. Our finding is that for large-scale models, allocating more parameters to FFNs is more important.
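Concretely, something like this for the projection sizes (a sketch only, with made-up layer names; the point is just the relative sizes of D_k and D_v):

```python
import torch.nn as nn

d_model = 512  # illustrative small-model width

# RetNet-style allocation for small models: spend more parameters on token mixing.
q_proj = nn.Linear(d_model, d_model)      # D_k = D_model
k_proj = nn.Linear(d_model, d_model)
v_proj = nn.Linear(d_model, 2 * d_model)  # D_v = 2 * D_model
o_proj = nn.Linear(2 * d_model, d_model)

# At larger scales, shrinking these and giving the parameters to the FFN
# worked better in our experiments.
```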
Thank you for this amazing work,
I'm trying to include your work as a drop-in replacement for some other SSMs such as Mamba and RWKV. Note that I train significantly smaller models (from 20M to 60M parameters), not related to natural language generation. However, I got encouraging results and I believe GLA should be competitive, but so far I fail to match RWKV/Mamba, despite promising speed/VRAM usage.
I have multiple questions in order to integrate GLA correctly: