Closed: elephantmipt closed this issue 9 months ago
As far as I understand, there are 8 heads, each with 16 hidden dimensions. However, the number of layers is still unclear. I can imagine two variants here: either the feed-forward layer uses the standard hidden size (4 × model_dim), which would mean roughly 18 layers to reach 150M parameters, or it uses a smaller hidden size (2 × model_dim), as in hyena-slim, to bias FLOPs toward the sequence-mixer layers. A rough parameter count for both variants is sketched below.
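To make the trade-off concrete, here is a minimal back-of-the-envelope sketch of how the FFN expansion factor trades off against depth for a fixed parameter budget. Every concrete number in it (d_model=768, the GPT-2 vocabulary size, tied embeddings, the 4·d_model² mixer estimate) is my own assumption for illustration, not a value taken from the paper or the released configs:

```python
# Rough parameter count for the two variants above.
# All concrete numbers are assumptions for illustration only:
# d_model=768, GPT-2 vocabulary (50257), tied embeddings, and a sequence
# mixer approximated as four d_model x d_model projections per layer.

def approx_params(d_model: int, n_layers: int, ffn_mult: int, vocab_size: int = 50257) -> int:
    embed = vocab_size * d_model              # tied input/output embedding
    mixer = 4 * d_model ** 2                  # rough q/k/v/out projection cost per layer
    ffn = 2 * d_model * (ffn_mult * d_model)  # up- and down-projections
    return embed + n_layers * (mixer + ffn)

# Variant 1: standard 4x FFN hidden size, 18 layers -> ~166M with these assumptions
# (the exact depth needed to hit 150M depends on the true d_model and vocab size)
print(f"{approx_params(768, 18, 4) / 1e6:.0f}M")

# Variant 2: slim 2x FFN hidden size (hyena-slim style) needs ~24 layers
# to land in the same ballpark -> ~152M with these assumptions
print(f"{approx_params(768, 24, 2) / 1e6:.0f}M")
```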
Hi, we have released training code for Based now -- please reopen if you have remaining questions! https://github.com/HazyResearch/based
Hello, thank you for the interesting paper and blog post. I am currently trying to reproduce the experiment with the Pile dataset.
Could you please clarify the architecture of the Based model used in this experiment? In the MQAR task, you used two layer types: BaseConv and Based. Does the same apply to the Pile dataset? Should I use 6 BaseConv-Based combinations, or simply replace Attention with Based in each of the 12 layers (assuming a 125M-parameter setup)? Furthermore, I noticed that you set `feature_dim=16` for most of the experiments. If I replace Attention with Based, should I also increase the number of layers in the model? Lastly, could you kindly share the configuration file for the Based model used in the Pile experiment?
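To make the layer-layout part of the question concrete, here is a minimal sketch of the two arrangements I have in mind; `"BaseConv"` and `"Based"` are just placeholder strings, not the actual block or config names from the released code:

```python
# The two layouts I am asking about, written out explicitly.
# The strings are placeholders for the BaseConv and Based blocks from the paper;
# the real class / config names in the released code may differ.

n_layers = 12  # assuming a 125M-parameter, 12-layer setup

# Option A: alternate the two mixers, i.e. 6 BaseConv-Based combinations
hybrid_layout = ["BaseConv" if i % 2 == 0 else "Based" for i in range(n_layers)]

# Option B: simply replace every Attention block with Based
uniform_layout = ["Based"] * n_layers

print(hybrid_layout)
print(uniform_layout)
```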