HazyResearch / zoology

Understand and test language model architectures on synthetic tasks.
Apache License 2.0

Pile experiment #17

Closed: elephantmipt closed this issue 9 months ago

elephantmipt commented 10 months ago

Hello, thank you for the interesting paper and blog post. I am currently trying to reproduce the experiment with the Pile dataset.

Could you please clarify the architecture of the Based model used in this experiment? In the MQAR task you used two layer types, BaseConv and Based. Does this also apply to the Pile dataset? Should I use 6 BaseConv-Based combinations, or simply replace attention with Based in each of the 12 layers (assuming a 125M-parameter setup)? I also noticed that you usually set feature_dim=16 for most of the experiments; if I replace attention with Based, should I also increase the number of layers in the model? Lastly, could you kindly share the configuration file for the Based model used in the Pile experiment?
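To make the two stackings I am asking about concrete, here is a minimal sketch. This is plain Python for illustration only, not the actual Zoology or Based config format, and the 12-layer / 125M assumption is hypothetical:

```python
# Two candidate ways to arrange sequence mixers in a 12-layer, ~125M model.
# Purely illustrative; the real configs may differ.

# Variant A: alternate BaseConv and Based blocks (6 of each),
# mirroring the two-layer hybrid used for the MQAR synthetic task.
variant_a = ["BaseConv", "Based"] * 6

# Variant B: replace attention with Based in every layer.
variant_b = ["Based"] * 12

print(variant_a)
print(variant_b)
```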

elephantmipt commented 10 months ago

As far as I understand, there are 8 heads, each with a hidden dimension of 16. Nevertheless, the number of layers is still unclear. I can imagine two variants here: either the feed-forward layer uses the standard hidden size (4 × model_dim), giving roughly 18 layers to match ~150M parameters, or it uses a smaller hidden size (2 × model_dim), as in Hyena-slim, to bias FLOPs toward the sequence-mixer layers.
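A rough parameter count makes the trade-off between the two variants concrete. The sketch below is only back-of-the-envelope arithmetic under assumed values (model_dim=768, a GPT-2-style vocabulary of ~50k, a 4·d² sequence-mixer projection cost, embeddings included); none of these numbers come from the paper:

```python
# Back-of-the-envelope parameter counts for the two hypothesised variants.
# All constants below are assumptions, not the paper's actual configuration.

D = 768          # assumed model dimension
VOCAB = 50_257   # assumed (GPT-2-style) vocabulary size


def approx_params(n_layers: int, mlp_expansion: int) -> float:
    """Rough total parameters in millions, ignoring norms and biases."""
    mixer = 4 * D * D                  # q/k/v/output-style projections
    mlp = 2 * mlp_expansion * D * D    # up- and down-projection
    embeddings = VOCAB * D             # tied input/output embedding
    return (n_layers * (mixer + mlp) + embeddings) / 1e6


def layers_for_target(target_m: float, mlp_expansion: int) -> int:
    """Smallest layer count whose rough total reaches the target (in millions)."""
    n = 1
    while approx_params(n, mlp_expansion) < target_m:
        n += 1
    return n


# Compare how many layers each MLP expansion needs to reach ~150M total params.
for exp in (4, 2):
    n = layers_for_target(150, exp)
    print(f"{exp}x MLP: ~{n} layers for ~{approx_params(n, exp):.0f}M total params")
```

The exact layer counts shift depending on whether embeddings are counted and on the true mixer cost, but the script shows the direction of the effect: halving the MLP expansion roughly trades feed-forward parameters for additional sequence-mixer layers at the same budget.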

simran-arora commented 9 months ago

Hi, we have released training code for Based now -- please reopen if you have remaining questions! https://github.com/HazyResearch/based