ReaLLMASIC / nanoGPT

The simplest, fastest repository for training/finetuning medium-sized GPTs.

a draft of gated attention and its modification #195

Open Hrancheng opened 1 month ago

Hrancheng commented 1 month ago

In the original paper's code, the attention output (the post-softmax result) is multiplied directly by a gated linear layer, so I modified the implementation accordingly. A boxplot can be drawn by setting `graph_type = "boxplot"` or `graph_type = "all"`.
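For reference, here is a minimal sketch of the gating pattern described above, assuming the gate is a sigmoid-activated linear projection of the block input applied elementwise to the attention output. The class name `GatedSelfAttention` and the exact gate placement are illustrative assumptions, not the code from this draft or the original paper.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GatedSelfAttention(nn.Module):
    """Sketch: multiply the softmax attention output by a sigmoid gate
    computed from a linear layer over the input (illustrative only)."""

    def __init__(self, n_embd: int, n_head: int):
        super().__init__()
        assert n_embd % n_head == 0
        self.n_head = n_head
        self.qkv = nn.Linear(n_embd, 3 * n_embd)  # fused q, k, v projection
        self.gate = nn.Linear(n_embd, n_embd)     # gated linear layer
        self.proj = nn.Linear(n_embd, n_embd)     # output projection

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        B, T, C = x.shape
        q, k, v = self.qkv(x).split(C, dim=2)
        # reshape into (B, n_head, T, head_dim) for multi-head attention
        q = q.view(B, T, self.n_head, C // self.n_head).transpose(1, 2)
        k = k.view(B, T, self.n_head, C // self.n_head).transpose(1, 2)
        v = v.view(B, T, self.n_head, C // self.n_head).transpose(1, 2)
        # causal softmax attention
        y = F.scaled_dot_product_attention(q, k, v, is_causal=True)
        y = y.transpose(1, 2).contiguous().view(B, T, C)
        # gate the attention output elementwise: y * sigmoid(W_g x)
        y = y * torch.sigmoid(self.gate(x))
        return self.proj(y)
```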