ReaLLMASIC / nanoGPT

The simplest, fastest repository for training/finetuning medium-sized GPTs.

Attempt at implementing cross-layer attention #181

Open Hrancheng opened 2 months ago

Hrancheng commented 2 months ago

Not sure whether this is correctly implemented, but a group of 4 attention layers (blocks) will use exactly the same K and V matrices. We will later make this group size of 4 a user-input argument.
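For reviewers, here is a minimal sketch of one reading of the above (the first layer of each group of 4 computes K and V, and the remaining layers of the group reuse those tensors). The class name, the `shared_kv` argument, and `kv_group_size` are all illustrative assumptions, not the actual code in this PR:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CausalSelfAttention(nn.Module):
    """Sketch: a layer either computes fresh K/V or reuses K/V handed down
    from an earlier layer in its group (names/signature are illustrative)."""
    def __init__(self, n_embd, n_head):
        super().__init__()
        self.n_head = n_head
        self.q_proj = nn.Linear(n_embd, n_embd, bias=False)
        self.kv_proj = nn.Linear(n_embd, 2 * n_embd, bias=False)
        self.out_proj = nn.Linear(n_embd, n_embd, bias=False)

    def forward(self, x, shared_kv=None):
        B, T, C = x.shape
        hs = C // self.n_head
        q = self.q_proj(x).view(B, T, self.n_head, hs).transpose(1, 2)
        if shared_kv is None:
            # first layer of a group: compute K/V and expose them
            k, v = self.kv_proj(x).split(C, dim=2)
            k = k.view(B, T, self.n_head, hs).transpose(1, 2)
            v = v.view(B, T, self.n_head, hs).transpose(1, 2)
        else:
            # later layers of the group: reuse the shared K/V tensors
            k, v = shared_kv
        y = F.scaled_dot_product_attention(q, k, v, is_causal=True)
        y = y.transpose(1, 2).contiguous().view(B, T, C)
        return self.out_proj(y), (k, v)

# share K/V within groups of 4 consecutive attention layers
n_embd, n_head, n_layer, kv_group_size = 64, 4, 8, 4
layers = nn.ModuleList(CausalSelfAttention(n_embd, n_head) for _ in range(n_layer))
x = torch.randn(2, 16, n_embd)
shared_kv = None
for i, layer in enumerate(layers):
    if i % kv_group_size == 0:
        shared_kv = None  # start of a new group: recompute K/V
    out, shared_kv = layer(x, shared_kv)
    x = x + out           # residual add (LayerNorm/MLP omitted for brevity)
```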

gkielian commented 2 months ago

Looks good!

There are some merge conflicts from recent updates.

I'll create a pull request from my gkielian repo to Hrancheng:CLA_test to weave the latest changes in; the PR will probably arrive Thursday afternoon-ish.

gkielian commented 2 months ago

Update on the time estimate: I'll start a PR for the edits tomorrow, and it seems I'll probably have the PR ready for review on Monday.

Hrancheng commented 2 months ago

Looks great!

My review has some additional configuration settings and fields to add; afterwards this should be ready to merge.

On a side note, if you can present slides to the team about the approach and the sections of code modified -- specifically the for loop over blocks and a high-level overview of what was changed (the gpt_conf configuration param, the train.py argparse, and the model.py attention and block) -- it could be a good chance to show an in-depth modification process.

Yes! I'll create slides covering the modified code next time to make this clear.
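For reference while the slides are in progress, here is a hedged sketch of how the configuration pieces mentioned above (the gpt_conf parameter and the train.py argparse flag) might be wired together; `kv_group_size` is a hypothetical name for the user-settable group size, not necessarily what the PR uses:

```python
# Sketch of the configuration plumbing; kv_group_size is a hypothetical name.
import argparse
from dataclasses import dataclass

@dataclass
class GPTConfig:
    n_layer: int = 12
    n_head: int = 12
    n_embd: int = 768
    kv_group_size: int = 4  # how many consecutive blocks share one K/V pair

def parse_args():
    parser = argparse.ArgumentParser()
    parser.add_argument("--kv_group_size", type=int, default=4,
                        help="number of consecutive attention blocks sharing K/V")
    return parser.parse_args()

if __name__ == "__main__":
    args = parse_args()
    config = GPTConfig(kv_group_size=args.kv_group_size)
    print(config)
```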

gkielian commented 3 weeks ago

@Hrancheng Reviving interest in this due to a potential hardware intersection; let's discuss early next week : )