CStanKonrad / long_llama

LongLLaMA is a large language model capable of handling long contexts. It is based on OpenLLaMA and fine-tuned with the Focused Transformer (FoT) method.
Apache License 2.0

Where is the learnable temperature parameter in cross_batch_attention? #21

Closed MarkYangjiayi closed 11 months ago

MarkYangjiayi commented 11 months ago

Hi, in section B.2 of the paper it seems that a learnable temperature is applied to the attn_weights before the softmax is computed, but I couldn't find it in the cross_batch_attention code. Was this parameter removed for some reason? Thanks!

CStanKonrad commented 11 months ago

Hi! Thank you for your question. In the experiments with the smaller (non-LLaMA) models, we used the meliad codebase, which normalizes keys and queries. This repository contains the code for the LLaMA-based models, which do not normalize keys and queries, so we do not apply the learnable temperature either. However, I see that this is not properly emphasized in the paper. Thanks for pointing it out!
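
For reference, here is a minimal sketch (not the repository's actual code) of the two variants being discussed; the function and parameter names are illustrative assumptions, not identifiers from this codebase:

```python
import torch
import torch.nn.functional as F

def attention_logits(query, key, head_dim, normalize_qk=False, log_temperature=None):
    # Plain scaled dot-product logits, as in the LLaMA-based code in this repo:
    # no key/query normalization and no learnable temperature.
    if not normalize_qk:
        return torch.matmul(query, key.transpose(-2, -1)) / head_dim ** 0.5

    # Meliad-style variant (used for the smaller, non-LLaMA models): normalize
    # keys and queries to unit length, then rescale the logits with a learnable
    # temperature before the softmax.
    query = F.normalize(query, dim=-1)
    key = F.normalize(key, dim=-1)
    return torch.matmul(query, key.transpose(-2, -1)) * log_temperature.exp()

# Illustrative shapes: (batch, heads, seq_len, head_dim)
q = torch.randn(1, 4, 8, 64)
k = torch.randn(1, 4, 8, 64)
log_temperature = torch.nn.Parameter(torch.zeros(1))  # learnable scalar (per-head in practice)

weights = F.softmax(
    attention_logits(q, k, head_dim=64, normalize_qk=True, log_temperature=log_temperature),
    dim=-1,
)
```

The point of the coupling is that once keys and queries are normalized, the usual 1/sqrt(head_dim) scaling no longer sets a useful scale for the logits, so a learnable temperature takes over that role; without the normalization (as in the LLaMA-based models here), the standard scaling is kept and the extra parameter is unnecessary.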