Hi, in section B.2 of the paper it seems that a learnable temperature is applied to the attn_weights before the softmax scores are computed, but I couldn't find it in the cross_batch_attention code. Was this parameter removed for some reason? Thanks!

Hi! Thank you for your question. In the experiments with the smaller (non-LLaMA) models, we used the meliad codebase, which normalizes keys and queries, and the learnable temperature is applied there. This repository contains the code for the LLaMA-based models, which do not normalize keys and queries, so we do not apply the learnable temperature. However, I see that this is not properly emphasized in the paper. Thanks for pointing it out!
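For anyone comparing the two setups, here is a minimal PyTorch sketch of the difference. It is illustrative only, not the actual meliad or cross_batch_attention implementation: the function name `attention_weights` and the per-head `log_temperature` parameter are hypothetical.

```python
import torch
import torch.nn.functional as F


def attention_weights(q, k, log_temperature=None):
    """Softmax attention weights over keys.

    q, k: (batch, heads, seq_len, head_dim).
    If `log_temperature` is given (a hypothetical learnable per-head
    parameter), queries and keys are L2-normalized and the logits are
    scaled by exp(log_temperature), as in the normalized-attention
    variant used for the smaller models. Otherwise, plain
    1/sqrt(head_dim) scaling is used, as in the LLaMA-based models.
    """
    if log_temperature is not None:
        q = F.normalize(q, dim=-1)
        k = F.normalize(k, dim=-1)
        # Broadcast the per-head temperature over batch/sequence dims.
        scale = log_temperature.exp().view(1, -1, 1, 1)
    else:
        scale = 1.0 / q.shape[-1] ** 0.5
    logits = torch.einsum("bhqd,bhkd->bhqk", q, k) * scale
    return logits.softmax(dim=-1)


# Usage: the normalized variant carries a learnable per-head temperature.
batch, heads, seq, dim = 2, 4, 16, 32
q = torch.randn(batch, heads, seq, dim)
k = torch.randn(batch, heads, seq, dim)
log_t = torch.nn.Parameter(torch.zeros(heads))  # learned during training
w_normalized = attention_weights(q, k, log_temperature=log_t)  # smaller models
w_plain = attention_weights(q, k)                              # LLaMA-based
```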