Closed YerongLi closed 6 months ago
Hi! We just released the FlashAttention implementation with transformers==4.38.2. You may try it on LongBench.
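For reference, enabling the FlashAttention-2 backend in transformers>=4.36 is done through the `attn_implementation` argument of `from_pretrained`. Below is a minimal sketch; the model name is an illustrative placeholder, and it assumes the `flash-attn` package and a CUDA GPU are available:

```python
def load_chat_model(model_name="meta-llama/Llama-2-7b-chat-hf"):
    """Load a causal LM with the FlashAttention-2 attention backend.

    A minimal sketch, assuming transformers>=4.38.2 and flash-attn
    are installed and a CUDA GPU is available.
    """
    # Imports kept inside the function so the sketch can be defined
    # even where torch/transformers are not installed.
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForCausalLM.from_pretrained(
        model_name,
        torch_dtype=torch.bfloat16,  # flash-attn requires fp16/bf16
        attn_implementation="flash_attention_2",
        device_map="auto",
    )
    return model, tokenizer
```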
About the results reported in our paper, we use: 1. DeepSpeed Inference to save memory and accelerate inference. 2. The patch for llama-2-7b-chat with transformers==4.32. 3. We don't use LongBench's chat template; all models are tested with plain input.
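The DeepSpeed Inference setup mentioned above can be sketched as follows. This is an assumption about the wiring, not the paper's exact script: `init_inference` with kernel injection replaces the attention/MLP blocks with fused CUDA kernels, which reduces memory and speeds up generation:

```python
def wrap_with_deepspeed(model):
    """Wrap a Hugging Face model with DeepSpeed Inference.

    A minimal sketch, assuming deepspeed is installed; the exact
    configuration used for the paper's results may differ.
    """
    # Imports kept inside the function so the sketch can be defined
    # even where torch/deepspeed are not installed.
    import torch
    import deepspeed

    engine = deepspeed.init_inference(
        model,
        dtype=torch.float16,
        tensor_parallel={"tp_size": 1},  # single GPU; raise tp_size to shard
        replace_with_kernel_inject=True,
    )
    return engine.module
```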
Hi, how did you evaluate on LongBench?
I tried to map your Llama to the extended version with https://github.com/datamllab/LongLM/blob/6b841932d5267e610a65eb228923e16746270dce/llama_example.py#L40, but it goes OOM on 2 A100-80GB GPUs with DataParallel.
When I instead used the generation flow from LongBench (https://github.com/THUDM/LongBench/blob/main/pred.py) without the extended forward, one model takes 60GB of memory.