datamllab / LongLM

[ICML'24 Spotlight] LLM Maybe LongLM: Self-Extend LLM Context Window Without Tuning
https://arxiv.org/pdf/2401.01325.pdf

OOM on LongBench #27

Closed YerongLi closed 6 months ago

YerongLi commented 6 months ago

Hi, how did you evaluate on LongBench?

I tried to map your LLaMA model to the extended version with https://github.com/datamllab/LongLM/blob/6b841932d5267e610a65eb228923e16746270dce/llama_example.py#L40, but it runs out of memory (OOM) on 2 A100-80GB GPUs with DataParallel.

I also used the generation flow from LongBench (https://github.com/THUDM/LongBench/blob/main/pred.py) without the extended forward, and a single model already takes 60GB of memory.
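Roughly, my setup looks like this (a sketch only; the patch module and argument names follow llama_example.py and may not match the repo exactly, and the group/window sizes are illustrative):

```python
# Rough sketch of the setup described above: apply the self-extend patch as in
# llama_example.py, then generate on a long LongBench input.
# Module and argument names are assumptions taken from the example script.
import torch
from functools import partial
from transformers import AutoModelForCausalLM, AutoTokenizer

import llama_self_extend_patch as LlamaSE            # from this repo (assumed name)
from modify_utils import modify_method_of_instance   # from this repo (assumed name)

model_name = "meta-llama/Llama-2-7b-chat-hf"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.float16)

# Swap every LlamaAttention.forward for the self-extended forward
# (group sizes here are illustrative only).
self_extend_forward = partial(LlamaSE.self_extend_forward,
                              group_size_1=8, group_size_2=1024)
modify_method_of_instance(model, "LlamaAttention", "forward", self_extend_forward)

# Each DataParallel replica holds the full model, so long LongBench inputs
# can still OOM even on 2x A100-80GB.
model = model.cuda()
prompt = "..."  # a long LongBench example
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=128, do_sample=False)
```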

Mooler0410 commented 6 months ago

Hi! We just released a FlashAttention implementation that works with transformers==4.38.2. You can try it on LongBench; a sketch is below.
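A minimal sketch of that path, assuming transformers==4.38.2 and a `SelfExtend.apply` patching interface like the one in the repo's example script (the function name, its arguments, and the group/window sizes are assumptions, not a definitive recipe):

```python
# Sketch: load llama-2-7b-chat with FlashAttention-2 under transformers==4.38.2
# and apply the self-extend patch before running LongBench prompts.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

import SelfExtend  # patching module shipped with this repo (assumed name)

model_name = "meta-llama/Llama-2-7b-chat-hf"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.float16,
    attn_implementation="flash_attention_2",  # requires flash-attn to be installed
    device_map="auto",                        # shard across available GPUs
)

# Assumed call signature: group size and neighbor window control the
# self-extended attention; the values here are illustrative only.
SelfExtend.apply(model, group_size=8, window_size=1024, enable_flash_attention=True)

prompt = "Summarize the following document: ..."
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=256, do_sample=False)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```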

For the results reported in our paper, we used:

1. DeepSpeed Inference, to save memory and accelerate inference (see the sketch below).
2. The patch for llama-2-7b-chat with transformers==4.32.
3. No LongBench chat template; all models were tested with plain input.
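For point 1, a sketch of how DeepSpeed Inference can shard the model across two GPUs to cut per-GPU memory (arguments are illustrative, launched with something like `deepspeed --num_gpus 2 run.py`; the self-extend patch would be applied to the model before this wrapping step):

```python
# Sketch: DeepSpeed Inference with tensor parallelism over 2 GPUs.
# The self-extend patch (as in llama_example.py) is assumed to have been
# applied to `model` before init_inference is called.
import torch
import deepspeed
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "meta-llama/Llama-2-7b-chat-hf"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.float16)

engine = deepspeed.init_inference(
    model,
    mp_size=2,                         # tensor-parallel degree: one shard per GPU
    dtype=torch.half,
    replace_with_kernel_inject=False,  # kernel injection would overwrite the patched forward
)
model = engine.module

inputs = tokenizer("A long LongBench prompt ...", return_tensors="pt").to(
    torch.cuda.current_device())
output = model.generate(**inputs, max_new_tokens=64, do_sample=False)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```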