SJTU-IPADS / PowerInfer

High-speed Large Language Model Serving on PCs with Consumer-grade GPUs
MIT License

Seems not to support long prompts well. #74

Open swankong opened 9 months ago

swankong commented 9 months ago

We noticed that the paper mentions limited performance improvement for relatively long prompts, but in our case, with very long prompts, PowerInfer seems to stop working entirely and generates no output. This occurred with llama-7B, llama-13B, and llama-70B-q4. We speculate that when the prompt is very long, many neurons need to be active simultaneously, so the locality-based optimization no longer applies and PowerInfer's working mechanism becomes ineffective. We would like to hear your perspective on this.

YixinSong-e commented 9 months ago

Thank you for showing interest in our project. Indeed, when the input prompt is overly long, the time required to generate the first token increases, but this doesn't impact the generation phase, which still benefits from a noticeable acceleration. The slowdown in generating the first token is influenced by two main factors: accumulated sparsity and the performance of operators. First, during prompt processing, multiple tokens are processed in parallel, leading to an accumulation of neuron activations that reduce the applicability of locality. Your understanding of this aspect is correct. Second, the current GPU operators we use for prompt processing are not efficient, and improving their performance is a focus of our ongoing optimization efforts.
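To give a rough intuition for the accumulation effect, here is a toy illustration (the numbers and the independence assumption are made up for illustration only; real activations are correlated across tokens, so the union grows more slowly in practice):

```python
# Toy model of "accumulated sparsity": each token activates only a small
# fraction of FFN neurons, but a batch of prompt tokens processed in parallel
# needs the union of all their neuron sets, which approaches a dense layer.
import random

n_neurons = 11008        # FFN width of one LLaMA-7B layer
per_token_ratio = 0.10   # assume ~10% of neurons fire per token (illustrative)

def active_set():
    return set(random.sample(range(n_neurons), int(n_neurons * per_token_ratio)))

for prompt_len in (1, 8, 64, 512):
    union = set()
    for _ in range(prompt_len):
        union |= active_set()
    print(f"{prompt_len:4d} tokens -> {len(union) / n_neurons:6.1%} of neurons needed")
```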

In fact, in the tests conducted for our paper, when the input length exceeds 80 (an empirical threshold), we switch to dense computation for the prompt phase and then return to PowerInfer's computation graph for the generation phase. This approach is somewhat faster but also adds complexity to the code. Consequently, we have temporarily removed this implementation from the open-source version. We are currently exploring how best to balance these considerations.
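For clarity, here is a minimal sketch of that threshold-based switching; this is not the removed implementation, and all names below are made up for illustration:

```python
# Toy sketch of the hybrid strategy described above: dense computation for
# long prompts, the sparse (predictor-gated) graph for short prompts and for
# generation. Not PowerInfer's actual code; names are illustrative only.

DENSE_PROMPT_THRESHOLD = 80  # the empirical threshold mentioned above

def prefill_graph(n_prompt_tokens: int) -> str:
    """Pick the compute graph for the prompt (prefill) phase."""
    if n_prompt_tokens > DENSE_PROMPT_THRESHOLD:
        # Many tokens processed in parallel activate most neurons, so plain
        # dense matmuls beat predictor-gated sparse kernels here.
        return "dense"
    return "sparse"

def decode_graph() -> str:
    """Generation is one token at a time, so per-token sparsity still applies."""
    return "sparse"

if __name__ == "__main__":
    for n in (16, 80, 81, 1024):
        print(f"{n:4d} prompt tokens -> prefill: {prefill_graph(n)}, decode: {decode_graph()}")
```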

Additionally, could you provide the input length settings of your experiment? This information would greatly aid us in running relevant tests and implementing fixes. Thank you very much! You might also consider reducing the -b parameter to check whether PowerInfer is actively computing or hitting a bottleneck during the prompt phase.
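For example, assuming your build keeps llama.cpp's -b (batch size) option for prompt processing, appending something like -b 16 to your usual command (paths and values below are only placeholders) forces the prompt to be processed in much smaller batches:

./main -m <your-model>.powerinfer.gguf -t 8 -n 128 -b 16 -p "<your long prompt>"

If output appears, even slowly, with a small batch size but not with the default, that would point to the prompt-phase operators rather than a hang elsewhere.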

swankong commented 9 months ago

Hi,

Here are the test details: I used the llama-13b-relu.powerinfer.gguf model. My hardware is an RTX 4080 GPU with 16GB of VRAM and an Intel Core i7-12700F 12-core processor with 32GB of memory.

[1] The first prompt is: "What's the python programming language?" The command line is as follows:

./main -m /data/dev-work/models/ReluLLaMA-13B-PowerInfer-GGUF/llama-13b-relu.powerinfer.gguf --color --multiline-input --keep -1 --chunks -1 --draft 64 -c 0 -n 128 -t 8 --vram-budget -1 -p "What's the python programming language?"

As for the first prompt test, the model seems to work normally.

[2] Regarding the second prompt, I copied some text from Wikipedia about LLaMA-2 and asked the model to summarize it. Here are the command line and the results: there was barely any output. I'm not sure whether this issue comes from misuse on my side (command arguments, prompting method, etc.) or from the model / inference engine itself.

./main -m /data/dev-work/models/ReluLLaMA-13B-PowerInfer-GGUF/llama-13b-relu.powerinfer.gguf --color --multiline-input --keep -1 --chunks -1 --draft 64 -c 0 -n 128 -t 8 --vram-budget -1 -p "Please summarize the content between the angle brackets <> within 10 words. <On July 18, 2023, in partnership with Microsoft, Meta announced LLaMA-2, the next generation of LLaMA. Meta trained and released LLaMA-2 in three model sizes: 7, 13, and 70 billion parameters. The model architecture remains largely unchanged from that of LLaMA-1 models, but 40% more data was used to train the foundational models. The accompanying preprint also mentions a model with 34B parameters that might be released in the future upon satisfying safety targets. LLaMA-2 includes both foundational models and models fine-tuned for dialog, called LLaMA-2 Chat. In further departure from LLaMA-1, all models are released with weights, and are free for many commercial use cases. However, due to some remaining restrictions, the description of LLaMA as open source has been disputed by the Open Source Initiative (known for maintaining the Open Source Definition). In November 2023, research conducted by Patronus AI, an artificial intelligence startup company, compared performance of LLaMA-2, OpenAI's GPT-4 and GPT-4-Turbo, and Anthropic's Claude2 on two versions of a 150-question test about information in SEC filings (e.g. Form 10-K, Form 10-Q, Form 8-K, earnings reports, earnings call transcripts) submitted by public companies to the agency where one version of the test required the generative AI models to use a retrieval system to locate the specific SEC filing to answer the questions while the other version provided the specific SEC filing to the models to answer the question (i.e. in a long context window). On the retrieval system version, GPT-4-Turbo and LLaMA-2 both failed to produce correct answers to 81% of the questions, while on the long context window version, GPT-4-Turbo and Claude-2 failed to produce correct answers to 21% and 24% of the questions respectively. Llama 2 foundational models were trained on a data set with 2 trillion tokens. This data set was curated to remove Web sites that often disclose personal data of people. It also upsamples sources considered trustworthy. Llama 2 - Chat was additionally fine-tuned on 27,540 prompt-response pairs created for this project, which performed better than larger but lower-quality third-party datasets. For AI alignment, reinforcement learning with human feedback (RLHF) was used with a combination of 1,418,091 Meta examples and seven smaller datasets. The average dialog depth was 3.9 in the Meta examples, 3.0 for Anthropic Helpful and Anthropic Harmless sets, and 1.0 for five other sets, including OpenAI Summarize, StackExchange, etc. > "

hodlen commented 9 months ago

Hi @swankong, thanks for your feedback! I just tested the same model with the same arguments as you did in [2], and here is the full generation result:

62 million dialogs were generated in total using both Llama 2 - Chat and Llama 2, with the dialogs being used to evaluate RLHF models.

Regardless of the quality, I think it does work on long prompts. To be specific, it processes the prompt tokens in batches of 128 and then generates as usual. How is the result on your end? It might be the model's own capability, as you considered, that leads to the near-empty output.