
FlexFlow Serve: Low-Latency, High-Performance LLM Serving
https://flexflow.readthedocs.io
Apache License 2.0

SpecInfer generates '<pad>' #1301

Open dutsc opened 5 months ago

dutsc commented 5 months ago

My machine configuration is 4*3090, and my example prompt is: "please introduce Kobe Bryant, who played basketball in NBA". I use three SSMs, all of which are opt-125M. Only when the LLM is opt-13b does the generated text look normal, as shown below:

[screenshot: opt-13b output]

When I use smaller LLMs (opt-6.7b, opt-1.3b), the generated text is all '<pad>'.

[screenshots: opt-6.7b and opt-1.3b outputs]

Why is that?

My script is as follows (run from the directory /workspace/FlexFlow/build/). The prompt.json contains "please introduce Kobe Bryant, who played basketball in NBA".

./inference/spec_infer/spec_infer \
    -ll:gpu 4 \
    -ll:fsize 22000 \
    -ll:zsize 30000 \
    -llm-model /models/opt-13b/ \
    -ssm-model /models/opt-125m/ \
    -ssm-model /models/opt-125m/ \
    -ssm-model /models/opt-125m/ \
    -prompt /workspace/FlexFlow/prompts/prompt.json \
    -tensor-parallelism-degree 4 \
    --fusion > ../sclog/spec_infer.log
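
For reference, FlexFlow Serve's example prompt files appear to be JSON arrays of prompt strings, so a prompt.json like the one above could be created as follows (a minimal sketch; the JSON-array format is assumed from the FlexFlow example prompts, and the path matches the command above):

    # Write a one-element JSON array containing the prompt (assumed format)
    echo '["please introduce Kobe Bryant, who played basketball in NBA"]' > /workspace/FlexFlow/prompts/prompt.json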

Thank you very much for your valuable time.

xinhaoc commented 4 months ago

@dutsc Hi! We have shown in the latest version of our paper that using a single SSM achieves the best performance, and there is now an assertion here to make sure only one SSM is registered. Please make sure you are using the newest code, and let me know if you still get incorrect output.
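
For example, the command from the original report reduced to a single SSM would look like this (a sketch based on the command above; the flags are unchanged, only the duplicate -ssm-model entries are removed):

    # Same invocation as before, but registering opt-125m as the only SSM
    ./inference/spec_infer/spec_infer \
        -ll:gpu 4 \
        -ll:fsize 22000 \
        -ll:zsize 30000 \
        -llm-model /models/opt-13b/ \
        -ssm-model /models/opt-125m/ \
        -prompt /workspace/FlexFlow/prompts/prompt.json \
        -tensor-parallelism-degree 4 \
        --fusion > ../sclog/spec_infer.log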