feifeibear / LLMSpeculativeSampling

Fast inference from large language models via speculative decoding

llama 1B performance #16

Closed cliangyu closed 9 months ago

cliangyu commented 9 months ago

Hi Jiarui! Great implementation! I found PY007/TinyLlama-1.1B-intermediate-step-240k-503b generates repetitive words. Maybe that's why speculative sampling doesn't work for me. My script:

python main.py \
    --input "Write a 1000-word essay on the US constitutions" \
    --target_model_name transformers_cache/llama-2-7b-hf \
    --approx_model_name PY007/TinyLlama-1.1B-intermediate-step-240k-503b

https://huggingface.co/PY007/TinyLlama-1.1B-Chat-v0.1 is preferred by the author. However, the vocabulary size of this model is 32001, while the vocabulary sizes of llama-2-7b-hf and PY007/TinyLlama-1.1B-intermediate-step-240k-503b are 32000. Would fixing the pad token help?
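
For reference, a quick way to compare the two vocabulary sizes with transformers (just a sketch; the model name and local path are the ones from the command above):

    from transformers import AutoTokenizer

    # Compare vocab sizes of the target and approx models used above.
    target_tok = AutoTokenizer.from_pretrained("transformers_cache/llama-2-7b-hf")
    approx_tok = AutoTokenizer.from_pretrained("PY007/TinyLlama-1.1B-Chat-v0.1")

    # len() includes added special tokens, e.g. an extra pad token.
    print(len(target_tok), len(approx_tok))  # expected: 32000 vs 32001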

cliangyu commented 9 months ago

I realized PY007/TinyLlama-1.1B-Chat-v0.1 also generates repetitive words. So I'm not certain which approx model we should use for llama.

feifeibear commented 9 months ago

> Hi Jiarui! Great implementation! I found PY007/TinyLlama-1.1B-intermediate-step-240k-503b generates repetitive words. Maybe that's why speculative sampling doesn't work for me. My script:
>
>     python main.py \
>         --input "Write a 1000-word essay on the US constitutions" \
>         --target_model_name transformers_cache/llama-2-7b-hf \
>         --approx_model_name PY007/TinyLlama-1.1B-intermediate-step-240k-503b
>
> https://huggingface.co/PY007/TinyLlama-1.1B-Chat-v0.1 is preferred by the author. However, the vocabulary size of this model is 32001, while the vocabulary sizes of llama-2-7b-hf and PY007/TinyLlama-1.1B-intermediate-step-240k-503b are 32000. Would fixing the pad token help?

Hello Cliang, TinyLlama seems to use the llama-1 architecture, and I can hardly find a tiny llama-2. I am not sure whether two models with different vocab tables can work together. I suggest you pick two models that use the same tokenizer.

feifeibear commented 9 months ago

> I realized PY007/TinyLlama-1.1B-Chat-v0.1 also generates repetitive words. So I'm not certain which approx model we should use for llama.

First, no matter what kind of approx model is used, speculative sampling always produces the same generation distribution as the target model. Second, if you pick an approx model that is more similar to the target one, the probability of rejection during sampling will be lower, which means more efficient sampling.
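
For intuition, here is a minimal sketch of the per-token accept/reject rule in speculative sampling (not the repo's exact code; p and q are assumed to be 1D probability tensors over the vocab). The rejection/resampling step is what keeps the output distributed exactly like the target model, and the acceptance probability min(1, p[x]/q[x]) is high when the approx distribution q is close to the target distribution p:

    import torch

    def accept_or_resample(x, p, q):
        """Accept/reject one drafted token.

        x: token id proposed by the approx (draft) model
        p: target-model probabilities over the vocab (1D tensor)
        q: approx-model probabilities over the vocab (1D tensor)
        Returns the token to emit; it is distributed exactly as p.
        """
        # Accept the draft token with probability min(1, p[x] / q[x]).
        if torch.rand(1).item() < min(1.0, (p[x] / q[x]).item()):
            return x
        # On rejection, resample from the residual distribution norm(max(p - q, 0)).
        residual = torch.clamp(p - q, min=0)
        residual = residual / residual.sum()
        return torch.multinomial(residual, num_samples=1).item()

When q is close to p, p[x]/q[x] is near 1 for the tokens the draft model actually proposes, so drafts are rarely rejected and more tokens are accepted per target-model forward pass.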

haiduo commented 3 months ago

>> Hi Jiarui! Great implementation! I found PY007/TinyLlama-1.1B-intermediate-step-240k-503b generates repetitive words. Maybe that's why speculative sampling doesn't work for me. My script:
>>
>>     python main.py \
>>         --input "Write a 1000-word essay on the US constitutions" \
>>         --target_model_name transformers_cache/llama-2-7b-hf \
>>         --approx_model_name PY007/TinyLlama-1.1B-intermediate-step-240k-503b
>>
>> https://huggingface.co/PY007/TinyLlama-1.1B-Chat-v0.1 is preferred by the author. However, the vocabulary size of this model is 32001, while the vocabulary sizes of llama-2-7b-hf and PY007/TinyLlama-1.1B-intermediate-step-240k-503b are 32000. Would fixing the pad token help?
>
> Hello Cliang, TinyLlama seems to use the llama-1 architecture, and I can hardly find a tiny llama-2. I am not sure whether two models with different vocab tables can work together. I suggest you pick two models that use the same tokenizer.

TinyLlama-1.1B adopts exactly the same architecture and tokenizer as Llama 2. Ref: https://github.com/jzhang38/TinyLlama/tree/main?tab=readme-ov-file#tinyllama-11b