deepjavalibrary / djl-serving

A universal scalable machine learning model deployment solution
Apache License 2.0

Appropriate whitespace missing in streaming output for Llama2, Mistral models #1272

Closed Najib-Haq closed 4 weeks ago

Najib-Haq commented 11 months ago

Description

When streaming is enabled with Llama2 or Mistral models (models that use LlamaTokenizer), the output does not contain the appropriate white space. For example, it produces text like DaenerysistheKhaleesi

Expected Behavior

Streaming output should be Daenerys is the Khaleesi with appropriate spaces

How to Reproduce?

I used the below configuration:

engine=Python
option.dtype=fp16
option.model_id=mistralai/Mistral-7B-Instruct-v0.1
option.tensor_parallel_degree=1
option.enable_streaming=true
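
With that configuration deployed, a quick way to see the problem is to stream the response and print each token as it arrives. This is only a rough sketch; the URL, model name, and payload shape are assumptions and depend on how the model is actually deployed:

import json
import requests

# Assumes the model is served by DJL Serving and registered under the name "mistral";
# adjust the URL and model name for your deployment.
url = "http://localhost:8080/predictions/mistral"
payload = {"inputs": "Who is Daenerys Targaryen?", "parameters": {"max_new_tokens": 64}}

with requests.post(url, json=payload, stream=True) as resp:
    resp.raise_for_status()
    # With option.enable_streaming=true the response is application/jsonlines,
    # one JSON object per line, each carrying a single token.
    for line in resp.iter_lines():
        if line:
            print(json.loads(line)["outputs"][0], end="", flush=True)

# Prints something like DaenerysistheKhaleesi, i.e. the tokens run together
# without the expected spaces.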

What have you tried to solve it?

If I print a space after each token, the stream becomes Da ener ys is the Kh ale esi because multiple tokens map to a single word. This might be an issue with LlamaTokenizer; see https://github.com/huggingface/transformers/issues/22710.

Btw really appreciate all the hard work behind this repository! Thanks a lot!

frankfliu commented 11 months ago

I just tested with the latest nightly image (deepjavalibrary/djl-serving:deepspeed-nightly) and couldn't reproduce your issue.

When you enable token streaming, we use application/jsonlines as the default content type; each line contains one token:

{"outputs": ["\""]}
{"outputs": ["Green"]}
{"outputs": ["Book"]}
{"outputs": ["\""]}
{"outputs": ["is"]}
{"outputs": ["based"]}
Najib-Haq commented 11 months ago

I am using DJLServing v0.24.0 release via this image: '763104351884.dkr.ecr.eu-central-1.amazonaws.com/djl-inference:0.24.0-deepspeed0.10.0-cu118'. I have adapted the DJL streaming example presented here https://aws.amazon.com/blogs/machine-learning/elevating-the-generative-ai-experience-introducing-streaming-support-in-amazon-sagemaker-hosting/.
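
The client side follows the blog closely. Roughly (a sketch only; the endpoint name and request payload here are placeholders, and the blog wraps the buffering below in a LineIterator helper):

import json
import boto3

smr = boto3.client("sagemaker-runtime")
response = smr.invoke_endpoint_with_response_stream(
    EndpointName="my-mistral-endpoint",  # placeholder
    ContentType="application/json",
    Body=json.dumps({"inputs": "Who is Daenerys Targaryen?",
                     "parameters": {"max_new_tokens": 128}}),
)

# The stream arrives as payload parts that are not guaranteed to align with
# line boundaries, so buffer bytes until a full jsonlines record is available.
buffer = b""
for event in response["Body"]:
    part = event.get("PayloadPart")
    if not part:
        continue
    buffer += part["Bytes"]
    while b"\n" in buffer:
        line, buffer = buffer.split(b"\n", 1)
        if line.strip():
            print(line)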

My issue is how do I join the tokens together for display? For application/jsonlines my outputs look like this:

b'{"outputs": ["Da"]}'
b'{"outputs": ["en"]}'
b'{"outputs": ["ary"]}'
b'{"outputs": ["s"]}'
b'{"outputs": ["T"]}'
b'{"outputs": ["arg"]}'
b'{"outputs": ["ary"]}'
b'{"outputs": ["en"]}'
b'{"outputs": [","]}'
b'{"outputs": ["also"]}'
b'{"outputs": ["known"]}'
b'{"outputs": ["as"]}'
b'{"outputs": ["Da"]}'
b'{"outputs": ["ener"]}'
b'{"outputs": ["ys"]}'
b'{"outputs": ["Storm"]}'
b'{"outputs": ["born"]}'
b'{"outputs": ["of"]}'
b'{"outputs": ["House"]}'
b'{"outputs": ["T"]}'
b'{"outputs": ["arg"]}'
b'{"outputs": ["ary"]}'
b'{"outputs": ["en"]}'
b'{"outputs": [","]}'
b'{"outputs": ["is"]}'
b'{"outputs": ["a"]}'
b'{"outputs": ["fict"]}'
b'{"outputs": ["ional"]}'
b'{"outputs": ["character"]}'
b'{"outputs": ["in"]}'
b'{"outputs": ["the"]}'
b'{"outputs": ["television"]}'
b'{"outputs": ["series"]}'
b'{"outputs": ["\\""]}'
b'{"outputs": ["Game"]}'
b'{"outputs": ["of"]}'
b'{"outputs": ["Th"]}'
b'{"outputs": ["ron"]}'
b'{"outputs": ["es"]}'
b'{"outputs": ["\\""]}'

So I wanted to know how to present the output so that it is coherent for the user. I can't just print a space after each token here.

frankfliu commented 11 months ago

@Najib-Haq

I'm able to reproduce your issue. It seems specific to the Llama2 and Mistral models. Will take a look.

However, for Llama2 and Mistral models we recommend enabling rolling batch; it provides much better throughput.

engine=Python
option.dtype=fp16
option.model_id=mistralai/Mistral-7B-Instruct-v0.1
option.tensor_parallel_degree=2
option.rolling_batch=vllm
# uncomment the following line if you want to use application/jsonlines output
# option.output_formatter=jsonlines

For llama2:

engine=MPI
option.model_id=openlm-research/open_llama_7b_v2
option.tensor_parallel_degree=2
option.rolling_batch=auto
# option.output_formatter=jsonlines
Najib-Haq commented 11 months ago

@frankfliu

Thanks for the tips!

Regarding the issue, it seems this happens because both the Llama2 and Mistral models use LlamaTokenizer, which is based on the sentencepiece tokenizer. This tokenizer doesn't produce the appropriate prefix spaces when decoding token by token, but it does if I pass the previously generated tokens as well (it emits the correct spaces when a few consecutive tokens are decoded together). Just to show the idea, in line 217 of streaming_utils.py, instead of this:

token_text = tokenizer.decode(input_ids)

I do something like this:

# tokens_previous holds the previously generated token ids
tokens_previous = torch.cat((tokens_previous, input_ids), dim=1)
full_token_text = tokenizer.decode(tokens_previous)
# previous_output_length is the length of the previous full_token_text
token_text = full_token_text[previous_output_length:]
previous_output_length = len(full_token_text)

Not the most efficient hack, but it works. Would love to know the actual solution though.
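
In case it helps anyone, here is the same idea as a standalone snippet (a sketch only, outside of streaming_utils.py, assuming the transformers library and the Mistral tokenizer):

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-Instruct-v0.1")
token_ids = tokenizer("Daenerys is the Khaleesi", add_special_tokens=False).input_ids

# Naive per-token decoding drops the prefix spaces that sentencepiece encodes:
naive = "".join(tokenizer.decode([tid]) for tid in token_ids)

# Decoding the cumulative sequence and emitting only the newly added text
# keeps the spaces intact:
streamed = ""
previous_output_length = 0
generated = []
for tid in token_ids:
    generated.append(tid)
    full_token_text = tokenizer.decode(generated)
    streamed += full_token_text[previous_output_length:]
    previous_output_length = len(full_token_text)

print(naive)     # runs the words together, e.g. DaenerysistheKhaleesi
print(streamed)  # Daenerys is the Khaleesi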

sindhuvahinis commented 4 weeks ago

@Najib-Haq We have deprecated streaming, and our rolling batch supports a wide range of models with both the vllm and lmi-dist engines; lmi-dist and vllm should not have the prefix-space issue. Feel free to try them out. We also have a whole LMI documentation section here: https://docs.djl.ai/master/docs/serving/serving/docs/lmi/index.html.

Closing this issue as it is stale. Feel free to open a new one if you have any more questions.