dusty-nv / jetson-containers

Machine Learning Containers for NVIDIA Jetson and JetPack-L4T

local_llm: looks like utf8 characters generation is not supported? #455

Open ksvladimir opened 3 months ago

ksvladimir commented 3 months ago

I'm trying to make the model generate emojis using this command:

./run.sh $(./autotag local_llm) python3 -m local_llm.chat --api=mlc --model=NousResearch/Llama-2-7b-chat-hf --prompt="Repeat this twice: 😀"

Unfortunately, the result is this:

>> PROMPT: Repeat this twice: 😀

 Sure, here are the answers to your questions:

���

������</s>

I tried with other models and different prompts - I can never make the model output a smiley. Perhaps a utf-8 issue?

dusty-nv commented 3 months ago

Hmm, I know the emojis work in the llamaspeak web UI (Llama is friendly and likes to output them, haha). I don't think emojis are expected to render correctly as characters on the console? I'm not super familiar with terminal encodings like that, but if you figure out some setting that enables them in the terminal, let me know. IIRC I set UTF-8 in the base dockerfiles.



ksvladimir commented 3 months ago

You're right, it's actually working in the webui.

In the terminal, I don't think it's an encoding issue (I can print emojis just fine). My guess is that it's a tokenization issue. If I understand the code correctly, it decodes one token at a time when streaming. Most emojis are a single Unicode character but are represented as multiple tokens (because GPT tokenizers operate on bytes rather than characters). When tokenizer.decode() sees a token sequence that is still incomplete and can't be decoded to valid UTF-8, it returns the '\ufffd' character (which indicates a decoding error), so the model ends up printing a sequence of '\ufffd'.
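
For illustration, here's a minimal standalone sketch of that behavior using the Hugging Face tokenizer for this model (assuming the transformers package is available; the exact token ids and per-token output depend on the tokenizer version):

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("NousResearch/Llama-2-7b-chat-hf")

# The emoji is a single character, but it maps to several byte-level tokens.
ids = tokenizer.encode("😀", add_special_tokens=False)
print(ids)                              # several token ids, not one

# Decoding one token at a time never sees a complete UTF-8 sequence,
# so the partial decodes typically come back as the U+FFFD replacement character.
for i in ids:
    print(repr(tokenizer.decode([i])))

# Decoding the full sequence at once reconstructs the emoji.
print(tokenizer.decode(ids))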

dusty-nv commented 3 months ago

Ahh okay, yes - in the web UI, it resends the entire chat history every time, so it captures the re-tokenization that occurs.

On the terminal, it prints the output token-by-token, but does not go back and re-print it. Try the --disable-streaming option to python3 -m local_llm.chat and see if it prints the emoji in the final text.
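
For example, adapting the command from the original report with that flag added:

./run.sh $(./autotag local_llm) python3 -m local_llm.chat --api=mlc --model=NousResearch/Llama-2-7b-chat-hf --disable-streaming --prompt="Repeat this twice: 😀"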

The StreamingResponse iterator returned from generate() also has the output_text member which stores the properly detokenized text so far: https://github.com/dusty-nv/jetson-containers/blob/1a7fb07dd6183be744784a1c418799c1e0796ca3/packages/llm/local_llm/chat/stream.py#L35

ksvladimir commented 3 months ago

Unfortunately --disable-streaming doesn't help, because it still simply concatenates the per-token strings from the stream: https://github.com/dusty-nv/jetson-containers/blob/master/packages/llm/local_llm/models/mlc.py#L404

Perhaps a better solution is to not return incomplete tokens from StreamingResponse? That's how, e.g., exllamav2 handles it. Hugging Face handles it by only returning complete words while streaming, which is another viable option.
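
A rough sketch of that first approach (the function name and buffering logic here are mine for illustration, not code from exllamav2, Hugging Face, or local_llm):

def stream_decode_complete(tokenizer, token_ids):
    """Yield only newly completed text, holding back any chunk that still
    ends with the U+FFFD replacement character (i.e. a character whose
    bytes are split across tokens and not all received yet)."""
    ids, emitted = [], 0
    for tid in token_ids:
        ids.append(tid)
        text = tokenizer.decode(ids)
        if text.endswith('\ufffd'):
            continue        # wait for the rest of this character's bytes
        yield text[emitted:]
        emitted = len(text)

With the tokenizer and ids from the earlier snippet, this yields nothing for the partial byte tokens and then the complete emoji once its final token arrives.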

dusty-nv commented 3 months ago

Ah, you are correct on the first point - that was from before I discovered this issue too. For now, try keeping streaming mode on, and then at the end of the generation print(stream.output_text).

That should be correct because of how the output is continually de-tokenized when new tokens are added: https://github.com/dusty-nv/jetson-containers/blob/1a7fb07dd6183be744784a1c418799c1e0796ca3/packages/llm/local_llm/chat/stream.py#L57
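
A hedged sketch of that workaround; the constructor and argument names below are my assumptions about the local_llm API (the thread only confirms generate(), StreamingResponse, and output_text), so treat it as illustrative rather than exact:

from local_llm import LocalLM

# assumed constructor/arguments, matching the model used above
model = LocalLM.from_pretrained("NousResearch/Llama-2-7b-chat-hf", api="mlc")

stream = model.generate("Repeat this twice: 😀", streaming=True)

for token in stream:
    print(token, end='', flush=True)   # per-token output may show '\ufffd'

print()
print(stream.output_text)              # full re-detokenized reply, emoji intact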

ksvladimir commented 3 months ago

Yes, I got it to work that way, thank you!