deepseek-ai / DeepSeek-LLM

DeepSeek LLM: Let there be answers
https://chat.deepseek.com/
MIT License

German umlaut missing with deepseek-llm on llama #9

Closed p3d-dev closed 7 months ago

p3d-dev commented 7 months ago

Here are the responses for a few models; deepseek-llm cannot output "ö" and "ü":

%ollama run orca2:13b "Please repeat: wäre, Tür, höchstens"
wäre, Tür, höchstens

Translation: would be, door, at most

%ollama run codellama:34b "Please repeat: wäre, Tür, höchstens"

Wäre, Tür, höchstens.

%ollama run deepseek-llm:67b-chat "Please repeat: wäre, Tür, höchstens"
To complete this task, I will first listen to the audio file provided and write down the German words that are spoken. Then,
I will repeat those words in a clear manner for you.

Step 1: Listen to the audio file and identify the German words being spoken. In this case, the words are "wäre", "Tr"
(door), and "hchstens" (at most).

Step 2: Repeat each word in a clear manner.
- wäre -> I would say this as "vare".
- Tr -> Pronounced like "tuer", which means door.
- hchstens -> This is pronounced like "hkhs-tens" and it translates to "at most."

Is this a problem with the model or with ollama?

DOGEwbx commented 7 months ago

Hi @p3d-dev, thank you for using deepseek-llm. The issue you're experiencing originates from our tokenizer. We included a few German and special characters as added tokens, which triggers a bug in the Hugging Face tokenizer. We have raised an issue at https://github.com/huggingface/tokenizers/issues/1392.

The tokenizer decodes the letters ["ø","ö","ú","ÿ","õ","÷","û","ý","À","ù","Á","þ","ü"] to meaningless bytes, and ollama apparently drops these tokens during decoding.
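For anyone who wants to check whether their tokenizer build is affected, here is a minimal sketch (assuming the transformers library and the deepseek-ai/deepseek-llm-67b-chat checkpoint on the Hugging Face Hub; adjust the model name to whichever variant you use) that round-trips the affected words through encode/decode:

```python
# Minimal round-trip check for the umlaut issue described above.
# Assumes `pip install transformers` and access to the Hugging Face Hub;
# the model name below is an assumption, swap in the checkpoint you actually use.
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("deepseek-ai/deepseek-llm-67b-chat")

for word in ["wäre", "Tür", "höchstens"]:
    ids = tok.encode(word, add_special_tokens=False)
    roundtrip = tok.decode(ids)
    # If the tokenizer bug is present, characters such as "ü" and "ö"
    # disappear from the decoded string and the check below prints False.
    print(f"{word!r} -> {roundtrip!r} | lossless: {roundtrip == word}")
```

If the round trip is lossless here but the characters still vanish in ollama, the remaining loss happens in ollama's own decoding path rather than in the tokenizer vocabulary itself.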

Given that modifying our vocabulary may introduce new complications, we decided not to rectify this bug in the current version. However, we would like to assure you that it will be addressed in the forthcoming update of our model. Thank you for your understanding.