Open taiyou2000 opened 1 year ago
I think there is still something wrong with MPT. I also had the larger MPT-30B output weird characters here and there, and yesterday koboldcpp kept crashing with a segmentation fault while generating close to and beyond 2048 tokens. The prompt ingestion worked fine. I'm also not sure how intelligent it's supposed to be: either there is something wrong with it, or it's simply much weaker than a llama-based model.
(EDIT: The segmentation fault was my own fault: I missed the `--contextsize` parameter. Maybe we need to check whether the user gives contradictory values on the CLI and in KoboldAI Lite; see the sketch below.)
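For what it's worth, here is a minimal sketch of the kind of check I mean. The names (`clamp_context`, `cli_contextsize`, `requested_max_context`) are hypothetical for illustration, not actual koboldcpp internals:

```python
# Hypothetical sanity check: if the context size requested by the UI exceeds
# what was allocated via the --contextsize CLI flag, warn and clamp instead
# of letting generation run past the allocated buffer.
def clamp_context(cli_contextsize: int, requested_max_context: int) -> int:
    if requested_max_context > cli_contextsize:
        print(f"Warning: UI requested {requested_max_context} tokens of context, "
              f"but --contextsize is {cli_contextsize}; clamping.")
        return cli_contextsize
    return requested_max_context
```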
Interesting! I think I had a similar problem with TheBloke/PULI-GPT-3SX-GGML (puli-gpt-3sx.ggmlv1.q8_0.bin). It generates garbled characters, gives the unknown token '�' error, and generally produces only nonsense, no matter which temperature or sampling preset I use. I decided it wasn't worth the effort, so I gave up on using it.
I tried to use mpt-7b-ggml-q5_1 (https://huggingface.co/TheBloke/MPT-7B-GGML) with koboldcpp (commit hash: e6ddb15c3a8) on Ubuntu 22.04. It is fine generating English text, but when it comes to characters in languages other than English, it generates garbled output like this:
������
��都市
����都
And the terminal shows: `gpt_tokenize: unknown token '�'`
I also tried to run MPT with PyTorch in Colab and on my own machine, but both ran out of memory, so I can't verify whether this is a ggml-side or a PyTorch/Transformers-side issue; I suspect it is on the ggml side. I also suspected a misconfigured terminal encoding, but my locale is UTF-8 (ja_JP.UTF-8), so the terminal is an unlikely cause. Running the model with https://github.com/ggerganov/ggml directly produced the same result.
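If it is on the ggml side, one plausible mechanism (an assumption on my part, not verified against the ggml code) is that byte-level BPE tokens can end in the middle of a multi-byte UTF-8 sequence; decoding each token's bytes on its own then yields U+FFFD ('�') instead of the intended character. A minimal Python repro of that effect:

```python
# Two CJK characters, 3 UTF-8 bytes each.
text = "都市"
data = text.encode("utf-8")        # b'\xe9\x83\xbd\xe5\xb8\x82'

# Pretend the tokenizer split the byte stream mid-character:
token_a, token_b = data[:4], data[4:]

# Decoding each token independently produces replacement characters,
# matching the garbled output above.
print(token_a.decode("utf-8", errors="replace"))  # '都�' (trailing partial byte)
print(token_b.decode("utf-8", errors="replace"))  # '��'  (two orphaned bytes)
```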
A similar issue seems to have been discussed early on in the llama.cpp repository: https://github.com/ggerganov/llama.cpp/pull/73
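I haven't checked exactly what that PR does, but the usual fix for this class of bug is to buffer the raw token bytes and only emit complete UTF-8 sequences. A Python sketch of that pattern (not koboldcpp's actual code):

```python
import codecs

# An incremental decoder holds back incomplete UTF-8 sequences until the
# remaining bytes arrive, instead of decoding each token on its own.
decoder = codecs.getincrementaldecoder("utf-8")(errors="replace")

def stream_tokens(token_bytes_list):
    for chunk in token_bytes_list:
        piece = decoder.decode(chunk)
        if piece:
            print(piece, end="")
    # Flush anything still pending at end of generation.
    print(decoder.decode(b"", final=True))

# Using the split from the repro above, this prints '都市' intact:
stream_tokens([b"\xe9\x83\xbd\xe5", b"\xb8\x82"])
```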