Closed JMPSequeira closed 6 months ago
This is happening with the server, even with older models that worked perfectly well before. If we revert to older releases, the issue does not occur.
The default server UI does not work with instruct models because it uses the /completion endpoint and its own chat template, not the model's. Either use a base model, or a client that supports the /chat/completions endpoint.
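For illustration, a request against the server's OpenAI-compatible endpoint might look like the sketch below (not from the original thread; the host and port match the server invocation later in the report, and the message content and sampling parameters are made up):

```bash
# Sketch only. Assumes the server from this report is listening on
# localhost:8080 and that the build exposes the OpenAI-compatible
# /v1/chat/completions endpoint, which formats the prompt with a chat
# template on the server side instead of leaving it to the web UI.
curl -s http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "messages": [
          {"role": "system", "content": "You are a helpful assistant."},
          {"role": "user", "content": "Hello!"}
        ],
        "temperature": 0.7
      }'
```

If the template stored in the GGUF metadata is missing or wrong, the server's --chat-template flag can override it.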
Noted, thanks.
OS: Debian 12
llama.cpp version: b2715
Model: Llama 3 8B Instruct
The model was converted from the HF Meta repo using:

./convert.py ~/ai/hf-models/Llama-3-8b-Instruct/ --outfile ~/ai/unquantized/Llama-3-8b_fp16.gguf --vocab-type bpe --outtype f16
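As a quick sanity check on the conversion (an editor sketch, not part of the original report), the tokenizer and template metadata written into the GGUF can be inspected with the dump script from llama.cpp's gguf-py package; the exact script name and path have varied between releases:

```bash
# Assumption: run from a llama.cpp checkout; in other releases the script
# is named gguf_dump.py or ships with the gguf Python package.
# --no-tensors prints only the key/value metadata, skipping tensor info.
python3 gguf-py/scripts/gguf-dump.py --no-tensors ~/ai/unquantized/Llama-3-8b_fp16.gguf
```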
Running:

./server -m ~/ai/unquantized/Llama-3-8b_fp16.gguf -ngl 33 -ts 1,1 --host 0.0.0.0 --port 8080

I start getting unrelated tokens at the 2nd or 3rd generation. Here's an example:

Sometimes it generates ad aeternum:
...and it continued until stop.

Here are my options: