ggerganov / llama.cpp

LLM inference in C/C++
MIT License

BOS and EOS tokens #7057

Closed · walidbet18 closed 2 months ago

walidbet18 commented 4 months ago

Hi everyone! I have a question. It might be dumb, but I want to understand:

```
llm_load_print_meta: BOS token = 1 ''
llm_load_print_meta: EOS token = 2 ''
llm_load_print_meta: UNK token = 0 ''
llm_load_print_meta: PAD token = 0 ''
```

I know what these tokens mean (to be honest, I understood them through translation tasks), but for tasks like question answering I don't understand how they work, because sometimes the answer is much longer than the question. So how does that work? Can I modify these tokens in llama.cpp, and by what criteria?

Jeximo commented 4 months ago

I don't understand how they work, because sometimes the answer is much longer than the question

Hi. BOS means beginning of sequence, and EOS means end of sequence. They're usually special tokens defined in the model's vocabulary that llama.cpp uses during text generation.
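If you want to check which ids and strings your particular GGUF uses, here's a minimal sketch against the llama.cpp C API as it existed around this time (the function names are assumptions based on that era's llama.h and may differ in newer versions, where these helpers were moved around):

```c
// Print the BOS/EOS token ids and their text for a given GGUF model.
// Sketch only: assumes the 2024-era llama.cpp C API where
// llama_token_bos/llama_token_eos take a llama_model pointer.
#include <stdio.h>
#include "llama.h"

int main(int argc, char ** argv) {
    if (argc < 2) {
        fprintf(stderr, "usage: %s model.gguf\n", argv[0]);
        return 1;
    }

    llama_backend_init();

    struct llama_model_params mparams = llama_model_default_params();
    struct llama_model * model = llama_load_model_from_file(argv[1], mparams);
    if (model == NULL) {
        fprintf(stderr, "failed to load model\n");
        return 1;
    }

    llama_token bos = llama_token_bos(model);
    llama_token eos = llama_token_eos(model);

    printf("BOS id = %d '%s'\n", bos, llama_token_get_text(model, bos));
    printf("EOS id = %d '%s'\n", eos, llama_token_get_text(model, eos));

    llama_free_model(model);
    llama_backend_free();
    return 0;
}
```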

In most cases llama.cpp inserts the BOS token automatically. How the EOS token is handled depends on the model. Here's an example:

./main ~/model.gguf -cml -p "What's 5+5?"

-cml automatically fills in both the BOS and EOS tokens for the prompt (a BOS token before What's, an EOS token after 5?), assuming the model uses the ChatML template.
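For reference, with -cml the prompt above gets wrapped in the ChatML template before tokenization, roughly like this (a sketch; the exact system message depends on the flags you pass):

```
<|im_start|>system
You are a helpful assistant<|im_end|>
<|im_start|>user
What's 5+5?<|im_end|>
<|im_start|>assistant
```

For ChatML-tuned models, <|im_end|> plays the EOS role: generation stops when the model emits it.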

walidbet18 commented 4 months ago

@Jeximo thanks for your answer. I understand that, but what I'm trying to do here is fine-tune my model on a text file similar to this: "function1(int, string, bool) -> none this method takes an int, a string and a bool as parameters, function2() takes no arguments ... etc." I'm just wondering how the model would know where to stop if I ask it to return the function1 method. How would it know that it has to return just "function1(int, string, bool) -> none this method takes an int, a string and a bool as parameters" and not all the text?

This is why I ended up wondering whether BOS and EOS tokens would give me an answer here.

arnfaldur commented 4 months ago

Your question seems to be about dataset creation. As I understand it, a dataset consists of multiple text snippets of various sizes, like the ones you describe. During training, each snippet is wrapped in BOS and EOS tokens, the snippets are concatenated, and the result is fed through the model; see the sketch below.
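To make that concrete, here's a toy illustration of the layout. Real pipelines operate on token ids rather than strings; the `<s>`/`</s>` markers below are the usual text forms of BOS id 1 and EOS id 2 in Llama-style vocabularies:

```c
// Toy illustration of training-data layout: each snippet is wrapped in
// BOS/EOS and the results are concatenated into one stream. The snippet
// strings are stand-ins for the kind of data described above.
#include <stdio.h>

int main(void) {
    const char * snippets[] = {
        "function1(int, string, bool) -> none this method takes an int, a string and a bool as parameters",
        "function2() takes no arguments",
    };
    const int n = sizeof(snippets) / sizeof(snippets[0]);

    for (int i = 0; i < n; i++) {
        printf("<s>%s</s>", snippets[i]); // BOS + snippet + EOS, back to back
    }
    printf("\n");
    return 0;
}
```

Because every snippet ends with an EOS token during training, the model learns to emit EOS when an answer is complete, and at inference time llama.cpp stops generating once that token is sampled. That is how it would return only the function1 description rather than the whole file.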

I suggest closing this issue, as it isn't really a llama.cpp issue. You can find good resources on dataset creation and LLM training techniques online.

github-actions[bot] commented 2 months ago

This issue was closed because it has been inactive for 14 days since being marked as stale.