Closed: Nivek92 closed this issue 2 months ago.
Your understanding is correct and we use the same template for all bunny models for consistency and convenience. The performance is also acceptable.
Also why do bunny and phi3 preprocess differently?
Some models' tokenizers prepend a bos_token at the beginning of the sentence.
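For example, this is easy to check directly (a minimal sketch assuming the transformers library and the public phi-2 / Phi-3 checkpoints named below):

```python
# Minimal sketch (assumes the transformers library and public checkpoints):
# check whether a tokenizer prepends a bos_token and what its eos_token is.
from transformers import AutoTokenizer

for name in ["microsoft/phi-2", "microsoft/Phi-3-mini-4k-instruct"]:
    tok = AutoTokenizer.from_pretrained(name)
    ids = tok("hello").input_ids
    prepends_bos = tok.bos_token_id is not None and len(ids) > 0 and ids[0] == tok.bos_token_id
    print(f"{name}: prepends bos={prepends_bos}, eos_token={tok.eos_token!r}")
```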
bunny itself is not a base model, so the name here is somewhat misleading. What base model does bunny refer to?
We admit that it may be confusing. At first we supported phi-1.5, phi-2, stablelm-2 and qwen-1.5, none of which prepend a bos_token and all of which share the same eos_token, so all of them use the bunny conversation version. However, minicpm, phi-3 and llama-3 prepend a bos_token and use different eos_tokens, so we defined three more conversation modes.
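Conceptually the split looks like this (an illustrative sketch only, not the actual Bunny code; the ConvTemplate class and the separator strings here are placeholders):

```python
# Not the actual Bunny code -- just a sketch of the idea: one conversation
# template ("version") per model family, differing in whether the tokenizer
# already prepends bos and which token closes each assistant turn.
from dataclasses import dataclass

@dataclass
class ConvTemplate:
    version: str
    sep: str                   # token appended after each assistant answer
    tokenizer_adds_bos: bool   # whether the tokenizer already prepends bos_token

    def build_prompt(self, turns):
        """turns: list of (question, answer) pairs."""
        prompt = ""
        for q, a in turns:
            prompt += f"USER: {q} ASSISTANT: {a}{self.sep}"
        return prompt

# Hypothetical registry; the real separator strings come from each model's tokenizer.
TEMPLATES = {
    "bunny": ConvTemplate("bunny", sep="<|endoftext|>", tokenizer_adds_bos=False),
    "llama": ConvTemplate("llama", sep="<|eot_id|>",    tokenizer_adds_bos=True),
}
```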
Also check here.
I'm looking through the code to see how different dataset formats besides the default LLAVA dataset format could be supported e.g. [{"question": "...", "answer": "..."}]
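One way such a format could be adapted (a sketch assuming the default loader expects the LLaVA-style "conversations" list with "from"/"value" keys; qa_to_llava is a hypothetical helper, not part of the repo):

```python
# Hypothetical helper: convert a {"question": ..., "answer": ...} record into the
# LLaVA-style "conversations" layout that the default data loader expects.
def qa_to_llava(sample, add_image_token=True):
    question = sample["question"]
    if add_image_token:
        # LLaVA-style samples carry an <image> placeholder in the first human turn
        question = "<image>\n" + question
    return {
        "conversations": [
            {"from": "human", "value": question},
            {"from": "gpt", "value": sample["answer"]},
        ]
    }

print(qa_to_llava({"question": "What is in the image?", "answer": "A rabbit."}))
```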
Doing so I wonder if the input string is constructed correctly, considering the example chat format for microsoft/Phi-3-mini-4k-instruct is
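(roughly, per the model card, using the <|user|>, <|end|> and <|assistant|> turn markers):

```
<|user|>
Question<|end|>
<|assistant|>
```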
From my understanding the input string is created in https://github.com/BAAI-DCAI/Bunny/blob/main/bunny/conversation.py and it looks like the input string looks something like
USER: ...###ASSISTANT: ...<endoftext>USER: ...###ASSISTANT: ...<endoftext>
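To double-check, one could print the constructed string directly (a sketch assuming bunny/conversation.py keeps the LLaVA-style conv_templates / append_message / get_prompt API):

```python
# Sketch, assuming bunny/conversation.py keeps the LLaVA-style conversation API
# (conv_templates, copy, append_message, get_prompt).
from bunny.conversation import conv_templates

conv = conv_templates["bunny"].copy()
conv.append_message(conv.roles[0], "<image>\nWhat is in the image?")
conv.append_message(conv.roles[1], None)
print(repr(conv.get_prompt()))  # the exact string fed to the tokenizer
```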
Is my understanding here correct? Was it a deliberate choice not to use the format recommended by the original model, and if so, why?
I would imagine the performance would be better if the recommended chat formatting were used instead.
Edit:
Also why do bunny and phi3 preprocess differently? bunny itself is not a base model, so the name here is somewhat misleading. What base model does bunny refer to?