BAAI-DCAI / Bunny

A family of lightweight multimodal models.
Apache License 2.0

Wrong Chat Format #74

Closed. Nivek92 closed this issue 2 months ago

Nivek92 commented 2 months ago

I'm looking through the code to see how dataset formats other than the default LLaVA format could be supported, e.g. [{"question": "...", "answer": "..."}].
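
For concreteness, this is the kind of mapping I have in mind (a rough sketch that converts such records into the LLaVA-style conversations structure; the helper name is hypothetical):

# Hypothetical helper: map [{"question": ..., "answer": ...}] records into the
# LLaVA-style "conversations" structure that the default dataset loader expects.
def qa_to_llava(records, image=None):
    conversations = []
    for i, rec in enumerate(records):
        question = rec["question"]
        if i == 0 and image is not None:
            question = "<image>\n" + question  # image placeholder goes in the first human turn
        conversations.append({"from": "human", "value": question})
        conversations.append({"from": "gpt", "value": rec["answer"]})
    sample = {"conversations": conversations}
    if image is not None:
        sample["image"] = image
    return sample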

While reading the code, I started to wonder whether the input string is constructed correctly, given that the example chat format for microsoft/Phi-3-mini-4k-instruct is:

<|user|>
I am going to Paris, what should I see?<|end|>
<|assistant|>
Paris, the capital of France, is known for its stunning architecture, art museums, historical landmarks, and romantic atmosphere. Here are some of the top attractions to see in Paris:\n\n1. The Eiffel Tower: The iconic Eiffel Tower is one of the most recognizable landmarks in the world and offers breathtaking views of the city.\n2. The Louvre Museum: The Louvre is one of the world's largest and most famous museums, housing an impressive collection of art and artifacts, including the Mona Lisa.\n3. Notre-Dame Cathedral: This beautiful cathedral is one of the most famous landmarks in Paris and is known for its Gothic architecture and stunning stained glass windows.\n\nThese are just a few of the many attractions that Paris has to offer. With so much to see and do, it's no wonder that Paris is one of the most popular tourist destinations in the world."<|end|>
<|user|>
What is so great about #1?<|end|>
<|assistant|>
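
(For reference, that is the prompt the model's own chat template produces; a minimal sketch using the stock Hugging Face tokenizer:)

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("microsoft/Phi-3-mini-4k-instruct")

messages = [
    {"role": "user", "content": "I am going to Paris, what should I see?"},
    {"role": "assistant", "content": "Paris, the capital of France, is known for ..."},
    {"role": "user", "content": "What is so great about #1?"},
]

# add_generation_prompt=True appends the trailing <|assistant|> turn marker
print(tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True))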

From my understanding, the input string is created in https://github.com/BAAI-DCAI/Bunny/blob/main/bunny/conversation.py, and it ends up looking something like:

USER: ...###ASSISTANT: ...<endoftext>USER: ...###ASSISTANT: ...<endoftext>
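
Roughly as if it were assembled by a two-separator template like this (a hypothetical sketch, not the actual conversation.py):

# Hypothetical sketch of a two-separator prompt builder: the first separator
# follows user turns, the second one closes assistant turns.
def build_prompt(turns, sep="###", sep2="<endoftext>"):
    seps = [sep, sep2]
    prompt = ""
    for i, (role, message) in enumerate(turns):
        prompt += role + ": " + message + seps[i % 2]
    return prompt

# build_prompt([("USER", "q1"), ("ASSISTANT", "a1"), ("USER", "q2"), ("ASSISTANT", "a2")])
# -> "USER: q1###ASSISTANT: a1<endoftext>USER: q2###ASSISTANT: a2<endoftext>"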

Is my understanding here correct? Was it a deliberate choice not to use the format recommended by the original model, and if so, why?

I would imagine the performance would be better if the recommended chat format were used instead.

Edit:

Also, why do bunny and phi3 preprocess differently? bunny itself is not a base model, so the name here is somewhat misleading. Which base model does bunny refer to?

if conversation_lib.default_conversation.version == "bunny":
    return preprocess_bunny(sources, tokenizer, has_image=has_image)
elif conversation_lib.default_conversation.version in {"minicpm", "llama", "phi3"}:
    return preprocess_bunny_with_bos(sources, tokenizer, has_image=has_image)
Isaachhh commented 2 months ago

Your understanding is correct. We use the same template for all Bunny models for consistency and convenience, and the performance is also acceptable.

Isaachhh commented 2 months ago

Also why do bunny and phi3 preprocess differently?

Some models' tokenizers prepend a bos_token at the beginning of the sentence.
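
To illustrate why that needs its own path (a minimal, hypothetical sketch, not the actual preprocess_bunny_with_bos): once a bos_token is inserted, every offset computed from per-turn token counts shifts by one, and the bos itself has to be masked out of the training targets.

import copy

IGNORE_INDEX = -100

def mask_targets(input_ids, turn_lengths, bos_prepended):
    # turn_lengths: list of (user_len, assistant_len) token counts per round
    targets = copy.deepcopy(input_ids)
    cur = 0
    if bos_prepended:
        targets[0] = IGNORE_INDEX  # never train on the bos_token
        cur = 1                    # all turn offsets shift by one
    for user_len, assistant_len in turn_lengths:
        targets[cur:cur + user_len] = [IGNORE_INDEX] * user_len  # mask the user turn
        cur += user_len + assistant_len                          # keep assistant tokens as labels
    return targets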

bunny itself is not a base model, so the name here is somewhat misleading. What base model does bunny refer to?

We admit that it may be confusing. We supported phi-1.5, phi-2, stablelm-2, and qwen-1.5 at first, none of which prepend a bos_token and all of which share the same eos_token. So, at first, all models used the bunny conversation version. However, minicpm, phi-3, and llama-3 prepend a bos_token and have different eos_tokens, so we defined three more conversation modes.
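
One quick way to see the difference is to inspect the tokenizers directly (a sketch; the model IDs are only examples and the results depend on the tokenizer revision):

from transformers import AutoTokenizer

for name in ["microsoft/phi-2", "microsoft/Phi-3-mini-4k-instruct"]:
    tok = AutoTokenizer.from_pretrained(name)
    ids = tok("hello")["input_ids"]
    prepends_bos = tok.bos_token_id is not None and ids and ids[0] == tok.bos_token_id
    print(name, "| prepends bos:", bool(prepends_bos), "| eos:", tok.eos_token)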

Also check here.