BAAI-DCAI / Bunny

A family of lightweight multimodal models.
Apache License 2.0

How to modify `preprocess_bunny` for `qwen-1.5-1.8b-chat` #84

Closed · linhaojia13 closed this 1 month ago

linhaojia13 commented 1 month ago

Qwen1.5-1.8B/config.json:

  "bos_token_id": 151643,
  "eos_token_id": 151643,

Qwen1.5-1.8B-Chat/config.json:

  "bos_token_id": 151643,
  "eos_token_id": 151645,

This difference causes the condition `if tokenizer.pad_token_id == tokenizer.eos_token_id:` in `preprocess_bunny` to evaluate differently for the two models, as can be seen below:

def preprocess_bunny(
        sources,
        tokenizer: transformers.PreTrainedTokenizer,
        has_image: bool = False
) -> Dict:
    conv = conversation_lib.default_conversation.copy()
    roles = {"human": conv.roles[0], "gpt": conv.roles[1]}

    # Apply prompt templates
    conversations = []
    for i, source in enumerate(sources):
        if roles[source[0]["from"]] != conv.roles[0]:
            # Skip the first one if it is not from human
            source = source[1:]

        conv.messages = []
        for j, sentence in enumerate(source):
            role = roles[sentence["from"]]
            assert role == conv.roles[j % 2], f"{i}"
            conv.append_message(role, sentence["value"])
        conversations.append(conv.get_prompt())

    # Tokenize conversations

    if has_image:
        input_ids = torch.stack(
            [tokenizer_image_token(prompt, tokenizer, return_tensors='pt') for prompt in conversations], dim=0)
    else:
        input_ids = tokenizer(
            conversations,
            return_tensors="pt",
            padding="longest",
            max_length=tokenizer.model_max_length,
            truncation=True,
        ).input_ids

    targets = input_ids.clone()

    assert conv.sep_style == conversation_lib.SeparatorStyle.TWO

    # Mask targets
    sep = conv.sep + conv.roles[1] + ": "
    for conversation, target in zip(conversations, targets):
        total_len = int(target.ne(tokenizer.pad_token_id).sum())

        rounds = conversation.split(conv.sep2)
        cur_len = 0
        end_token_cnt = 0

        for i, rou in enumerate(rounds):
            if rou == "":
                break

            parts = rou.split(sep)
            if len(parts) != 2:
                break
            parts[0] += sep

            if has_image:
                round_len = len(tokenizer_image_token(rou, tokenizer))
                instruction_len = len(tokenizer_image_token(parts[0], tokenizer)) - 1
            else:
                round_len = len(tokenizer(rou).input_ids)
                instruction_len = len(tokenizer(parts[0]).input_ids) - 1

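            # conv.sep2 (the end-of-round token) was stripped by split(conv.sep2) above,
            # so count it back in and remember how many such tokens were added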
            round_len += 1
            end_token_cnt += 1

            target[cur_len: cur_len + instruction_len] = IGNORE_INDEX

            cur_len += round_len
        target[cur_len:] = IGNORE_INDEX

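        # When pad and eos share the same id, the end-of-round eos tokens look like
        # padding to target.ne(tokenizer.pad_token_id), so they are excluded from
        # total_len; remove them from cur_len as well so the two counts match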
        if tokenizer.pad_token_id == tokenizer.eos_token_id:
            cur_len -= end_token_cnt
        if cur_len < tokenizer.model_max_length:
            if cur_len != total_len:
                target[:] = IGNORE_INDEX
                print(
                    f"WARNING: tokenization mismatch: {cur_len} vs. {total_len}."
                    f" (ignored)"
                )

    return dict(
        input_ids=input_ids,
        labels=targets,
    )

If I want to modify `preprocess_bunny` for `qwen-1.5-1.8b-chat`, there seem to be two ways: 1) delete `round_len += 1`; or 2) delete the `if tokenizer.pad_token_id == tokenizer.eos_token_id` check so that `cur_len -= end_token_cnt` is always executed.
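
Concretely, against the masking loop above, the two options would be (just a sketch of the edits, not standalone code):

# Option 1: stop adding the stripped separator back into the round length
#   delete:   round_len += 1
#             (end_token_cnt is then no longer needed for this model)
#
# Option 2: always compensate for the end-of-round tokens
#   delete:   if tokenizer.pad_token_id == tokenizer.eos_token_id:
#   keep:     cur_len -= end_token_cnt    # now runs unconditionally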

I'm not sure which way is correct and will not introduce potential errors.

Isaachhh commented 1 month ago

The easiest way is to change `<|endoftext|>` to `<|im_end|>` here. However, `conv_bunny` is used for phi-1.5, phi-2, stablelm-2 and qwen1.5, so this change would break compatibility with those other models.

You may define a new `conv_qwen_chat` and pay attention to all related usages, such as `conv_mode` and `version`.
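
For reference, such a template could look roughly like the sketch below. The field names follow the LLaVA-style `Conversation` dataclass that this kind of `conversation.py` is based on, and the system prompt, role names and template name are placeholders, so adapt everything to the actual definitions in the repo:

# Rough sketch of a dedicated template for qwen1.5-chat; field names follow the
# LLaVA-style Conversation dataclass and are not verified against this repo.
conv_qwen_chat = Conversation(
    system="A chat between a curious user and an artificial intelligence assistant. "
           "The assistant gives helpful, detailed, and polite answers to the user's questions.",
    roles=("USER", "ASSISTANT"),
    version="qwen_chat",               # hypothetical version string
    messages=(),
    offset=0,
    sep_style=SeparatorStyle.TWO,      # preprocess_bunny asserts SeparatorStyle.TWO
    sep=" ",
    sep2="<|im_end|>",                 # the chat model's eos token instead of <|endoftext|>
)

# Register it under a new conv_mode so it can be selected for training/inference.
conv_templates["qwen_chat"] = conv_qwen_chat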

linhaojia13 commented 1 month ago

Thank you very much!