LAION-AI / Open-Assistant

OpenAssistant is a chat-based assistant that understands tasks, can interact with third-party systems, and retrieve information dynamically to do so.
https://open-assistant.io
Apache License 2.0

Tokenizer's padding_side was not validated to be "right" in trainer_sft.py #3657

Open theblackcat102 opened 1 year ago

theblackcat102 commented 1 year ago
from transformers import AutoTokenizer
AutoTokenizer.from_pretrained("OpenAssistant/llama2-13b-orca-8k-3319").padding_side
>> 'left'
AutoTokenizer.from_pretrained("TheBloke/Llama-2-13B-fp16")
>> 'left'
AutoTokenizer.from_pretrained("mosaicml/mpt-7b").padding_side
>> 'right'
AutoTokenizer.from_pretrained("huggyllama/llama-7b").padding_side
>> 'left'
AutoTokenizer.from_pretrained("OpenAssistant/llama-30b-sft-v8.2-2.4k-steps-system").padding_side
>> 'left'

Since the LLaMA tokenizers use left padding, the supervised-training DialogueDataCollator pads label_mask in the opposite direction from the tokenizer.pad output (input_ids, attention_mask): the label masks are padded on the right before the torch.stack(label_mask) call, which effectively implements a right-padding strategy.
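To make the mismatch concrete, here is a minimal sketch (not the actual collator code) that mimics what the collator does: tokenizer.pad follows the tokenizer's left padding, while the label masks are padded on the right before torch.stack:

    import torch
    from transformers import AutoTokenizer

    # Left-padding tokenizer, as reported above; LLaMA tokenizers ship without a
    # pad token, so reuse eos for padding.
    tokenizer = AutoTokenizer.from_pretrained("huggyllama/llama-7b")
    tokenizer.pad_token = tokenizer.eos_token

    features = [tokenizer("hi"), tokenizer("hi there, how are you today?")]
    label_masks = [torch.ones(len(f["input_ids"]), dtype=torch.bool) for f in features]

    # tokenizer.pad honours padding_side='left', so pads go on the LEFT of input_ids.
    batch = tokenizer.pad(features, return_tensors="pt")

    # The collator pads the label masks on the RIGHT before stacking them.
    max_len = batch["input_ids"].shape[1]
    label_masks = torch.stack(
        [torch.cat([m, torch.zeros(max_len - len(m), dtype=torch.bool)]) for m in label_masks]
    )

    # For the shorter sequence the mask now selects left-side pad tokens rather
    # than the real tokens: inputs and labels are misaligned.
    print(batch["input_ids"][0])
    print(label_masks[0])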

Printing the dataloader output in trainer_sft.py also confirms the issue:

    from torch.utils.data import DataLoader

    train_dataloader = DataLoader(train, collate_fn=train_collate_fn, batch_size=9, shuffle=True)
    for batch in train_dataloader:
        for idx, question in enumerate(batch['input_ids']):
            # Decode only the tokens selected by the label mask; with a left-padding
            # tokenizer the mask covers pad tokens instead of the intended targets.
            print('-------')
            print(tokenizer.decode(question[batch['label_masks'][idx]]).replace('</s>', '') + '\n')

I think padding_side is never set to "right" anywhere in the trainer_sft.py pipeline, so the LLaMA models we have trained so far are a bit faulty by default.
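One way to catch this early (a sketch only; ensure_right_padding is a hypothetical helper, not something that exists in the current trainer_sft.py) would be to validate the tokenizer right after it is loaded:

    import logging

    logger = logging.getLogger(__name__)

    def ensure_right_padding(tokenizer):
        """Hypothetical guard: warn about and fix tokenizers that do not pad on the right."""
        if tokenizer.padding_side != "right":
            logger.warning("padding_side was %r, forcing it to 'right'", tokenizer.padding_side)
            tokenizer.padding_side = "right"
        return tokenizer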

theblackcat102 commented 1 year ago

An easy fix would be to set padding_side = 'right' in the DialogueDataCollator's __post_init__ function:

from dataclasses import dataclass

@dataclass
class DialogueDataCollator:
    ...

    def __post_init__(self):
        assert self.tokenizer.eos_token
        # Force right padding so tokenizer.pad(input_ids, attention_mask) pads on
        # the same side as the right-padded label_mask.
        self.tokenizer.padding_side = 'right'
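With the fix in place, a quick sanity check (a sketch, assuming the same dataloader and batch keys as the debug loop above) is that the attention mask is right-padded and that no label position falls on padding:

    for batch in train_dataloader:
        attn = batch['attention_mask']
        # Right padding means each row is a run of 1s followed by 0s,
        # i.e. values never increase along the sequence dimension.
        assert (attn[:, :-1] >= attn[:, 1:]).all()
        # No supervised label position should point at a padded position.
        assert not (batch['label_masks'] & (attn == 0)).any()
        break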