charent / ChatLM-mini-Chinese

A 0.2B-parameter Chinese dialogue model (ChatLM-Chinese-0.2B). Open-sources the full pipeline: dataset sources, data cleaning, tokenizer training, model pretraining, SFT instruction fine-tuning, and RLHF optimization. Supports SFT fine-tuning for downstream tasks, with a worked example of fine-tuning for triple-based information extraction.
Apache License 2.0
1.13k stars 135 forks

Shape mismatch when running train.py #36

Closed huluk98 closed 4 months ago

huluk98 commented 6 months ago

When pretraining locally with train.py, I get accelerate.utils.operations.DistributedOperationException: Cannot apply desired operation due to shape mismatches. All shapes across devices must be valid. Input shapes: - Process 0: [16, 174] - Process 1: [16, 167]

charent commented 6 months ago
  1. Check that your accelerate and transformers versions match the ones pinned in requirements.txt.
  2. Are you padding your training batches? Check whether every sample in the batch has length 174 (a quick check is sketched below).
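For reference, a minimal padding check (illustrative, not the repo's code; assumes the project's tokenizer files live under model_save/tokenizer and define a pad token):

from transformers import PreTrainedTokenizerFast

# Load the project's tokenizer (the path is an assumption; adjust as needed)
tokenizer = PreTrainedTokenizerFast.from_pretrained('./model_save/tokenizer')

batch = tokenizer(['你好', '你好,请问今天天气怎么样?'], padding=True, return_token_type_ids=False)

# With padding=True, every row is padded to the longest sample in the batch,
# so all rows of input_ids share one length.
lengths = {len(ids) for ids in batch.input_ids}
assert len(lengths) == 1, f'batch is not padded, lengths: {lengths}'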
huluk98 commented 6 months ago

Thanks for the reply; I probably forgot to do the padding.

huluk98 commented 6 months ago

I still haven't figured out at which step the padding of the training batches is applied.

charent commented 6 months ago

In my code the padding is done in the collate_fn function; see dataset.py#L102.

def collate_fn(self, data: list[list]) -> dict:
    '''
    Merge a batch of samples and return it as tensors.
    '''
    tokenizer = self.tokenizer

    # padding=True pads every sample to the longest one in this batch
    prompt = tokenizer([item[0] for item in data], padding=True, return_token_type_ids=False)
    response = tokenizer([item[1] for item in data], padding=True, return_token_type_ids=False)

    # requires module-level imports: from numpy import array, int64; from torch import LongTensor
    input_ids = array(prompt.input_ids, dtype=int64)
    input_mask = array(prompt.attention_mask, dtype=int64)
    target_ids = array(response.input_ids, dtype=int64)

    ret = {
        'input_ids': LongTensor(input_ids),
        'input_mask': LongTensor(input_mask),
        'target_ids': LongTensor(target_ids),
    }
    return ret
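Note on the design: padding=True pads only to the longest sample within each batch, so the sequence length legitimately varies from batch to batch. Under multi-GPU data parallelism each process assembles its own batch, which is exactly how shapes like [16, 174] on process 0 and [16, 167] on process 1 can coexist.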
huluk98 commented 6 months ago

It's strange: dataset.py itself shows no problem at all, but the shape mismatch appears as soon as it reaches the pretrain evaluation step.
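Some context on why this tends to surface only at evaluation: eval loops usually gather tensors across processes, and gathering fails when per-process sequence lengths differ. A sketch of a common accelerate pattern (an illustration, not the fix adopted in this thread) that pads across processes before gathering:

from accelerate import Accelerator

accelerator = Accelerator()

def gather_padded(target_ids, pad_id: int = 0):
    # pad_across_processes pads dim=1 up to the longest length on any process,
    # so [16, 167] and [16, 174] both become [16, 174] before gather()
    target_ids = accelerator.pad_across_processes(target_ids, dim=1, pad_index=pad_id)
    return accelerator.gather(target_ids)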

charent commented 5 months ago

You can check whether the shapes produced by your data iteration look right, like this:

# Quick shape check (requires: from torch.utils.data import DataLoader, plus the
# MyDataset / ParquetDataset / PROJECT_ROOT definitions from this repo's dataset.py)
if __name__ == '__main__':
    parquet_file = PROJECT_ROOT + '/data/my_valid_dataset.parquet'
    tokenizer_dir = PROJECT_ROOT + '/model_save/tokenizer'

    # example 1: MyDataset with per-batch padding in collate_fn
    dataset = MyDataset(parquet_file, tokenizer_dir, keep_in_memory=False, max_seq_len=128)
    print('\nexample 1, dataset size: ', len(dataset))
    dataloader = DataLoader(dataset, batch_size=32, collate_fn=dataset.collate_fn)

    for epoch in range(2):
        print('epoch: {}'.format(epoch))
        for step, batch in enumerate(dataloader):
            x, x_mask, y = batch['input_ids'], batch['input_mask'], batch['target_ids']
            # x and x_mask must always agree on the sequence dimension
            print('step:{}'.format(step), x.shape, x_mask.shape, y.shape)
            if step == 5:
                break

    # example 2: ParquetDataset, same shape check on its 'train' split
    dataset = ParquetDataset(parquet_file, tokenizer_dir, keep_in_memory=True, max_len=32)
    dataloader = DataLoader(dataset['train'], batch_size=32, collate_fn=dataset.collate_fn)
    print('\nexample 2, dataset size: ', dataset.get_dataset_size('train'))

    for epoch in range(2):
        print('epoch: {}'.format(epoch))
        for step, batch in enumerate(dataloader):
            x, x_mask, y = batch['input_ids'], batch['input_mask'], batch['target_ids']
            print('step:{}'.format(step), x.shape, x_mask.shape, y.shape)
            if step == 5:
                break
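When reading the printed shapes: it is normal for them to change between steps, since each batch is padded to its own longest sample. Within one step, x and x_mask must match exactly, while y comes from the separately padded response side of collate_fn and may have a different length.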
huluk98 commented 5 months ago

Here's the situation: using your tokenizer and the three test datasets, the error appears consistently with multiple GPUs, but there is no problem on a single GPU. It's a tensor-shape issue.

huluk98 commented 5 months ago

Also, when training on the test datasets with a single GPU, the run exits right after the 7th epoch finishes.

charent commented 5 months ago

Try adding truncation to the token_to_id part of the code; on my side the lengths were already capped during data cleaning.
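A minimal sketch of that suggestion (illustrative; this token_to_id is a stand-in for the repo's encoding step, and the max_seq_len cap is an assumption):

# Illustrative truncation, not the repo's exact token_to_id code:
# hard-cap every encoded sample so no batch can exceed max_seq_len.
def token_to_id(tokenizer, text: str, max_seq_len: int = 320) -> list[int]:
    ids = tokenizer(text, return_token_type_ids=False).input_ids
    return ids[:max_seq_len]  # truncate over-length samples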

charent commented 5 months ago

You can also take a look at issues/37. These issues all look alike: some samples in the dataset are simply too long.
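For anyone hitting the same thing, a quick way to scan a parquet file for over-length samples (illustrative; the prompt/response column names and the 320-character cap are assumptions, adjust to your data):

import pandas as pd

# Count samples whose raw text exceeds the assumed length cap
df = pd.read_parquet('./data/my_valid_dataset.parquet')
mask = (df['prompt'].str.len() > 320) | (df['response'].str.len() > 320)
print(f'{mask.sum()} of {len(df)} samples exceed 320 characters')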