OpenLMLab / LOMO

LOMO: LOw-Memory Optimization

Dataset question #18

Open wanghao-007 opened 1 year ago

wanghao-007 commented 1 year ago

Hi, I was really excited to see your paper; it gives those of us without big GPU budgets some hope. One question: how do I plug in a custom dataset? I have an instruction dataset in the llama-alpaca format.

KaiLv69 commented 1 year ago

Hi, thanks for your interest in LOMO. To use a custom dataset, you can modify src/mydatasets.py and DataCollatorForCauselLM() in src/utils.py. If you want to run evaluation, set predict_with_generate to True in the arguments and pass the corresponding compute_metrics() to the trainer. There is a similar issue here: #8.
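
As a rough starting point (not code from the repo), an alpaca-format instruction file can be loaded with the Hugging Face datasets library and then fed into the tokenization logic in src/mydatasets.py; the file path and split ratio below are made up for illustration:

    # Minimal sketch: load a llama-alpaca style JSONL file
    # (one {"instruction", "input", "output"} object per line) and split it.
    from datasets import load_dataset

    raw = load_dataset("json", data_files="alpaca_data.jsonl")["train"]  # hypothetical path
    splits = raw.train_test_split(test_size=0.05, seed=42)
    train_set, eval_set = splits["train"], splits["test"]

    # Each row can then be mapped to the source/target pair that mydatasets.py tokenizes.
    row = train_set[0]
    source, target = row["instruction"] + row["input"], row["output"]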

lileishitou commented 1 year ago

File "/home/ec2-user/SageMaker/LOMO/src/lomo.py", line 114, in func if self.loss_scaler and self.loss_scaler.has_overflow_serial or self.loss_scaler._has_inf_or_nan(p.grad): AttributeError: 'NoneType' object has no attribute '_has_inf_or_nan' , 换成了 llama-alpaca, 报上述错误。

KaiLv69 commented 1 year ago

File "/home/ec2-user/SageMaker/LOMO/src/lomo.py", line 114, in func if self.loss_scaler and self.loss_scaler.has_overflow_serial or self.loss_scaler._has_inf_or_nan(p.grad): AttributeError: 'NoneType' object has no attribute '_has_inf_or_nan' , 换成了 llama-alpaca, 报上述错误。

你好,可能是条件判断的问题,请参考这个commit: https://github.com/OpenLMLab/LOMO/commit/24cde8e91feac437809bf7790f4727623dce6a76 如果仍然有问题,请您附上更详细的配置和代码。
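
For context, the failing line is a Python operator-precedence issue: `a and b or c` parses as `(a and b) or c`, so when `self.loss_scaler` is None the last clause still dereferences it. A minimal sketch of the kind of guard the linked commit introduces (paraphrased, not the exact repo code):

    # Sketch: guard the whole overflow check behind an explicit None test,
    # so nothing is called on loss_scaler when it is not in use (e.g. bf16 training).
    if self.loss_scaler is not None and (
        self.loss_scaler.has_overflow_serial or self.loss_scaler._has_inf_or_nan(p.grad)
    ):
        ...  # skip this parameter update when an overflow / inf / NaN is detected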

lileishitou commented 1 year ago

I trained with that commit and modified the eval method, but GPU memory usage during training is still too high. In particular, the 7B model runs out of memory at max_input_len=1024 (all four 24 GB A10 cards in use), and the 33B model runs out of memory even at max_input_len=256 (all six 40 GB cards in use).

KaiLv69 commented 1 year ago

> I trained with that commit and modified the eval method, but GPU memory usage during training is still too high. In particular, the 7B model runs out of memory at max_input_len=1024 (all four 24 GB A10 cards in use), and the 33B model runs out of memory even at max_input_len=256 (all six 40 GB cards in use).

Hi, does the out-of-memory happen during training or during generation? For training, please set gradient_checkpointing=True. If it is already True, please attach your code and config files so we can look into it further.
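
As a side note, gradient checkpointing can also be switched on directly on the model object with the standard Hugging Face API; the sketch below uses a placeholder model path. It trades extra forward recomputation for lower activation memory:

    # Sketch: enable gradient checkpointing on a Hugging Face causal LM.
    from transformers import AutoModelForCausalLM

    model = AutoModelForCausalLM.from_pretrained("path/to/llama-7b")  # placeholder path
    model.gradient_checkpointing_enable()   # recompute activations in the backward pass
    model.config.use_cache = False          # the KV cache is incompatible with checkpointing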

lileishitou commented 1 year ago

It overflows during training, but a shorter max_input_len can train (33B with max_input_len=64 trains on six 40 GB cards). gradient_checkpointing=True is already set in args_lomo.yaml. The memory reduction is not very noticeable, and I don't see anything like the memory-reduction printout from the binary classification example.

KaiLv69 commented 1 year ago

> It overflows during training, but a shorter max_input_len can train (33B with max_input_len=64 trains on six 40 GB cards). gradient_checkpointing=True is already set in args_lomo.yaml. The memory reduction is not very noticeable, and I don't see anything like the memory-reduction printout from the binary classification example.

Understood. If convenient, please share your full training code so we can investigate further.

lileishitou commented 1 year ago

Changes:

(1) I modified the process part of mydatasets.py. The input data is one JSON sample per line, roughly like this:

    # (imports and constants such as IGNORE_INDEX come from the original mydatasets.py)
    class MyDataset(Dataset):
        def __init__(self, data_args, tokenizer, split):
            super().__init__()
            self.data_args = data_args
            self.tokenizer = tokenizer
            self.split = split
            self.sample_size = 300000
            # self.sample_size = dataset_info.sample_size
            # self.prompt_type = dataset_info.prompt_type

            # save_dir = os.path.join(data_args.data_dir, data_args.dataset_name, data_args.data_tag)
            save_dir = "/home/duser/lilei_workspace/LOMO_0711/LOMO/data/wic/base"
            if not os.path.exists(save_dir):
                os.makedirs(save_dir, exist_ok=True)
            save_file = os.path.join(save_dir, f'{split}.pt')
            if data_args.refresh or not os.path.exists(save_file):
                dataset = get_raw_dataset("MultiTurnAlpaca", "", 1234, 0)  # dict with the train and test splits loaded via DatasetDict.from_json
                self.data = self.process(dataset, save_file)
            else:
                print('Loading data from', save_file)
                self.data = torch.load(save_file)
            print('Data size:', len(self.data))
            print('Data format:', self.data[0])
            print('Max length:', max([len(d['input_ids']) for d in self.data])) if self.split == 'train' else \
                print('Max length:', max([max([len(d) for d in dd['input_ids']]) for dd in self.data]))

        def process(self, dataset, save_file):
            data = []
            for instance in tqdm(dataset.raw_datasets[self.split]):
                source = instance['instruction']
                target = instance['output']
                targets = []

                def _tokenize_fn(source, target):
                    # Concatenate source and target, then mask the source tokens in the labels
                    # unless train_on_inputs is set.
                    targets.append(target)
                    example = f"{source}{target}"
                    example_tokenized = self.tokenizer.encode(example, truncation=True, max_length=self.data_args.data_max_length)
                    example_tokenized = example_tokenized + [self.tokenizer.eos_token_id]
                    source_tokenized = self.tokenizer.encode(source, truncation=True, max_length=self.data_args.data_max_length)  # changed from the original code
                    input_ids = example_tokenized
                    labels = copy.deepcopy(input_ids)
                    if not self.data_args.train_on_inputs:
                        labels = np.array(labels)
                        labels[:len(source_tokenized) - 1] = IGNORE_INDEX
                    return input_ids, labels

                if self.split == 'train':
                    input_ids, labels = _tokenize_fn(source, target)
                else:
                    input_ids, labels = _tokenize_fn(source, target)
                    input_ids = [input_ids]  # changed from the original code
                    labels = [labels]        # changed from the original code
                    print("test labels:", labels)
                    print("test target:", target)

                data.append({'input_ids': input_ids,
                             'labels': labels,
                             'source': source,
                             'target': targets,
                             'answer': 0})
            if self.sample_size > 0 and len(data) > self.sample_size:
                random.seed(REPRODUCIBILITY_SEED)
                possible_idxs = list(range(len(data)))
                sampled_idxs = random.sample(possible_idxs, self.sample_size)
                data = [data[i] for i in sampled_idxs]  # e.g. sample 1000 examples out of len(data)
                print(f'Sampled {self.sample_size} examples from {len(possible_idxs)} examples.')

            torch.save(data, save_file)
            print('Saving data to', save_file)
            return data

(2) The args_lomo.yaml config file:

    # model
    model_name_or_path: '/home/duser/xxx_workspace/vicuna-33b-v1.3'
    model_name_or_path: '/home/duser/xxx_workspace/alpaca-rlhf/LOMO/7b'
    # data
    dataset_name: 'wic'
    refresh: false
    data_tag: 'base'
    train_on_inputs: false
    data_max_length: 1024
    # training
    # trainer
    tag: 'lomo'
    output_dir: 'outputs'
    overwrite_output_dir: true
    deepspeed: 'config/ds_config.json'
    do_train: true
    do_eval: true
    evaluation_strategy: 'epoch'
    per_device_train_batch_size: 4
    per_device_eval_batch_size: 2
    learning_rate: 0.03
    weight_decay: 0
    num_train_epochs: 1
    lr_scheduler_type: 'linear'
    warmup: 0.1
    clip_grad_norm: 1.0
    save_strategy: 'epoch'
    save_total_limit: 0
    seed: 142
    bf16: true
    remove_unused_columns: false
    load_best_model_at_end: false
    metric_for_best_model: 'acc'
    group_by_length: false
    report_to: 'wandb'
    dataloader_pin_memory: false
    gradient_checkpointing: true
    predict_with_generate: true

(3) eval_step has already been modified following the changes discussed in https://github.com/OpenLMLab/LOMO/issues/8

(4) compute_metrics has not been modified yet, so evaluation is still reported as accuracy (not yet switched to perplexity).
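
For reference, perplexity is just the exponential of the average token-level cross-entropy loss, so it can be computed independently of the trainer; a minimal sketch, assuming a Hugging Face causal LM and labels that use -100 for ignored positions:

    # Sketch: perplexity of a causal LM on one tokenized batch.
    import math
    import torch

    def perplexity(model, input_ids, labels):
        with torch.no_grad():
            loss = model(input_ids=input_ids, labels=labels).loss  # mean CE over non-ignored tokens
        return math.exp(loss.item())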

lileishitou commented 1 year ago

The dataset has one JSON sample per line; in general, instruction is used as the source and output as the target. For example:

{"instruction": "A conversation takes place between Amy and his or her friend. Kevin responded to his or her friend's questions with everyday, humorous, witty answers. Amy is a neurobiologist and Sheldon's girlfriend, initially introduced to him by Howard and Raj on a dating website. She shares Sheldon's scientific mind but desires social and sexual experiences. Amy has a Ph.D. in neurobiology and often uses monkeys as experimental subjects. Despite being similar to Sheldon in personality, she has more social knowledge and persuades him to participate in various social activities. Amy becomes close friends with Penny and Bernadette and occasionally displays lesbian traits. After a long time of not having any physical relationship with Sheldon, they finally kiss on a train on Valentine's Day and eventually get engaged and married. They accidentally discover the theory of super asymmetry and win the Nobel Prize together.The context of the conversation is The stairwell. Leonard: I was unstoppable. I mean, I was, I was on fire. It was like my mind and my body were totally connected, like, like athletes must feel when they\u2019re in the zone. Penny: Again, it was miniature golf. Leonard: Admit it, you\u2019re a little turned on. Penny: You can\u2019t be this proud. Leonard: Why not? Penny: Because I beat you. Leonard: Hey. Penny: Hi. Sheldon: Oh, good. You\u2019re back. Amy:", "input": "", "output": "We have some exciting news."}

KaiLv69 commented 1 year ago

Looks fine to me. How is the GPU memory usage if you run the wic task directly with the code in the repo? BTW, if you want a longer seq_len, you can reduce per_device_train_batch_size from 4 to 1.

lileishitou commented 1 year ago

These runs all use the standard wic data: 7B model, max_seq_len = 2048, per-device train batch size 16, per-device eval batch size 16, all 6 cards in use, about 37 GB of memory per card on average. 33B model, max_seq_len = 1024, per-device train batch size 1, per-device eval batch size 1, all 6 cards in use, about 47 GB per card on average.

lileishitou commented 1 year ago

So it is strange that with my own dataset, the 33B model with per-device train and eval batch size 1 can only go up to max_seq_len = 64 before running out of memory.

KaiLv69 commented 1 year ago

> These runs all use the standard wic data: 7B model, max_seq_len = 2048, per-device train batch size 16, per-device eval batch size 16, all 6 cards in use, about 37 GB of memory per card on average. 33B model, max_seq_len = 1024, per-device train batch size 1, per-device eval batch size 1, all 6 cards in use, about 47 GB per card on average.

The wic data is quite short to begin with, so a larger max_seq_len does not change its actual sequence lengths. For wic, even with batch size 16, the 7B model should not need as much as 6 x 37 GB of memory.
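
One way to check whether sequence length is the culprit (a hedged sketch, not part of the repo) is to compare the token-length distribution of the processed wic data with that of the custom dataset; the .pt path below is a placeholder:

    # Sketch: inspect token lengths in a processed .pt file saved by mydatasets.py.
    import torch

    data = torch.load("data/wic/base/train.pt")  # placeholder path
    lengths = sorted(len(d["input_ids"]) for d in data)
    print("min / median / max:", lengths[0], lengths[len(lengths) // 2], lengths[-1])
    print("examples over 512 tokens:", sum(l > 512 for l in lengths))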

Zheng-Jay commented 1 year ago

> Hi, thanks for your interest in LOMO. To use a custom dataset, you can modify src/mydatasets.py and DataCollatorForCauselLM() in src/utils.py. If you want to run evaluation, set predict_with_generate to True in the arguments and pass the corresponding compute_metrics() to the trainer. There is a similar issue here: #8.

Hi, I'm fine-tuning llama2 on Chinese with an instruction dataset. I modified mydatasets.py, but the results on the C-Eval benchmark are very poor. I'm not sure whether my modifications are wrong; could you please take a look?

Dataset example:

{
    "instruction": "我有一个计算机相关的问题,请用中文回答,什么是 计算机科学与技术",
    "input": "",
    "output": "计算机科学与技术(Computer Science and Technology)是一门普通高等学校本科专业,属于计算机类专业,基本修业年限为四年,授予工学或理学学士学位;2012年9月,教育部将新的计算机科学与技术专业取代旧的计算机科学与技术和仿真科学与技术两个专业。 \n计算机科学与技术是一个计算机系统与网络兼顾的计算机学科宽口径专业,旨在培养具有良好的科学素养,具有自主学习意识和创新意识,科学型和工程型相结合的计算机专业高水平工程技术人才。"
},
{
    "instruction": "我有一个信息科学相关的问题,请用中文回答,什么是 零售业务提供商",
    "input": "",
    "output": "零售业务提供商(retail service provider)是指向最终用户提供服务的运营商。 \n零售业务提供商可接受连接提供商、第三方业务提供商、代理商及其他零售商提供的业务;可为消费者或其他零售商提供业务者。 \n消费者既可以是住宅终端用户,也可是企业终端用户。这两种用户是业务的最终使用者。零售业务提供商管理用户的业务接入和定购。它既能确保业务提供,又能将这一功能承包给第三方业务提供商。"
}

Modified code:

The following modification to the get_dataset_info() function in mydatasets.py mimics the original processing logic; it adds a new branch to handle a local dataset.

    # load from a local dataset
    elif dataset_name == 'local':
        return DatasetInfo(
            path="super_glue",
            name="local",
            exemplar_split="train",
            eval_split="validation",
            sample_size=99999999999999999,
            prompt_type='natural',
            extractor=lambda row: {
                "parts": [
                    QuestionPart(
                        row['instruction'] + row['input']
                    ),
                ],
                "choices": [
                    row['output']
                ],
                "answer_idx":0
            }