OpenMOSS / MOSS

An open-source tool-augmented conversational language model from Fudan University
https://txsun1997.github.io/blogs/moss.html
Apache License 2.0

Fine-tuning exported a 67 GB bin file; what should I do next? #327

Open mayurou opened 1 year ago

mayurou commented 1 year ago

I copied config.json and the other files from the original model's directory into the same directory as the newly generated bin file, and changed all the shard names in pytorch_model.bin.index to pytorch_model.bin. Starting moss_inference.py then fails with TypeError: expected str, bytes or os.PathLike object, not NoneType. Has anyone successfully deployed the fine-tuned model for inference? How should it be done? Also, could I get an invite to the official WeChat group?

lhtpluto commented 1 year ago

> I copied config.json and the other files from the original model's directory into the same directory as the newly generated bin file, and changed all the shard names in pytorch_model.bin.index to pytorch_model.bin. Starting moss_inference.py then fails with TypeError: expected str, bytes or os.PathLike object, not NoneType. Has anyone successfully deployed the fine-tuned model for inference? How should it be done? Also, could I get an invite to the official WeChat group?

How did you generate your .bin file? I only got .pt files.

mayurou commented 1 year ago

> I copied config.json and the other files from the original model's directory into the same directory as the newly generated bin file, and changed all the shard names in pytorch_model.bin.index to pytorch_model.bin. Starting moss_inference.py then fails with TypeError: expected str, bytes or os.PathLike object, not NoneType. Has anyone successfully deployed the fine-tuned model for inference? How should it be done? Also, could I get an invite to the official WeChat group?

> How did you generate your .bin file? I only got .pt files.

Run the zero_to_fp32.py script produced after fine-tuning. The command is python zero_to_fp32.py <directory containing the .pt files> <output path for the merged file, e.g. checkpoint/pytorch_model.bin>. For example: python zero_to_fp32.py ./ checkpoint/pytorch_model.bin

But after the bin file is saved, I don't know what to do next to run inference...
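For reference, DeepSpeed also exposes this consolidation as a helper function, so the merge can be scripted instead of calling zero_to_fp32.py by hand. A minimal sketch, assuming the ZeRO shards live in the current directory and the output path matches the command-line example above:

```python
# Consolidate ZeRO checkpoint shards into one fp32 state dict (paths are placeholders).
import torch
from deepspeed.utils.zero_to_fp32 import get_fp32_state_dict_from_zero_checkpoint

checkpoint_dir = "./"  # directory containing the ZeRO .pt shards and the "latest" tag file
state_dict = get_fp32_state_dict_from_zero_checkpoint(checkpoint_dir)

# Same result as `python zero_to_fp32.py ./ checkpoint/pytorch_model.bin`
torch.save(state_dict, "checkpoint/pytorch_model.bin")
```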

lhtpluto commented 1 year ago

Thanks a lot. I'm stuck at the same huge 60 GB fp32 file.

Do we need to convert it to fp16 (half) first?

mayurou commented 1 year ago

> Thanks a lot. I'm stuck at the same huge 60 GB fp32 file.

> Do we need to convert it to fp16 (half) first?

I just load the 60 GB file onto a single GPU for inference, calling .half().cuda() at load time. My GPU has 80 GB, which is still enough for now.
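A minimal sketch of that loading path, assuming the merged pytorch_model.bin sits next to the copied config and tokenizer files in ./checkpoint (the directory name and the models.* import paths are assumptions based on the repo layout). Re-saving after .half() also addresses the fp16 question above, since the weights are then written back at half precision:

```python
import torch
from models.modeling_moss import MossForCausalLM   # import paths assumed from the repo layout
from models.tokenization_moss import MossTokenizer

model_path = "./checkpoint"  # merged pytorch_model.bin + config.json + tokenizer files (placeholder)
tokenizer = MossTokenizer.from_pretrained(model_path)

# Load the fp32 weights, then cast to fp16 and move to a single GPU,
# which is roughly half the fp32 checkpoint size in VRAM.
model = MossForCausalLM.from_pretrained(model_path)
model = model.half().cuda()
model.eval()

# Optional: write the weights back in fp16 so the next load skips the 60+ GB fp32 file.
model.save_pretrained("./checkpoint-fp16")
tokenizer.save_pretrained("./checkpoint-fp16")
```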

mayurou commented 1 year ago

On my current experimental data, MOSS fine-tuned this way reaches accuracy close to ChatGLM-6B (MOSS 85%, ChatGLM 86%), both well below Chinese-Alpaca-Plus-13B at 91%.

yihuaxiang commented 1 year ago

Could someone explain how to fine-tune the model? I want to develop my own plugin and fine-tune my own model, but I haven't found documentation for this 🙏

MiyazonoKaori commented 1 year ago

Modify where the checkpoint is saved:

```python
self.accelerator.wait_for_everyone()
unwrapped_model = self.accelerator.unwrap_model(self.model)
unwrapped_model.save_pretrained(
    save_dir,
    is_main_process=self.accelerator.is_main_process,
    save_function=self.accelerator.save,
    state_dict=self.accelerator.get_state_dict(self.model),
)
```

Or modify the loading code:

```python
from accelerate import init_empty_weights, load_checkpoint_and_dispatch, infer_auto_device_map
from accelerate.utils import get_balanced_memory

config = MossConfig.from_pretrained(model_path)
tokenizer = MossTokenizer.from_pretrained(model_path)

with init_empty_weights():
    raw_model = MossForCausalLM._from_config(config, torch_dtype=torch.float16)
raw_model.tie_weights()

max_memory = get_balanced_memory(
    raw_model,
    max_memory=None,
    no_split_module_classes=["MossBlock"],
    dtype=torch.float16,
    low_zero=False,
)

device_map = infer_auto_device_map(
    raw_model,
    max_memory=max_memory,
    no_split_module_classes=["MossBlock"],
    dtype=torch.float16,
)

model = MossForCausalLM.from_pretrained(
    model_path,
    device_map=device_map,
    offload_folder="offload",
    offload_state_dict=True,
    torch_dtype=torch.float16,
)
```
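Once the dispatched model above is built, generation follows the usual Hugging Face pattern. A hedged usage sketch (the <|Human|>/<|MOSS|> prompt format and the sampling parameters follow the MOSS README; the exact strings may differ for a fine-tuned model):

```python
import torch

query = "<|Human|>: 你好<eoh>\n<|MOSS|>:"
inputs = tokenizer(query, return_tensors="pt")
# With a device_map, the first shard usually sits on cuda:0; adjust if your map differs.
inputs = {k: v.to("cuda:0") for k, v in inputs.items()}

with torch.no_grad():
    outputs = model.generate(
        **inputs,
        do_sample=True,
        temperature=0.7,
        top_p=0.8,
        repetition_penalty=1.02,
        max_new_tokens=256,
    )
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```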

lhtpluto commented 1 year ago

Could someone please share a complete finetune_moss.py that saves in fp16?

Also, I trained with accelerator = Accelerator(mixed_precision='bf16'); will that work too?

lhtpluto commented 1 year ago

```
Epoch: 0, Step: 19, Val loss: 1.11943359375, Val acc: 0.7218769788742065
Traceback (most recent call last):
  /home/ai/MOSS/finetune_moss.py:324 in <module>
      321   os.makedirs(args.output_dir, exist_ok=True)
      322
      323   set_seed(args.seed)
  ❱   324   train(args)
  /home/ai/MOSS/finetune_moss.py:284 in train
      281   #model.save_checkpoint(args.output_dir, global_step)
      282   accelerator.wait_for_everyone()
      283   unwrapped_model = accelerator.unwrap_model(model)
  ❱   284   unwrapped_model.save_checkpoint(args.output_dir, global_step)
      285   # unwrapped_model.save_pretrained(
      286   #     args.output_dir,
      287   #     is_main_process=accelerator.is_main_process,
  /root/anaconda3/envs/moss/lib/python3.8/site-packages/torch/nn/modules/module.py:1630 in __getattr__
     1627       modules = self.__dict__['_modules']
     1628       if name in modules:
     1629           return modules[name]
  ❱  1630       raise AttributeError("'{}' object has no attribute '{}'".format(
     1631           type(self).__name__, name))
     1632
     1633   def __setattr__(self, name: str, value: Union[Tensor, 'Module']) -> None:
AttributeError: 'MossForCausalLM' object has no attribute 'save_checkpoint'
[2023-06-16 08:54:52,657] torch.distributed.elastic.multiprocessing.api: [ERROR] failed (exitcode: 1) local_rank: 0 (pid: 818) of binary: /root/anaconda3/envs/moss/bin/python
```

This is the error I get after trying to modify the checkpoint-saving code.
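The error is consistent with how the script is wrapped: save_checkpoint is a method of the DeepSpeed engine that wraps the model, while the unwrapped Hugging Face model only exposes save_pretrained. A sketch of the two working options inside finetune_moss.py's train loop, using the variable names visible in the traceback above:

```python
# Option 1: DeepSpeed-style checkpoint, called on the *wrapped* model
# (what the commented-out line 281 did); produces ZeRO .pt shards to merge later.
model.save_checkpoint(args.output_dir, global_step)

# Option 2: Hugging Face-style export, called on the *unwrapped* model
# (the pattern suggested earlier in this thread); writes pytorch_model*.bin directly.
accelerator.wait_for_everyone()
unwrapped_model = accelerator.unwrap_model(model)
unwrapped_model.save_pretrained(
    args.output_dir,
    is_main_process=accelerator.is_main_process,
    save_function=accelerator.save,
    state_dict=accelerator.get_state_dict(model),
)
```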

MiyazonoKaori commented 1 year ago

finetune_moss - 副本.txt

lhtpluto commented 1 year ago

> finetune_moss - 副本.txt

Thank you very much.

Tested and working, and it directly generates: config.json, generation_config.json, pytorch_model.bin.index.json, pytorch_model-00001-of-00004.bin, pytorch_model-00002-of-00004.bin, pytorch_model-00003-of-00004.bin, pytorch_model-00004-of-00004.bin

lhtpluto commented 1 year ago

> finetune_moss - 副本.txt

Please advise: train.jsonl contains only 20 samples and val.jsonl also only 20, yet after using finetune_moss - 副本.txt the log reports "Load data successfully, total 1000 training samples".

I'm confused; the amount of training data suddenly increased 50x.

My temporary workaround is to preprocess the data with the original finetune_moss.py first, then train with your finetune_moss - 副本.txt; done that way, the training sample count is normal.

MiyazonoKaori commented 1 year ago

1686905626110 That part was for testing; I forgot to delete it.

lhtpluto commented 1 year ago

> 1686905626110 That part was for testing; I forgot to delete it.

```
Time to load fused_adam op: 0.1592857837677002 seconds
/root/anaconda3/envs/moss/lib/python3.8/site-packages/deepspeed/ops/adam/fused_adam.py:96: UserWarning: The torch.cuda.*DtypeTensor constructors are no longer recommended. It's best to use methods such as torch.tensor(data, dtype=*, device='cuda') to create tensors. (Triggered internally at ../torch/csrc/tensor/python_tensor.cpp:83.)
  self._dummy_overflow_buf = get_accelerator().IntTensor([0])
06/16/2023 17:06:01 - INFO - __main__ - Loading data...
INFO:__main__:Loading data...
06/16/2023 17:06:01 - INFO - __main__ - Load data successfully, total 0 training samples
INFO:__main__:Load data successfully, total 0 training samples
Traceback (most recent call last):
  /home/ai/MOSS/finetune_moss.py:366 in <module>
      363   os.makedirs(args.output_dir, exist_ok=True)
      364
      365   set_seed(args.seed)
  ❱   366   train(args)
  /home/ai/MOSS/finetune_moss.py:217 in train
      214   optimizer = AdamOptimizer(optimizer_grouped_parameters, lr=args.learning_rate)
      215
      216   train_dataset = SFTDataset(args.data_dir, tokenizer)
  ❱   217   train_dataloader = DataLoader(train_dataset, batch_size=args.train_bsz_per_gpu, shuf…
      218
      219   val_dataset = SFTDataset(args.data_dir, tokenizer, data_type='val')
      220   val_dataloader = DataLoader(val_dataset, batch_size=args.eval_bsz_per_gpu, shuffle=F…
  /root/anaconda3/envs/moss/lib/python3.8/site-packages/torch/utils/data/dataloader.py:351 in __init__
      348       sampler = _InfiniteConstantSampler()
      349   else:  # map-style
      350       if shuffle:
  ❱   351           sampler = RandomSampler(dataset, generator=generator)  # type: ignor…
      352       else:
      353           sampler = SequentialSampler(dataset)  # type: ignore[arg-type]
  /root/anaconda3/envs/moss/lib/python3.8/site-packages/torch/utils/data/sampler.py:141 in __init__
      138                       "replacement={}".format(self.replacement))
      139
      140   if not isinstance(self.num_samples, int) or self.num_samples <= 0:
  ❱   141       raise ValueError("num_samples should be a positive integer "
      142                        "value, but got num_samples={}".format(self.num_samples))
ValueError: num_samples should be a positive integer value, but got num_samples=0
```

It reports num_samples=0.

MiyazonoKaori commented 1 year ago

Just change that part back to the code from the original finetune_moss.py: drop the for loop and append directly. Use a little ingenuity here.
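To make the fix concrete, here is a hypothetical sketch of the jsonl-loading step (class and attribute names are assumptions, not the repo's exact code). The shared copy repeated each sample inside a for loop, which is what turned 20 samples into 1000; appending each sample once restores the expected count:

```python
import json
from torch.utils.data import Dataset

class SFTDatasetSketch(Dataset):
    """Hypothetical illustration of the data-loading step; not the repo's exact class."""

    def __init__(self, path):
        self.data = []
        with open(path, encoding="utf-8") as f:
            for line in f:
                sample = json.loads(line)
                # The debugging leftover looked roughly like
                #   for _ in range(50):
                #       self.data.append(sample)
                # which multiplies every sample by 50. Append once instead:
                self.data.append(sample)

    def __len__(self):
        return len(self.data)

    def __getitem__(self, idx):
        return self.data[idx]
```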

lhtpluto commented 1 year ago

> Just change that part back to the code from the original finetune_moss.py: drop the for loop and append directly. Use a little ingenuity here.

The cause of the error was that the train_data file from the previous run had not been deleted. After deleting it, everything works normally.

Thank you very much for your help.

TrWestdoor commented 1 year ago

Download the matching vocab.json and merges.txt from Hugging Face and put them into the model directory.
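A minimal sketch of fetching those files with huggingface_hub; the repo id fnlp/moss-moon-003-sft is an assumption, so substitute whichever MOSS checkpoint the fine-tune started from:

```python
# Pull the GPT-2-style tokenizer files into the fine-tuned model's directory.
# repo_id is an assumption; local_dir is the directory holding the fine-tuned weights.
from huggingface_hub import hf_hub_download

local_dir = "./checkpoint"
for filename in ("vocab.json", "merges.txt"):
    hf_hub_download(repo_id="fnlp/moss-moon-003-sft", filename=filename, local_dir=local_dir)
```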