OpenMOSS / MOSS

An open-source tool-augmented conversational language model from Fudan University
https://txsun1997.github.io/blogs/moss.html
Apache License 2.0
11.9k stars · 1.14k forks

Finetune reports out of memory #297

Open Tian14267 opened 1 year ago

Tian14267 commented 1 year ago

Hi all. I'm fine-tuning on 4 A100s with batch = 1, but I still get an out-of-memory error. What could be the cause?

Tian14267 commented 1 year ago

Setting max_len = 1024 also runs out of memory. (screenshot attached)

yangzhipeng1108 commented 1 year ago

You need 8 A100s. You can refer to this; int8 fine-tuning also works now: https://github.com/yangzhipeng1108/moss-finetune-and-moss-finetune-int8

Tian14267 commented 1 year ago

@yangzhipeng1108 Quick question: can a single A100 hold the full model? If it can, why aren't 4 cards enough? I don't quite follow. Also, can int8 be fine-tuned? I'll give it a try.

yangzhipeng1108 commented 1 year ago

deepspeed_config: zero_stage: 3 — MOSS's fine-tuning config uses DeepSpeed ZeRO stage 3, so the model is sharded across all the GPUs rather than fully replicated on each card. With batch_size = 1 on 8 cards, peak usage on every card goes above 70 GB. The official repo hasn't open-sourced int8 fine-tuning; I've adapted it, and int8 fine-tuning works now.
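
The `deepspeed_config: zero_stage: 3` fragment is the Accelerate/DeepSpeed setting that enables this sharding. A minimal sketch of the same ZeRO stage-3 setup expressed in Python — assuming Hugging Face Accelerate is the launcher, and with illustrative values for mixed precision and gradient accumulation (this is not the repo's actual training script):

```python
# Illustrative ZeRO stage-3 setup via Accelerate's DeepSpeedPlugin.
# Not MOSS's actual finetune script; mixed_precision and
# gradient_accumulation_steps here are assumptions.
from accelerate import Accelerator
from accelerate.utils import DeepSpeedPlugin

ds_plugin = DeepSpeedPlugin(
    zero_stage=3,                   # shard params, gradients and optimizer states across ranks
    gradient_accumulation_steps=1,
)
accelerator = Accelerator(deepspeed_plugin=ds_plugin, mixed_precision="fp16")
# model, optimizer, loader = accelerator.prepare(model, optimizer, loader)
```

With stage 3, each rank holds only a shard of the parameters and optimizer states, which is why no single card ever needs to fit the full model on its own.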

Tian14267 commented 1 year ago

@yangzhipeng1108 I see what you mean. So for actual training the model still has to be sharded across the cards, right? MOSS hasn't open-sourced that part either. Do you happen to have an implementation of it? For now int8 seems to be the only option.

yangzhipeng1108 commented 1 year ago

The official fine-tuning can be run as long as you have 8 A100s or A800s. After the official fine-tuning, fp16 inference with the 16B model was too slow, which is why I looked into int8; in my project int8 fine-tuning works.

Tian14267 commented 1 year ago

@yangzhipeng1108 Got it. By the way, when training with int8 I ran into a problem running run_int8_acc.sh:

Traceback (most recent call last):
  File "finetune_moss_int8_acc.py", line 319, in <module>
    train(args)           
  File "finetune_moss_int8_acc.py", line 177, in train
    model = AutoModelForCausalLM.from_pretrained(args.model_name_or_path, trust_remote_code=True, use_cache=False)
  File "/opt/conda/envs/moss/lib/python3.8/site-packages/transformers/models/auto/auto_factory.py", line 458, in from_pretrained
    return model_class.from_pretrained(
  File "/opt/conda/envs/moss/lib/python3.8/site-packages/transformers/modeling_utils.py", line 2276, in from_pretrained
    model = cls(config, *model_args, **model_kwargs)
  File "/root/.cache/huggingface/modules/transformers_modules/local/modeling_moss.py", line 608, in __init__
    self.quantize(config.wbits, config.groupsize)
  File "/root/.cache/huggingface/modules/transformers_modules/local/modeling_moss.py", line 732, in quantize
    from .quantization import quantize_with_gptq
  File "/root/.cache/huggingface/modules/transformers_modules/local/quantization.py", line 8, in <module>
    from .custom_autotune import *
ModuleNotFoundError: No module named 'transformers_modules.local.custom_autotune'
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 39133) of binary: /opt/conda/envs/moss/bin/python3.8
Traceback (most recent call last):
  File "/opt/conda/envs/moss/bin/accelerate", line 8, in <module>
    sys.exit(main())
  File "/opt/conda/envs/moss/lib/python3.8/site-packages/accelerate/commands/accelerate_cli.py", line 45, in main
    args.func(args)
  File "/opt/conda/envs/moss/lib/python3.8/site-packages/accelerate/commands/launch.py", line 909, in launch_command
    multi_gpu_launcher(args)
  File "/opt/conda/envs/moss/lib/python3.8/site-packages/accelerate/commands/launch.py", line 604, in multi_gpu_launcher
    distrib_run.run(args)
  File "/opt/conda/envs/moss/lib/python3.8/site-packages/torch/distributed/run.py", line 753, in run
    elastic_launch(
  File "/opt/conda/envs/moss/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 132, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/opt/conda/envs/moss/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 246, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError: 

Do you know what the problem is?

yangzhipeng1108 commented 1 year ago

MOSS fails to copy custom_autotune into /root/.cache/huggingface/modules/transformers_modules/local/. Just copy custom_autotune there manually and it will work.
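
A minimal sketch of that manual workaround, assuming the model is loaded from a local directory; the source path below is a placeholder to adjust:

```python
# Manually copy custom_autotune.py into the transformers_modules cache so that
# `from .custom_autotune import *` in quantization.py can resolve.
# The source path is a placeholder -- point it at the custom_autotune.py that
# ships with your local MOSS quantized checkpoint directory.
import shutil

src = "/path/to/local/moss-model-dir/custom_autotune.py"   # placeholder
dst = "/root/.cache/huggingface/modules/transformers_modules/local/custom_autotune.py"
shutil.copy(src, dst)
```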

Tian14267 commented 1 year ago

@yangzhipeng1108 Hi, I keep getting the error below:


    size mismatch for transformer.h.31.attn.qkv_proj.qweight: copying a param with shape torch.Size([1536, 18432]) from checkpoint, the shape in current model is torch.Size([768, 18432]).
    size mismatch for transformer.h.31.attn.qkv_proj.qzeros: copying a param with shape torch.Size([48, 4608]) from checkpoint, the shape in current model is torch.Size([48, 2304]).
    size mismatch for transformer.h.31.mlp.fc_in.qweight: copying a param with shape torch.Size([1536, 24576]) from checkpoint, the shape in current model is torch.Size([768, 24576]).
    size mismatch for transformer.h.31.mlp.fc_in.qzeros: copying a param with shape torch.Size([48, 6144]) from checkpoint, the shape in current model is torch.Size([48, 3072]).
    size mismatch for transformer.h.31.mlp.fc_out.qweight: copying a param with shape torch.Size([6144, 6144]) from checkpoint, the shape in current model is torch.Size([3072, 6144]).
    size mismatch for transformer.h.31.mlp.fc_out.qzeros: copying a param with shape torch.Size([192, 1536]) from checkpoint, the shape in current model is torch.Size([192, 768]).
    size mismatch for transformer.h.32.attn.out_proj.qweight: copying a param with shape torch.Size([1536, 6144]) from checkpoint, the shape in current model is torch.Size([768, 6144]).
    size mismatch for transformer.h.32.attn.out_proj.qzeros: copying a param with shape torch.Size([48, 1536]) from checkpoint, the shape in current model is torch.Size([48, 768]).
    size mismatch for transformer.h.32.attn.qkv_proj.qweight: copying a param with shape torch.Size([1536, 18432]) from checkpoint, the shape in current model is torch.Size([768, 18432]).
    size mismatch for transformer.h.32.attn.qkv_proj.qzeros: copying a param with shape torch.Size([48, 4608]) from checkpoint, the shape in current model is torch.Size([48, 2304]).
    size mismatch for transformer.h.32.mlp.fc_in.qweight: copying a param with shape torch.Size([1536, 24576]) from checkpoint, the shape in current model is torch.Size([768, 24576]).
    size mismatch for transformer.h.32.mlp.fc_in.qzeros: copying a param with shape torch.Size([48, 6144]) from checkpoint, the shape in current model is torch.Size([48, 3072]).
    size mismatch for transformer.h.32.mlp.fc_out.qweight: copying a param with shape torch.Size([6144, 6144]) from checkpoint, the shape in current model is torch.Size([3072, 6144]).
    size mismatch for transformer.h.32.mlp.fc_out.qzeros: copying a param with shape torch.Size([192, 1536]) from checkpoint, the shape in current model is torch.Size([192, 768]).
    size mismatch for transformer.h.33.attn.out_proj.qweight: copying a param with shape torch.Size([1536, 6144]) from checkpoint, the shape in current model is torch.Size([768, 6144]).
    size mismatch for transformer.h.33.attn.out_proj.qzeros: copying a param with shape torch.Size([48, 1536]) from checkpoint, the shape in current model is torch.Size([48, 768]).
    size mismatch for transformer.h.33.attn.qkv_proj.qweight: copying a param with shape torch.Size([1536, 18432]) from checkpoint, the shape in current model is torch.Size([768, 18432]).
    size mismatch for transformer.h.33.attn.qkv_proj.qzeros: copying a param with shape torch.Size([48, 4608]) from checkpoint, the shape in current model is torch.Size([48, 2304]).
    size mismatch for transformer.h.33.mlp.fc_in.qweight: copying a param with shape torch.Size([1536, 24576]) from checkpoint, the shape in current model is torch.Size([768, 24576]).
    size mismatch for transformer.h.33.mlp.fc_in.qzeros: copying a param with shape torch.Size([48, 6144]) from checkpoint, the shape in current model is torch.Size([48, 3072]).
    size mismatch for transformer.h.33.mlp.fc_out.qweight: copying a param with shape torch.Size([6144, 6144]) from checkpoint, the shape in current model is torch.Size([3072, 6144]).
    size mismatch for transformer.h.33.mlp.fc_out.qzeros: copying a param with shape torch.Size([192, 1536]) from checkpoint, the shape in current model is torch.Size([192, 768]).
    You may consider adding `ignore_mismatched_sizes=True` in the model `from_pretrained` method.
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 762) of binary: /opt/conda/envs/moss/bin/python3.8
Traceback (most recent call last):
  File "/opt/conda/envs/moss/bin/accelerate", line 8, in <module>
    sys.exit(main())
  File "/opt/conda/envs/moss/lib/python3.8/site-packages/accelerate/commands/accelerate_cli.py", line 45, in main
    args.func(args)
  File "/opt/conda/envs/moss/lib/python3.8/site-packages/accelerate/commands/launch.py", line 909, in launch_command
    multi_gpu_launcher(args)
  File "/opt/conda/envs/moss/lib/python3.8/site-packages/accelerate/commands/launch.py", line 604, in multi_gpu_launcher
    distrib_run.run(args)
  File "/opt/conda/envs/moss/lib/python3.8/site-packages/torch/distributed/run.py", line 753, in run
    elastic_launch(
  File "/opt/conda/envs/moss/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 132, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/opt/conda/envs/moss/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 246, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError: 

Is this a shape mismatch? What's going on here?
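
The mismatched shapes all differ by exactly a factor of two, which is consistent with the checkpoint and the current model disagreeing on the quantization bit width: GPTQ-style quantized layers pack in_features * wbits / 32 rows into each int32 qweight. A rough arithmetic check, assuming MOSS's hidden size of 6144; the candidate wbits values (e.g. an 8-bit checkpoint loaded into a model configured for 4 bits) are assumptions:

```python
# Sanity-check the reported qweight shapes under GPTQ-style int32 packing,
# where qweight has in_features * wbits / 32 rows. hidden = 6144 is the
# MOSS-16B hidden size; the candidate wbits values are assumptions.
hidden = 6144
qkv_out = 3 * hidden                      # 18432, matches qkv_proj's second dim

for wbits in (4, 8):
    packed_rows = hidden * wbits // 32
    print(f"wbits={wbits}: qkv_proj.qweight -> ({packed_rows}, {qkv_out})")

# wbits=4 -> (768, 18432)    # what the current model expects
# wbits=8 -> (1536, 18432)   # what the checkpoint actually contains
```

If that is the cause, the fix would be to make wbits (and groupsize) in the model config match the checkpoint, rather than passing ignore_mismatched_sizes=True.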

lhtpluto commented 1 year ago

> deepspeed_config: zero_stage: 3 — MOSS's fine-tuning config uses DeepSpeed ZeRO stage 3, so the model is sharded across all the GPUs rather than fully replicated on each card. With batch_size = 1 on 8 cards, peak usage on every card goes above 70 GB. The official repo hasn't open-sourced int8 fine-tuning; I've adapted it, and int8 fine-tuning works now.

Does this mean int8 fine-tuning needs 8 × 80 GB of VRAM? Would 8 × 48 GB be enough to fine-tune?

yangzhipeng1108 commented 1 year ago

> Does this mean int8 fine-tuning needs 8 × 80 GB of VRAM? Would 8 × 48 GB be enough to fine-tune?

Full-precision fine-tuning of MOSS needs that much; int8 fine-tuning doesn't.
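
A rough back-of-the-envelope figure, under the assumptions of 16B parameters, fp16 weights and gradients, fp32 Adam states, ZeRO stage-3 sharding over 8 GPUs, and no activations or buffers counted, suggests why full-precision fine-tuning already needs large cards before activations are even added:

```python
# Back-of-the-envelope per-GPU memory for full-precision (fp16 + Adam)
# fine-tuning of a ~16B-parameter model under ZeRO stage 3 on 8 GPUs.
# Activations, communication buffers and fragmentation are NOT counted,
# so the real peak (reported above as 70+ GB per card) is well above this.
params = 16e9
bytes_per_param = 2 + 2 + 12       # fp16 weights + fp16 grads + fp32 Adam (master, m, v)
num_gpus = 8

per_gpu_gb = params * bytes_per_param / num_gpus / 1024**3
print(f"~{per_gpu_gb:.0f} GB per GPU before activations")   # roughly 30 GB
```

Quantizing the weights to int8 shrinks the weight footprint, which is consistent with the comment above that int8 fine-tuning does not require 80 GB cards.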