OpenMOSS / MOSS

An open-source tool-augmented conversational language model from Fudan University
https://txsun1997.github.io/blogs/moss.html
Apache License 2.0
11.9k stars · 1.14k forks

Finetune reports out of memory #297

Open Tian14267 opened 1 year ago

Tian14267 commented 1 year ago

Hi all. I'm fine-tuning on 4 A100s with batch = 1, but I still get an out-of-memory error. What could be the cause?

Tian14267 commented 1 year ago

Setting max_len = 1024 also runs out of memory. (screenshot attached)

yangzhipeng1108 commented 1 year ago

You need 8 A100s. You can refer to this; int8 fine-tuning also works now: https://github.com/yangzhipeng1108/moss-finetune-and-moss-finetune-int8

Tian14267 commented 1 year ago

@yangzhipeng1108 Quick question: can a single A100 hold the full model? If it can, why aren't 4 cards enough? I don't quite follow. Also, can int8 be fine-tuned? I'll give it a try.

yangzhipeng1108 commented 1 year ago

deepspeed_config: zero_stage: 3 — MOSS's fine-tuning config uses DeepSpeed ZeRO stage 3, so the model is sharded across all the GPUs rather than fully replicated on each card. With batch_size = 1 on 8 cards, peak usage on every card goes above 70 GB. The official repo hasn't open-sourced int8 fine-tuning; I've adapted it, and int8 fine-tuning works now.
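
The `deepspeed_config: zero_stage: 3` fragment is the Accelerate/DeepSpeed setting that enables this sharding. A minimal sketch of the same ZeRO stage-3 setup expressed in Python — assuming Hugging Face Accelerate is the launcher, and with illustrative values for mixed precision and gradient accumulation (this is not the repo's actual training script):

```python
# Illustrative ZeRO stage-3 setup via Accelerate's DeepSpeedPlugin.
# Not MOSS's actual finetune script; mixed_precision and
# gradient_accumulation_steps here are assumptions.
from accelerate import Accelerator
from accelerate.utils import DeepSpeedPlugin

ds_plugin = DeepSpeedPlugin(
    zero_stage=3,                   # shard params, gradients and optimizer states across ranks
    gradient_accumulation_steps=1,
)
accelerator = Accelerator(deepspeed_plugin=ds_plugin, mixed_precision="fp16")
# model, optimizer, loader = accelerator.prepare(model, optimizer, loader)
```

With stage 3, each rank holds only a shard of the parameters and optimizer states, which is why no single card ever needs to fit the full model on its own.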

Tian14267 commented 1 year ago

@yangzhipeng1108 I see what you mean. So for actual training the model still has to be sharded across the cards, right? MOSS hasn't open-sourced that part either. Do you happen to have an implementation of it? For now int8 seems to be the only option.

yangzhipeng1108 commented 1 year ago

The official fine-tuning can be run as long as you have 8 A100s or A800s. After the official fine-tuning, fp16 inference with the 16B model was too slow, which is why I looked into int8; in my project int8 fine-tuning works.

Tian14267 commented 1 year ago

@yangzhipeng1108 Got it. By the way, when training with int8 I ran into a problem running run_int8_acc.sh:

Traceback (most recent call last):
  File "finetune_moss_int8_acc.py", line 319, in <module>
    train(args)           
  File "finetune_moss_int8_acc.py", line 177, in train
    model = AutoModelForCausalLM.from_pretrained(args.model_name_or_path, trust_remote_code=True, use_cache=False)
  File "/opt/conda/envs/moss/lib/python3.8/site-packages/transformers/models/auto/auto_factory.py", line 458, in from_pretrained
    return model_class.from_pretrained(
  File "/opt/conda/envs/moss/lib/python3.8/site-packages/transformers/modeling_utils.py", line 2276, in from_pretrained
    model = cls(config, *model_args, **model_kwargs)
  File "/root/.cache/huggingface/modules/transformers_modules/local/modeling_moss.py", line 608, in __init__
    self.quantize(config.wbits, config.groupsize)
  File "/root/.cache/huggingface/modules/transformers_modules/local/modeling_moss.py", line 732, in quantize
    from .quantization import quantize_with_gptq
  File "/root/.cache/huggingface/modules/transformers_modules/local/quantization.py", line 8, in <module>
    from .custom_autotune import *
ModuleNotFoundError: No module named 'transformers_modules.local.custom_autotune'
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 39133) of binary: /opt/conda/envs/moss/bin/python3.8
Traceback (most recent call last):
  File "/opt/conda/envs/moss/bin/accelerate", line 8, in <module>
    sys.exit(main())
  File "/opt/conda/envs/moss/lib/python3.8/site-packages/accelerate/commands/accelerate_cli.py", line 45, in main
    args.func(args)
  File "/opt/conda/envs/moss/lib/python3.8/site-packages/accelerate/commands/launch.py", line 909, in launch_command
    multi_gpu_launcher(args)
  File "/opt/conda/envs/moss/lib/python3.8/site-packages/accelerate/commands/launch.py", line 604, in multi_gpu_launcher
    distrib_run.run(args)
  File "/opt/conda/envs/moss/lib/python3.8/site-packages/torch/distributed/run.py", line 753, in run
    elastic_launch(
  File "/opt/conda/envs/moss/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 132, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/opt/conda/envs/moss/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 246, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError: 

Do you know what the problem is?

yangzhipeng1108 commented 1 year ago

MOSS fails to copy custom_autotune into /root/.cache/huggingface/modules/transformers_modules/local/. Just copy custom_autotune there manually and it will work.
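
A minimal sketch of that manual workaround, assuming the model is loaded from a local directory; the source path below is a placeholder to adjust:

```python
# Manually copy custom_autotune.py into the transformers_modules cache so that
# `from .custom_autotune import *` in quantization.py can resolve.
# The source path is a placeholder -- point it at the custom_autotune.py that
# ships with your local MOSS quantized checkpoint directory.
import shutil

src = "/path/to/local/moss-model-dir/custom_autotune.py"   # placeholder
dst = "/root/.cache/huggingface/modules/transformers_modules/local/custom_autotune.py"
shutil.copy(src, dst)
```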

Tian14267 commented 1 year ago

@yangzhipeng1108 Hi, I keep getting the error below:


    size mismatch for transformer.h.31.attn.qkv_proj.qweight: copying a param with shape torch.Size([1536, 18432]) from checkpoint, the shape in current model is torch.Size([768, 18432]).
    size mismatch for transformer.h.31.attn.qkv_proj.qzeros: copying a param with shape torch.Size([48, 4608]) from checkpoint, the shape in current model is torch.Size([48, 2304]).
    size mismatch for transformer.h.31.mlp.fc_in.qweight: copying a param with shape torch.Size([1536, 24576]) from checkpoint, the shape in current model is torch.Size([768, 24576]).
    size mismatch for transformer.h.31.mlp.fc_in.qzeros: copying a param with shape torch.Size([48, 6144]) from checkpoint, the shape in current model is torch.Size([48, 3072]).
    size mismatch for transformer.h.31.mlp.fc_out.qweight: copying a param with shape torch.Size([6144, 6144]) from checkpoint, the shape in current model is torch.Size([3072, 6144]).
    size mismatch for transformer.h.31.mlp.fc_out.qzeros: copying a param with shape torch.Size([192, 1536]) from checkpoint, the shape in current model is torch.Size([192, 768]).
    size mismatch for transformer.h.32.attn.out_proj.qweight: copying a param with shape torch.Size([1536, 6144]) from checkpoint, the shape in current model is torch.Size([768, 6144]).
    size mismatch for transformer.h.32.attn.out_proj.qzeros: copying a param with shape torch.Size([48, 1536]) from checkpoint, the shape in current model is torch.Size([48, 768]).
    size mismatch for transformer.h.32.attn.qkv_proj.qweight: copying a param with shape torch.Size([1536, 18432]) from checkpoint, the shape in current model is torch.Size([768, 18432]).
    size mismatch for transformer.h.32.attn.qkv_proj.qzeros: copying a param with shape torch.Size([48, 4608]) from checkpoint, the shape in current model is torch.Size([48, 2304]).
    size mismatch for transformer.h.32.mlp.fc_in.qweight: copying a param with shape torch.Size([1536, 24576]) from checkpoint, the shape in current model is torch.Size([768, 24576]).
    size mismatch for transformer.h.32.mlp.fc_in.qzeros: copying a param with shape torch.Size([48, 6144]) from checkpoint, the shape in current model is torch.Size([48, 3072]).
    size mismatch for transformer.h.32.mlp.fc_out.qweight: copying a param with shape torch.Size([6144, 6144]) from checkpoint, the shape in current model is torch.Size([3072, 6144]).
    size mismatch for transformer.h.32.mlp.fc_out.qzeros: copying a param with shape torch.Size([192, 1536]) from checkpoint, the shape in current model is torch.Size([192, 768]).
    size mismatch for transformer.h.33.attn.out_proj.qweight: copying a param with shape torch.Size([1536, 6144]) from checkpoint, the shape in current model is torch.Size([768, 6144]).
    size mismatch for transformer.h.33.attn.out_proj.qzeros: copying a param with shape torch.Size([48, 1536]) from checkpoint, the shape in current model is torch.Size([48, 768]).
    size mismatch for transformer.h.33.attn.qkv_proj.qweight: copying a param with shape torch.Size([1536, 18432]) from checkpoint, the shape in current model is torch.Size([768, 18432]).
    size mismatch for transformer.h.33.attn.qkv_proj.qzeros: copying a param with shape torch.Size([48, 4608]) from checkpoint, the shape in current model is torch.Size([48, 2304]).
    size mismatch for transformer.h.33.mlp.fc_in.qweight: copying a param with shape torch.Size([1536, 24576]) from checkpoint, the shape in current model is torch.Size([768, 24576]).
    size mismatch for transformer.h.33.mlp.fc_in.qzeros: copying a param with shape torch.Size([48, 6144]) from checkpoint, the shape in current model is torch.Size([48, 3072]).
    size mismatch for transformer.h.33.mlp.fc_out.qweight: copying a param with shape torch.Size([6144, 6144]) from checkpoint, the shape in current model is torch.Size([3072, 6144]).
    size mismatch for transformer.h.33.mlp.fc_out.qzeros: copying a param with shape torch.Size([192, 1536]) from checkpoint, the shape in current model is torch.Size([192, 768]).
    You may consider adding `ignore_mismatched_sizes=True` in the model `from_pretrained` method.
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 762) of binary: /opt/conda/envs/moss/bin/python3.8
Traceback (most recent call last):
  File "/opt/conda/envs/moss/bin/accelerate", line 8, in <module>
    sys.exit(main())
  File "/opt/conda/envs/moss/lib/python3.8/site-packages/accelerate/commands/accelerate_cli.py", line 45, in main
    args.func(args)
  File "/opt/conda/envs/moss/lib/python3.8/site-packages/accelerate/commands/launch.py", line 909, in launch_command
    multi_gpu_launcher(args)
  File "/opt/conda/envs/moss/lib/python3.8/site-packages/accelerate/commands/launch.py", line 604, in multi_gpu_launcher
    distrib_run.run(args)
  File "/opt/conda/envs/moss/lib/python3.8/site-packages/torch/distributed/run.py", line 753, in run
    elastic_launch(
  File "/opt/conda/envs/moss/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 132, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/opt/conda/envs/moss/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 246, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError: 

Is this a shape mismatch? What's going on here?
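
The mismatched shapes all differ by exactly a factor of two, which is consistent with the checkpoint and the current model disagreeing on the quantization bit width: GPTQ-style quantized layers pack in_features * wbits / 32 rows into each int32 qweight. A rough arithmetic check, assuming MOSS's hidden size of 6144; the candidate wbits values (e.g. an 8-bit checkpoint loaded into a model configured for 4 bits) are assumptions:

```python
# Sanity-check the reported qweight shapes under GPTQ-style int32 packing,
# where qweight has in_features * wbits / 32 rows. hidden = 6144 is the
# MOSS-16B hidden size; the candidate wbits values are assumptions.
hidden = 6144
qkv_out = 3 * hidden                      # 18432, matches qkv_proj's second dim

for wbits in (4, 8):
    packed_rows = hidden * wbits // 32
    print(f"wbits={wbits}: qkv_proj.qweight -> ({packed_rows}, {qkv_out})")

# wbits=4 -> (768, 18432)    # what the current model expects
# wbits=8 -> (1536, 18432)   # what the checkpoint actually contains
```

If that is the cause, the fix would be to make wbits (and groupsize) in the model config match the checkpoint, rather than passing ignore_mismatched_sizes=True.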

lhtpluto commented 1 year ago

> deepspeed_config: zero_stage: 3 — MOSS's fine-tuning config uses DeepSpeed ZeRO stage 3, so the model is sharded across all the GPUs rather than fully replicated on each card. With batch_size = 1 on 8 cards, peak usage on every card goes above 70 GB. The official repo hasn't open-sourced int8 fine-tuning; I've adapted it, and int8 fine-tuning works now.

Does this mean int8 fine-tuning needs 8 × 80 GB of VRAM? Would 8 × 48 GB be enough to fine-tune?

yangzhipeng1108 commented 1 year ago

> Does this mean int8 fine-tuning needs 8 × 80 GB of VRAM? Would 8 × 48 GB be enough to fine-tune?

Full-precision fine-tuning of MOSS needs that much; int8 fine-tuning doesn't.
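
A rough back-of-the-envelope figure, under the assumptions of 16B parameters, fp16 weights and gradients, fp32 Adam states, ZeRO stage-3 sharding over 8 GPUs, and no activations or buffers counted, suggests why full-precision fine-tuning already needs large cards before activations are even added:

```python
# Back-of-the-envelope per-GPU memory for full-precision (fp16 + Adam)
# fine-tuning of a ~16B-parameter model under ZeRO stage 3 on 8 GPUs.
# Activations, communication buffers and fragmentation are NOT counted,
# so the real peak (reported above as 70+ GB per card) is well above this.
params = 16e9
bytes_per_param = 2 + 2 + 12       # fp16 weights + fp16 grads + fp32 Adam (master, m, v)
num_gpus = 8

per_gpu_gb = params * bytes_per_param / num_gpus / 1024**3
print(f"~{per_gpu_gb:.0f} GB per GPU before activations")   # roughly 30 GB
```

Quantizing the weights to int8 shrinks the weight footprint, which is consistent with the comment above that int8 fine-tuning does not require 80 GB cards.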