OOM when training a 7B model on a single A100

HuihuiChyan commented 1 year ago

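I am trying to instruction-tune the 7B model following llama_7b_zh_instruct_coig_sft_v1_0_ds.yaml, with batch_size set to 1 and sequence_length set to 512 and no other settings changed, but it runs out of GPU memory. Are there any possible solutions?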

SparkJiao commented 1 year ago

First, try CPU offload under ZeRO-1:

zero_optimization:
    stage: 1
    contiguous_gradients: True
    overlap_comm: True
    reduce_scatter: True
    reduce_bucket_size: 5e8
    allgather_bucket_size: 5e8
    offload_optimizer:
      device: cpu
      pin_memory: True

If that still does not work, change stage to 3 and enable offload_param:

  zero_optimization:
    stage: 3
    contiguous_gradients: True
    overlap_comm: True
    reduce_scatter: True
    reduce_bucket_size: 5e8
    allgather_bucket_size: 5e8
    offload_optimizer:
      device: cpu
      pin_memory: True
    offload_param:
      device: cpu
      pin_memory: True

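Note that training may become very slow with this setup, depending on the communication bandwidth between the GPU and CPU on your machine.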
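
When pasting these blocks into a larger config, it is worth confirming that `offload_optimizer` and `offload_param` sit inside `zero_optimization` rather than beside it, since DeepSpeed documents them as ZeRO sub-options. Below is a minimal sketch of such a check, assuming the DeepSpeed section has already been loaded into a plain Python dict. The helper names `misplaced_offload_keys` and `nested_offload_keys` are made up for illustration and are not part of DeepSpeed.

```python
OFFLOAD_KEYS = ("offload_optimizer", "offload_param")


def misplaced_offload_keys(ds_cfg: dict) -> list:
    """Return offload keys found at the top level of the DeepSpeed config
    instead of inside its `zero_optimization` section."""
    return [key for key in OFFLOAD_KEYS if key in ds_cfg]


def nested_offload_keys(ds_cfg: dict) -> list:
    """Return offload keys that are nested under `zero_optimization`."""
    zero = ds_cfg.get("zero_optimization") or {}
    return [key for key in OFFLOAD_KEYS if key in zero]


# Illustrative configs only; both are hypothetical examples.
offload = {"device": "cpu", "pin_memory": True}
good = {
    "zero_optimization": {
        "stage": 3,
        "offload_optimizer": offload,
        "offload_param": offload,
    }
}
bad = {
    "zero_optimization": {"stage": 3},
    "offload_optimizer": offload,  # sibling of zero_optimization, not a child
    "offload_param": offload,
}

for name, cfg in (("good", good), ("bad", bad)):
    print(name, "nested:", nested_offload_keys(cfg),
          "misplaced:", misplaced_offload_keys(cfg))
```

If the second list is non-empty, the offload settings are probably not where the ZeRO section expects them, and it may be worth re-checking the YAML indentation.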

HuihuiChyan commented 1 year ago

I enabled both offloads and it still does not work... Could it be related to using a single GPU?

SparkJiao commented 1 year ago

How much GPU memory do you have? 40 GB? Could you post your config and the command you use to launch training?

HuihuiChyan commented 1 year ago

My GPU is an 80 GB A100. The config is as follows:

hydra:
  run:
    dir: ./

# Wiki path pretrain v8.2
model_name_or_path: /mnt/bn/slp-llm/sft_huihui/pandallm/llama-panda-zh-7b
pretrain:

aws_output_bucket:
data_dir:

train_file: /opt/ml/input/data/train/coig.json
dev_file:
test_file:

# Model
model:
  _target_: models.llama.LlamaForConditionalGeneration.from_pretrained
  vocab_size: 32001
  pad_token_id: 32000
  use_peft: False
  gradient_checkpointing: True

# model_eval:
#   _target_: models.llama.LlamaForConditionalGenerationFlan.from_pretrained_peft_eval
#   base_model_name_or_path: ${model_name_or_path}

# Data loading
read_tensor:
  _target_: data.collators.zh_instruct.TextDatasetUnify

extended_vocab:

# Data collator
collator:
  _target_: data.collators.flan.FlanCollatorOverCollator
  collator:
  max_seq_length: 512
  tokenizer: ${model_name_or_path}
  decoder_only: True

# Dataloader
num_workers: 4
prefetch_factor: 2

do_preprocess: False

exp_name: llama.7b.zh_instruct.10M.coig.sft.v1.0.seq1024.w8.adamw.NA100.0428.ds
exp_notes:
output_dir: ./${exp_name}
resume:

do_train: True
evaluate_during_training: False

do_eval: False
eval_sub_path: checkpoint-*

# Training hyper-parameters
per_gpu_train_batch_size: 1
per_gpu_eval_batch_size: 1
learning_rate: 3e-5
gradient_accumulation_steps: 2
weight_decay: 0.00
adam_epsilon: 1e-6
adam_betas: "(0.9, 0.99)"
max_grad_norm: 5.0
num_train_epochs: 5
total_dataset_len: -1
max_steps: 0
warmup_proportion: 0.01
warmup_steps: 0

# Optimizer
optimizer:
use_nvlamb:
bit_training:

logging_steps: 1
save_best: False
save_steps: 250
eval_steps: 250
ddp_eval: True
no_cuda: False
seed: 42
local_rank: -1
fp16: True
fp16_opt_level: O1
fp16_bfloat16: True

# Prediction config
prediction_cfg:
  metric: "acc"
  measure: 1
  best_checkpoint:
  best_result:
eval_forward_fn:
  _target_: general_util.evaluator.DiscriminatorForwardFn
post_process:

# fairscale.FullyShardedDP
fairscale_config:
  _target_: general_util.fsdp_utils.default_initialize
  # _target_: general_util.fsdp_utils.recursive_initialize
  # _target_: general_util.fsdp_utils.default_initialize_v2
  # _target_: general_util.torch_fsdp_utils.torch_fsdp_transformer_init
  # _target_: general_util.torch_fsdp_utils.torch_fsdp_auto_wrap
  fp16: ${fp16}
  move_grads_to_cpu: False
  move_params_to_cpu: False
  flatten_parameters: False
  # fp16_bfloat16: ${fp16_bfloat16}
  # cpu_offload: True
  # disable_reshard_on_root: False

# Lightseq config
with_lightseq: False

# Deepspeed config
ds_cfg:
  train_micro_batch_size_per_gpu: ${per_gpu_train_batch_size}
  gradient_accumulation_steps: ${gradient_accumulation_steps}
  optimizer:
    type: AdamW
    params:
      lr: ${learning_rate}
      betas: [0.9, 0.99]
      eps: ${adam_epsilon}
      weight_decay: ${weight_decay}
  scheduler:
    type: WarmupDecayLR
    params:
      total_num_steps:
      warmup_max_lr: ${learning_rate}
      warmup_num_steps:
      warmup_type: linear
  gradient_clipping: ${max_grad_norm}
  # fp16:
  #   enabled: ${fp16}
  #   initial_scale_power: 12
  bf16:
    enabled: ${fp16}
  # autotuning:
  #   enabled: true
  #   arg_mappings:
  #     train_micro_batch_size_per_gpu: "per_gpu_train_batch_size"
  #     gradient_accumulation_steps: "gradient_accumulation_steps"
  #     zero_optimization: "ds_cfg.zero_optimization"
  zero_optimization:
    stage: 3
    contiguous_gradients: True
    overlap_comm: True
    reduce_scatter: True
    reduce_bucket_size: 5e8
    allgather_bucket_size: 5e8
  offload_optimizer:
    device: cpu
    pin_memory: True
  offload_param:
    device: cpu
    pin_memory: True
  # activation_checkpointing:
  #   partition_activations: True
  #   cpu_checkpointing: True
  #   contiguous_memory_optimization: False
  #   number_checkpoints: False
  #   synchronize_checkpoint_boundary: False
  #   profile: False
  steps_per_print: 1024

summary_helper:
#  _target_: general_util.tensorboard_helper.SummaryWriterHelper
  _target_: general_util.tensorboard_helper.WandbWriter
  batch_index_or_keys:
#    "train/pair_value_num": pair_value_num
#    "train/pair_label_num": pair_label_num
#    "train/dropped_op_cnt": dropped_op_cnt
#    "train/invalid_path": invalid_path
  outputs_index_or_keys:
#    "train/mlm_loss": mlm_loss
#    "train/cls_loss": cls_loss
#    "train/tagging_loss": tagging_loss
#    "train/path_gen_loss": path_gen_loss

# Temporary variables
n_gpu:
device:
train_batch_size:
eval_batch_size:
world_size:
world_rank:

The training launch command is as follows:

export HYDRA_FULL_ERROR=1
export WANDB_MODE=dryrun
export WANDB_SILENT=true
deepspeed --include localhost:0 \
    trainer_base_ds_mul.py \
    -cp conf/llama/zh \
    -cn llama_7b_zh_instruct_coig_sft_v1_0_ds.yaml

SparkJiao commented 1 year ago

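I really do not see anything wrong with this config. Could you check that GPU 0 on your server has enough free memory and is not being used by anyone else? In principle, ZeRO-3 with CPU offload should be able to fine-tune the model within 40 GB of GPU memory.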
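
The 40 GB figure can be sanity-checked with back-of-envelope arithmetic. The sketch below is only a rough estimate under stated assumptions: about 6.74e9 parameters for a 7B LLaMA model (an approximation), bf16 weights and gradients at 2 bytes each, and an Adam-style optimizer keeping an fp32 master copy plus two fp32 moment buffers (4 bytes each). Activations and temporary buffers are ignored, and `gpu_resident_gib` is a hypothetical helper, not a DeepSpeed function.

```python
GIB = 2 ** 30
N_PARAMS = 6.74e9  # assumption: approximate parameter count of a 7B LLaMA model

BYTES_BF16 = 2          # one bf16 weight or gradient
BYTES_OPTIM = 4 + 4 + 4  # fp32 master weight + Adam first and second moments


def gpu_resident_gib(n_params: float, offload_optimizer: bool) -> float:
    """Rough GPU footprint in GiB on a single GPU (no sharding across ranks).

    With optimizer offload, the fp32 master weights and Adam moments are
    assumed to live in CPU memory; bf16 weights and gradients stay on the GPU.
    """
    on_gpu = (BYTES_BF16 + BYTES_BF16) * n_params
    if not offload_optimizer:
        on_gpu += BYTES_OPTIM * n_params
    return on_gpu / GIB


print(f"no offload:        {gpu_resident_gib(N_PARAMS, False):.1f} GiB")
print(f"optimizer offload: {gpu_resident_gib(N_PARAMS, True):.1f} GiB")
```

Under these assumptions, the run without offload needs more than an 80 GB card can hold, while the run with optimizer offload stays well under 40 GB before activations. With a single GPU there are no other ranks to shard across, so offload is the only thing that moves these states off the device.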

SparkJiao commented 1 year ago

Also, if it is convenient, please post your error message.

HuihuiChyan commented 1 year ago

Sure, the training log is as follows:

[2023-05-29 19:48:32,312] [WARNING] [runner.py:186:fetch_hostfile] Unable to find hostfile, will proceed with training with local resources only.
[2023-05-29 19:48:32,326] [INFO] [runner.py:550:main] cmd = /root/miniconda3/bin/python -u -m deepspeed.launcher.launch --world_info=eyJsb2NhbGhvc3QiOiBbMF19 --master_addr=127.0.0.1 --master_port=29500 --enable_each_rank_log=None trainer_base_ds_mul.py -cp conf/llama/zh -cn llama_7b_zh_instruct_coig_sft_v1_0_ds.yaml
[2023-05-29 19:48:33,790] [INFO] [launch.py:135:main] 0 LAB_PYTORCH_NCCL_SCM_VERSION=1.0.0.1
[2023-05-29 19:48:33,790] [INFO] [launch.py:142:main] WORLD INFO DICT: {'localhost': [0]}
[2023-05-29 19:48:33,790] [INFO] [launch.py:148:main] nnodes=1, num_local_procs=1, node_rank=0
[2023-05-29 19:48:33,790] [INFO] [launch.py:161:main] global_rank_mapping=defaultdict(<class 'list'>, {'localhost': [0]})
[2023-05-29 19:48:33,790] [INFO] [launch.py:162:main] dist_world_size=1
[2023-05-29 19:48:33,790] [INFO] [launch.py:164:main] Setting CUDA_VISIBLE_DEVICES=0
['trainer_base_ds_mul.py', 'local_rank=0', '-cp', 'conf/llama/zh', '-cn', 'llama_7b_zh_instruct_coig_sft_v1_0_ds.yaml']
[2023-05-29 19:48:36,856] [INFO] [comm.py:652:init_distributed] Initializing TorchBackend in DeepSpeed with backend nccl
[2023-05-29 19:48:36,856][torch.distributed.distributed_c10d][INFO] - Added key: store_based_barrier_key:1 to store for rank: 0
[2023-05-29 19:48:36,857][torch.distributed.distributed_c10d][INFO] - Rank 0: Completed store-based barrier for key:store_based_barrier_key:1 with 1 nodes.
[2023-05-29 19:48:36,859][FK][WARNING] - Process rank: 0, device: cuda:0, n_gpu: 1, distributed training: True, 16-bits training: True
[2023-05-29 19:48:36,859][FK][WARNING] - CPU cores: 128
[2023-05-29 19:51:28,268][FK.general_util.tokenization_utils][INFO] - LlamaTokenizerFast(name_or_path='/mnt/bn/slp-llm/sft_huihui/pandallm/llama-panda-zh-7b', vocab_size=32000, model_max_length=1000000000000000019884624838656, is_fast=True, padding_side='left', truncation_side='right', special_tokens={'bos_token': '<s>', 'eos_token': '</s>', 'unk_token': '<unk>', 'pad_token': '[PAD]'}, clean_up_tokenization_spaces=False)
[2023-05-29 19:51:28,269][FK.general_util.tokenization_utils][INFO] - PAD TOKEN ID = 32000
[2023-05-29 19:52:11,607][FK.models.llama][INFO] - gradient_checkpointing: True
Loading checkpoint shards: 100%|████████████████████████████████████████████████████████████████████████| 3/3 [00:21<00:00,  7.25s/it]
[2023-05-29 19:52:34,217][FK.models.llama][INFO] - Config pad token id after loading pre-trained weights: 32000
[2023-05-29 19:52:40,317][FK.TensorboardHelper][INFO] - Logs details:
[2023-05-29 19:52:40,317][FK.TensorboardHelper][INFO] - None
[2023-05-29 19:52:40,317][FK.TensorboardHelper][INFO] - None
[2023-05-29 19:52:40,317][FK][INFO] - []
0it [00:00, ?it/s]
[2023-05-29 19:52:40,466] [INFO] [logging.py:93:log_dist] [Rank 0] DeepSpeed info: version=0.8.3, git-hash=unknown, git-branch=unknown
[2023-05-29 19:52:53,269][torch.distributed.distributed_c10d][INFO] - Added key: store_based_barrier_key:2 to store for rank: 0
[2023-05-29 19:52:53,272][torch.distributed.distributed_c10d][INFO] - Rank 0: Completed store-based barrier for key:store_based_barrier_key:2 with 1 nodes.
[2023-05-29 19:52:53,338] [INFO] [logging.py:93:log_dist] [Rank 0] DeepSpeed Flops Profiler Enabled: False
Installed CUDA version 11.3 does not match the version torch was compiled with 11.7 but since the APIs are compatible, accepting this combination
Using /root/.cache/torch_extensions/py310_cu117 as PyTorch extensions root...
Detected CUDA files, patching ldflags
Emitting ninja build file /root/.cache/torch_extensions/py310_cu117/fused_adam/build.ninja...
Building extension module fused_adam...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
ninja: no work to do.
Loading extension module fused_adam...
Time to load fused_adam op: 0.1588904857635498 seconds
[2023-05-29 19:52:53,863] [INFO] [logging.py:93:log_dist] [Rank 0] Using DeepSpeed Optimizer param name adamw as basic optimizer
[2023-05-29 19:52:53,875] [INFO] [logging.py:93:log_dist] [Rank 0] DeepSpeed Basic Optimizer = FusedAdam
[2023-05-29 19:52:53,875] [INFO] [utils.py:55:is_zero_supported_optimizer] Checking ZeRO support for optimizer=FusedAdam type=<class 'deepspeed.ops.adam.fused_adam.FusedAdam'>
[2023-05-29 19:52:53,875] [INFO] [logging.py:93:log_dist] [Rank 0] Creating torch.bfloat16 ZeRO stage 3 optimizer
[2023-05-29 19:52:53,979] [INFO] [utils.py:829:see_memory_usage] Stage 3 initialize beginning
[2023-05-29 19:52:53,980] [INFO] [utils.py:830:see_memory_usage] MA 12.58 GB         Max_MA 12.58 GB         CA 12.59 GB         Max_CA 13 GB 
[2023-05-29 19:52:53,980] [INFO] [utils.py:838:see_memory_usage] CPU Virtual Memory:  used = 79.61 GB, percent = 4.0%
[2023-05-29 19:52:53,981] [INFO] [stage3.py:113:__init__] Reduce bucket size 500000000
[2023-05-29 19:52:53,982] [INFO] [stage3.py:114:__init__] Prefetch bucket size 50,000,000
Using /root/.cache/torch_extensions/py310_cu117 as PyTorch extensions root...
Emitting ninja build file /root/.cache/torch_extensions/py310_cu117/utils/build.ninja...
Building extension module utils...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
ninja: no work to do.
Loading extension module utils...
Time to load utils op: 0.14214134216308594 seconds
[2023-05-29 19:52:54,227] [INFO] [utils.py:829:see_memory_usage] DeepSpeedZeRoOffload initialize [begin]
[2023-05-29 19:52:54,228] [INFO] [utils.py:830:see_memory_usage] MA 12.58 GB         Max_MA 12.58 GB         CA 12.59 GB         Max_CA 13 GB 
[2023-05-29 19:52:54,228] [INFO] [utils.py:838:see_memory_usage] CPU Virtual Memory:  used = 79.88 GB, percent = 4.0%
Parameter Offload: Total persistent parameters: 266240 in 65 params
[2023-05-29 19:52:54,373] [INFO] [utils.py:829:see_memory_usage] DeepSpeedZeRoOffload initialize [end]
[2023-05-29 19:52:54,374] [INFO] [utils.py:830:see_memory_usage] MA 12.58 GB         Max_MA 12.83 GB         CA 13.08 GB         Max_CA 13 GB 
[2023-05-29 19:52:54,374] [INFO] [utils.py:838:see_memory_usage] CPU Virtual Memory:  used = 80.02 GB, percent = 4.0%
[2023-05-29 19:52:54,471] [INFO] [utils.py:829:see_memory_usage] Before creating fp16 partitions
[2023-05-29 19:52:54,472] [INFO] [utils.py:830:see_memory_usage] MA 12.58 GB         Max_MA 12.58 GB         CA 13.08 GB         Max_CA 13 GB 
[2023-05-29 19:52:54,472] [INFO] [utils.py:838:see_memory_usage] CPU Virtual Memory:  used = 80.1 GB, percent = 4.0%
[2023-05-29 19:53:06,285] [INFO] [utils.py:829:see_memory_usage] After creating fp16 partitions: 7
[2023-05-29 19:53:06,286] [INFO] [utils.py:830:see_memory_usage] MA 12.58 GB         Max_MA 12.58 GB         CA 12.59 GB         Max_CA 13 GB 
[2023-05-29 19:53:06,287] [INFO] [utils.py:838:see_memory_usage] CPU Virtual Memory:  used = 81.14 GB, percent = 4.0%
[2023-05-29 19:53:06,386] [INFO] [utils.py:829:see_memory_usage] Before creating fp32 partitions
[2023-05-29 19:53:06,387] [INFO] [utils.py:830:see_memory_usage] MA 12.58 GB         Max_MA 12.58 GB         CA 12.59 GB         Max_CA 13 GB 
[2023-05-29 19:53:06,387] [INFO] [utils.py:838:see_memory_usage] CPU Virtual Memory:  used = 81.15 GB, percent = 4.0%
[2023-05-29 19:53:06,539] [INFO] [utils.py:829:see_memory_usage] After creating fp32 partitions
[2023-05-29 19:53:06,540] [INFO] [utils.py:830:see_memory_usage] MA 37.69 GB         Max_MA 38.94 GB         CA 41.47 GB         Max_CA 41 GB 
[2023-05-29 19:53:06,540] [INFO] [utils.py:838:see_memory_usage] CPU Virtual Memory:  used = 81.16 GB, percent = 4.0%
[2023-05-29 19:53:06,641] [INFO] [utils.py:829:see_memory_usage] Before initializing optimizer states
[2023-05-29 19:53:06,642] [INFO] [utils.py:830:see_memory_usage] MA 37.69 GB         Max_MA 37.69 GB         CA 41.47 GB         Max_CA 41 GB 
[2023-05-29 19:53:06,644] [INFO] [utils.py:838:see_memory_usage] CPU Virtual Memory:  used = 81.17 GB, percent = 4.0%
Error executing job with overrides: ['local_rank=0']
Traceback (most recent call last):
  File "/mnt/bn/slp-llm/sft_huihui/pandallm/trainer_base_ds_mul.py", line 437, in <module>
    main()
  File "/root/miniconda3/lib/python3.10/site-packages/hydra/main.py", line 94, in decorated_main
    _run_hydra(
  File "/root/miniconda3/lib/python3.10/site-packages/hydra/_internal/utils.py", line 394, in _run_hydra
    _run_app(
  File "/root/miniconda3/lib/python3.10/site-packages/hydra/_internal/utils.py", line 457, in _run_app
    run_and_report(
  File "/root/miniconda3/lib/python3.10/site-packages/hydra/_internal/utils.py", line 223, in run_and_report
    raise ex
  File "/root/miniconda3/lib/python3.10/site-packages/hydra/_internal/utils.py", line 220, in run_and_report
    return func()
  File "/root/miniconda3/lib/python3.10/site-packages/hydra/_internal/utils.py", line 458, in <lambda>
    lambda: hydra.run(
  File "/root/miniconda3/lib/python3.10/site-packages/hydra/_internal/hydra.py", line 132, in run
    _ = ret.return_value
  File "/root/miniconda3/lib/python3.10/site-packages/hydra/core/utils.py", line 260, in return_value
    raise self._return_value
  File "/root/miniconda3/lib/python3.10/site-packages/hydra/core/utils.py", line 186, in run_job
    ret.return_value = task_function(task_cfg)
  File "/mnt/bn/slp-llm/sft_huihui/pandallm/trainer_base_ds_mul.py", line 367, in main
    global_step, tr_loss = train(cfg, model, tokenizer, continue_from_global_step)
  File "/mnt/bn/slp-llm/sft_huihui/pandallm/trainer_base_ds_mul.py", line 168, in train
    model, optimizer, _, scheduler = deepspeed.initialize(model=model,
  File "/root/miniconda3/lib/python3.10/site-packages/deepspeed/__init__.py", line 125, in initialize
    engine = DeepSpeedEngine(args=args,
  File "/root/miniconda3/lib/python3.10/site-packages/deepspeed/runtime/engine.py", line 340, in __init__
    self._configure_optimizer(optimizer, model_parameters)
  File "/root/miniconda3/lib/python3.10/site-packages/deepspeed/runtime/engine.py", line 1298, in _configure_optimizer
    self.optimizer = self._configure_zero_optimizer(basic_optimizer)
  File "/root/miniconda3/lib/python3.10/site-packages/deepspeed/runtime/engine.py", line 1599, in _configure_zero_optimizer
    optimizer = DeepSpeedZeroOptimizer_Stage3(
  File "/root/miniconda3/lib/python3.10/site-packages/deepspeed/runtime/zero/stage3.py", line 312, in __init__
    self._setup_for_real_optimizer()
  File "/root/miniconda3/lib/python3.10/site-packages/deepspeed/runtime/zero/stage3.py", line 371, in _setup_for_real_optimizer
    self.initialize_optimizer_states()
  File "/root/miniconda3/lib/python3.10/site-packages/deepspeed/runtime/zero/stage3.py", line 938, in initialize_optimizer_states
    self._optimizer_step(i)
  File "/root/miniconda3/lib/python3.10/site-packages/deepspeed/runtime/zero/stage3.py", line 858, in _optimizer_step
    self.optimizer.step()
  File "/root/miniconda3/lib/python3.10/site-packages/torch/optim/optimizer.py", line 280, in wrapper
    out = func(*args, **kwargs)
  File "/root/miniconda3/lib/python3.10/site-packages/deepspeed/ops/adam/fused_adam.py", line 137, in step
    state['exp_avg_sq'] = torch.zeros_like(p.data)
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 3.77 GiB (GPU 0; 79.35 GiB total capacity; 75.35 GiB already allocated; 2.95 GiB free; 75.38 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
╭─────────────────────────────── Traceback (most recent call last) ────────────────────────────────╮
│ /mnt/bn/slp-llm/sft_huihui/pandallm/trainer_base_ds_mul.py:437 in <module>                       │
│                                                                                                  │
│   434 │   │   │   hydra_formatted_args.append(arg)                                               │
│   435 │   sys.argv = hydra_formatted_args                                                        │
│   436 │   print(sys.argv)                                                                        │
│ ❱ 437 │   main()                                                                                 │
│   438                                                                                            │
│                                                                                                  │
│ /root/miniconda3/lib/python3.10/site-packages/hydra/main.py:94 in decorated_main                 │
│                                                                                                  │
│    91 │   │   │   │   else:                                                                      │
│    92 │   │   │   │   │   # no return value from run_hydra() as it may sometime actually run t   │
│    93 │   │   │   │   │   # multiple times (--multirun)                                          │
│ ❱  94 │   │   │   │   │   _run_hydra(                                                            │
│    95 │   │   │   │   │   │   args=args,                                                         │
│    96 │   │   │   │   │   │   args_parser=args_parser,                                           │
│    97 │   │   │   │   │   │   task_function=task_function,                                       │
│                                                                                                  │
│ /root/miniconda3/lib/python3.10/site-packages/hydra/_internal/utils.py:394 in _run_hydra         │
│                                                                                                  │
│   391 │   │                                                                                      │
│   392 │   │   if args.run or args.multirun:                                                      │
│   393 │   │   │   run_mode = hydra.get_mode(config_name=config_name, overrides=overrides)        │
│ ❱ 394 │   │   │   _run_app(                                                                      │
│   395 │   │   │   │   run=args.run,                                                              │
│   396 │   │   │   │   multirun=args.multirun,                                                    │
│   397 │   │   │   │   mode=run_mode,                                                             │
│                                                                                                  │
│ /root/miniconda3/lib/python3.10/site-packages/hydra/_internal/utils.py:457 in _run_app           │
│                                                                                                  │
│   454 │   │   │   overrides.extend(["hydra.mode=MULTIRUN"])                                      │
│   455 │                                                                                          │
│   456 │   if mode == RunMode.RUN:                                                                │
│ ❱ 457 │   │   run_and_report(                                                                    │
│   458 │   │   │   lambda: hydra.run(                                                             │
│   459 │   │   │   │   config_name=config_name,                                                   │
│   460 │   │   │   │   task_function=task_function,                                               │
│                                                                                                  │
│ /root/miniconda3/lib/python3.10/site-packages/hydra/_internal/utils.py:223 in run_and_report     │
│                                                                                                  │
│   220 │   │   return func()                                                                      │
│   221 │   except Exception as ex:                                                                │
│   222 │   │   if _is_env_set("HYDRA_FULL_ERROR") or is_under_debugger():                         │
│ ❱ 223 │   │   │   raise ex                                                                       │
│   224 │   │   else:                                                                              │
│   225 │   │   │   try:                                                                           │
│   226 │   │   │   │   if isinstance(ex, CompactHydraException):                                  │
│                                                                                                  │
│ /root/miniconda3/lib/python3.10/site-packages/hydra/_internal/utils.py:220 in run_and_report     │
│                                                                                                  │
│   217                                                                                            │
│   218 def run_and_report(func: Any) -> Any:                                                      │
│   219 │   try:                                                                                   │
│ ❱ 220 │   │   return func()                                                                      │
│   221 │   except Exception as ex:                                                                │
│   222 │   │   if _is_env_set("HYDRA_FULL_ERROR") or is_under_debugger():                         │
│   223 │   │   │   raise ex                                                                       │
│                                                                                                  │
│ /root/miniconda3/lib/python3.10/site-packages/hydra/_internal/utils.py:458 in <lambda>           │
│                                                                                                  │
│   455 │                                                                                          │
│   456 │   if mode == RunMode.RUN:                                                                │
│   457 │   │   run_and_report(                                                                    │
│ ❱ 458 │   │   │   lambda: hydra.run(                                                             │
│   459 │   │   │   │   config_name=config_name,                                                   │
│   460 │   │   │   │   task_function=task_function,                                               │
│   461 │   │   │   │   overrides=overrides,                                                       │
│                                                                                                  │
│ /root/miniconda3/lib/python3.10/site-packages/hydra/_internal/hydra.py:132 in run                │
│                                                                                                  │
│   129 │   │   callbacks.on_run_end(config=cfg, config_name=config_name, job_return=ret)          │
│   130 │   │                                                                                      │
│   131 │   │   # access the result to trigger an exception in case the job failed.                │
│ ❱ 132 │   │   _ = ret.return_value                                                               │
│   133 │   │                                                                                      │
│   134 │   │   return ret                                                                         │
│   135                                                                                            │
│                                                                                                  │
│ /root/miniconda3/lib/python3.10/site-packages/hydra/core/utils.py:260 in return_value            │
│                                                                                                  │
│   257 │   │   │   sys.stderr.write(                                                              │
│   258 │   │   │   │   f"Error executing job with overrides: {self.overrides}" + os.linesep       │
│   259 │   │   │   )                                                                              │
│ ❱ 260 │   │   │   raise self._return_value                                                       │
│   261 │                                                                                          │
│   262 │   @return_value.setter                                                                   │
│   263 │   def return_value(self, value: Any) -> None:                                            │
│                                                                                                  │
│ /root/miniconda3/lib/python3.10/site-packages/hydra/core/utils.py:186 in run_job                 │
│                                                                                                  │
│   183 │   │   with env_override(hydra_cfg.hydra.job.env_set):                                    │
│   184 │   │   │   callbacks.on_job_start(config=config, task_function=task_function)             │
│   185 │   │   │   try:                                                                           │
│ ❱ 186 │   │   │   │   ret.return_value = task_function(task_cfg)                                 │
│   187 │   │   │   │   ret.status = JobStatus.COMPLETED                                           │
│   188 │   │   │   except Exception as e:                                                         │
│   189 │   │   │   │   ret.return_value = e                                                       │
│                                                                                                  │
│ /mnt/bn/slp-llm/sft_huihui/pandallm/trainer_base_ds_mul.py:367 in main                           │
│                                                                                                  │
│   364 │   │   │   logger.info("Resuming training from the latest checkpoint: %s", checkpoint)    │
│   365 │   │   │   continue_from_global_step = int(checkpoint.split('-')[-1])                     │
│   366 │   │                                                                                      │
│ ❱ 367 │   │   global_step, tr_loss = train(cfg, model, tokenizer, continue_from_global_step)     │
│   368 │   │   logger.info(" global_step = %s, average loss = %s", global_step, tr_loss)          │
│   369 │                                                                                          │
│   370 │   # Test                                                                                 │
│                                                                                                  │
│ /mnt/bn/slp-llm/sft_huihui/pandallm/trainer_base_ds_mul.py:168 in train                          │
│                                                                                                  │
│   165 │   #      'weight_decay': 0.0}                                                            │
│   166 │   # ]                                                                                    │
│   167 │   torch.compile(model, mode="max-autotune")                                              │
│ ❱ 168 │   model, optimizer, _, scheduler = deepspeed.initialize(model=model,                     │
│   169 │   │   │   │   │   │   │   │   │   │   │   │   │   │     model_parameters=model.paramet   │
│   170 │   │   │   │   │   │   │   │   │   │   │   │   │   │     config=ds_config)                │
│   171 │   logger.info(optimizer.optimizer)                                                       │
│                                                                                                  │
│ /root/miniconda3/lib/python3.10/site-packages/deepspeed/__init__.py:125 in initialize            │
│                                                                                                  │
│   122 │   assert model is not None, "deepspeed.initialize requires a model"                      │
│   123 │                                                                                          │
│   124 │   if not isinstance(model, PipelineModule):                                              │
│ ❱ 125 │   │   engine = DeepSpeedEngine(args=args,                                                │
│   126 │   │   │   │   │   │   │   │    model=model,                                              │
│   127 │   │   │   │   │   │   │   │    optimizer=optimizer,                                      │
│   128 │   │   │   │   │   │   │   │    model_parameters=model_parameters,                        │
│                                                                                                  │
│ /root/miniconda3/lib/python3.10/site-packages/deepspeed/runtime/engine.py:340 in __init__        │
│                                                                                                  │
│    337 │   │   │   model_parameters = list(model_parameters)                                     │
│    338 │   │                                                                                     │
│    339 │   │   if has_optimizer:                                                                 │
│ ❱  340 │   │   │   self._configure_optimizer(optimizer, model_parameters)                        │
│    341 │   │   │   self._configure_lr_scheduler(lr_scheduler)                                    │
│    342 │   │   │   self._report_progress(0)                                                      │
│    343 │   │   elif self.zero_optimization():                                                    │
│                                                                                                  │
│ /root/miniconda3/lib/python3.10/site-packages/deepspeed/runtime/engine.py:1298 in                │
│ _configure_optimizer                                                                             │
│                                                                                                  │
│   1295 │   │   optimizer_wrapper = self._do_optimizer_sanity_check(basic_optimizer)              │
│   1296 │   │                                                                                     │
│   1297 │   │   if optimizer_wrapper == ZERO_OPTIMIZATION:                                        │
│ ❱ 1298 │   │   │   self.optimizer = self._configure_zero_optimizer(basic_optimizer)              │
│   1299 │   │   elif optimizer_wrapper == AMP:                                                    │
│   1300 │   │   │   amp_params = self.amp_params()                                                │
│   1301 │   │   │   log_dist(f"Initializing AMP with these params: {amp_params}", ranks=[0])      │
│                                                                                                  │
│ /root/miniconda3/lib/python3.10/site-packages/deepspeed/runtime/engine.py:1599 in                │
│ _configure_zero_optimizer                                                                        │
│                                                                                                  │
│   1596 │   │   │   │   log_dist(f'Creating {model_dtype} ZeRO stage {zero_stage} optimizer',     │
│   1597 │   │   │   │   │   │    ranks=[0])                                                       │
│   1598 │   │   │   │   from deepspeed.runtime.zero.stage3 import DeepSpeedZeroOptimizer_Stage3   │
│ ❱ 1599 │   │   │   │   optimizer = DeepSpeedZeroOptimizer_Stage3(                                │
│   1600 │   │   │   │   │   self.module,                                                          │
│   1601 │   │   │   │   │   optimizer,                                                            │
│   1602 │   │   │   │   │   timers=timers,                                                        │
│                                                                                                  │
│ /root/miniconda3/lib/python3.10/site-packages/deepspeed/runtime/zero/stage3.py:312 in __init__   │
│                                                                                                  │
│    309 │   │   │   f'Largest partitioned param numel = {largest_partitioned_param_numel}',       │
│    310 │   │   │   force=False)                                                                  │
│    311 │   │                                                                                     │
│ ❱  312 │   │   self._setup_for_real_optimizer()                                                  │
│    313 │   │   self.grad_position = {}                                                           │
│    314 │   │   self.set_grad_positions()                                                         │
│    315                                                                                           │
│                                                                                                  │
│ /root/miniconda3/lib/python3.10/site-packages/deepspeed/runtime/zero/stage3.py:371 in            │
│ _setup_for_real_optimizer                                                                        │
│                                                                                                  │
│    368 │   │                                                                                     │
│    369 │   │   see_memory_usage("Before initializing optimizer states", force=True)              │
│    370 │   │                                                                                     │
│ ❱  371 │   │   self.initialize_optimizer_states()                                                │
│    372 │   │   see_memory_usage("After initializing optimizer states", force=True)               │
│    373 │   │   dist.barrier()                                                                    │
│    374                                                                                           │
│                                                                                                  │
│ /root/miniconda3/lib/python3.10/site-packages/deepspeed/runtime/zero/stage3.py:938 in            │
│ initialize_optimizer_states                                                                      │
│                                                                                                  │
│    935 │   │   │   │   │   0,                                                                    │
│    936 │   │   │   │   │   num_elements)                                                         │
│    937 │   │   │                                                                                 │
│ ❱  938 │   │   │   self._optimizer_step(i)                                                       │
│    939 │   │   │                                                                                 │
│    940 │   │   │   if swappable_param_subgroup:                                                  │
│    941 │   │   │   │   self._partitioned_params_swap_out(i)                                      │
│                                                                                                  │
│ /root/miniconda3/lib/python3.10/site-packages/deepspeed/runtime/zero/stage3.py:858 in            │
│ _optimizer_step                                                                                  │
│                                                                                                  │
│    855 │   │   fp32_param = self.fp32_partitioned_groups_flat[sub_group_id]                      │
│    856 │   │   self.optimizer.param_groups[param_group_id]['params'] = [fp32_param]              │
│    857 │   │                                                                                     │
│ ❱  858 │   │   self.optimizer.step()                                                             │
│    859 │   │   self.optimizer.param_groups[param_group_id]['params'] = []                        │
│    860 │                                                                                         │
│    861 │   def _swappable_optimizer_subgroup(self, sub_group_id):                                │
│                                                                                                  │
│ /root/miniconda3/lib/python3.10/site-packages/torch/optim/optimizer.py:280 in wrapper            │
│                                                                                                  │
│   277 │   │   │   │   │   │   │   raise RuntimeError(f"{func} must return None or a tuple of (   │
│   278 │   │   │   │   │   │   │   │   │   │   │      f"but got {result}.")                       │
│   279 │   │   │   │                                                                              │
│ ❱ 280 │   │   │   │   out = func(*args, **kwargs)                                                │
│   281 │   │   │   │   self._optimizer_step_code()                                                │
│   282 │   │   │   │                                                                              │
│   283 │   │   │   │   # call optimizer step post hooks                                           │
│                                                                                                  │
│ /root/miniconda3/lib/python3.10/site-packages/deepspeed/ops/adam/fused_adam.py:137 in step       │
│                                                                                                  │
│   134 │   │   │   │   │   # Exponential moving average of gradient values                        │
│   135 │   │   │   │   │   state['exp_avg'] = torch.zeros_like(p.data)                            │
│   136 │   │   │   │   │   # Exponential moving average of squared gradient values                │
│ ❱ 137 │   │   │   │   │   state['exp_avg_sq'] = torch.zeros_like(p.data)                         │
│   138 │   │   │   │                                                                              │
│   139 │   │   │   │   if p.dtype == torch.float16:                                               │
│   140 │   │   │   │   │   g_16.append(p.grad.data)                                               │
╰──────────────────────────────────────────────────────────────────────────────────────────────────╯
OutOfMemoryError: CUDA out of memory. Tried to allocate 3.77 GiB (GPU 0; 79.35 GiB total capacity; 75.35 GiB already allocated; 2.95 
GiB free; 75.38 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid 
fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
[2023-05-29 19:53:11,094] [INFO] [launch.py:318:sigkill_handler] Killing subprocess 2250
[2023-05-29 19:53:11,094] [ERROR] [launch.py:324:sigkill_handler] ['/root/miniconda3/bin/python', '-u', 'trainer_base_ds_mul.py', '--local_rank=0', '-cp', 'conf/llama/zh', '-cn', 'llama_7b_zh_instruct_coig_sft_v1_0_ds.yaml'] exits with return code = 1

SparkJiao commented 1 year ago

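It looks like GPU 0 can't even hold the model weights, and training hasn't started yet. That shouldn't be possible, so could you first run nvidia-smi to confirm that GPU 0 actually has enough free memory?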
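The nvidia-smi check can also be scripted, which helps when another process may be holding memory on the same card. The sketch below is a minimal example, assuming `nvidia-smi` is on the `PATH`; the helper names `parse_gpu_memory` and `query_gpu_memory` are hypothetical and not part of this repository.

```python
import subprocess

# nvidia-smi query flags: one CSV line per GPU, values in MiB, no header or units.
QUERY = [
    "nvidia-smi",
    "--query-gpu=index,memory.used,memory.total",
    "--format=csv,noheader,nounits",
]


def parse_gpu_memory(csv_text):
    """Parse 'index, used, total' lines into {gpu_index: (used_mib, total_mib)}."""
    usage = {}
    for line in csv_text.strip().splitlines():
        index, used, total = (field.strip() for field in line.split(","))
        usage[int(index)] = (int(used), int(total))
    return usage


def query_gpu_memory():
    """Run nvidia-smi and return per-GPU memory usage (requires an NVIDIA driver)."""
    result = subprocess.run(QUERY, capture_output=True, text=True, check=True)
    return parse_gpu_memory(result.stdout)
```

Calling `query_gpu_memory()[0]` just before launching training returns the used and total MiB for GPU 0. If the used value is already large, something else is occupying the card.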

HuihuiChyan commented 1 year ago

Sure, here is the information for my GPU 0:

SparkJiao commented 1 year ago

Could you try again now? The error log looks like it is already an hour old.

HuihuiChyan commented 1 year ago

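Thank you for the solution. There is actually a question I have been curious about and would like to ask you: why are most open-source models based on 7B and 13B models, while 30B and 60B models are rare? Is there some kind of gap between 13B and 30B?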

SparkJiao commented 1 year ago

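A 30B model is hard to fit into 80G of GPU memory. It needs further changes such as model parallelism or tensor parallelism, and those do not work out of the box: you have to write code specific to the model architecture, which is fairly difficult. Someone who is just a hobbyist is unlikely to put in the effort to learn it.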
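To make the size gap concrete, here is a rough back-of-the-envelope estimate. It assumes mixed-precision Adam with the commonly cited 16 bytes of persistent state per parameter (fp16 weights and gradients, plus fp32 master weights and two fp32 Adam moments), uses nominal parameter counts, and ignores activations and buffers. The function names are hypothetical.

```python
GIB = 1024 ** 3

# Assumed persistent bytes per parameter for mixed-precision Adam training.
BYTES_PER_PARAM = {
    "fp16_weights": 2,
    "fp16_gradients": 2,
    "fp32_master_weights": 4,
    "fp32_exp_avg": 4,      # Adam first moment
    "fp32_exp_avg_sq": 4,   # Adam second moment
}


def training_state_gib(num_params):
    """Return the estimated size in GiB of each persistent training tensor."""
    return {name: num_params * nbytes / GIB for name, nbytes in BYTES_PER_PARAM.items()}


def total_gib(num_params):
    """Return the estimated total persistent training state in GiB."""
    return sum(training_state_gib(num_params).values())


for label, n in [("7B", 7e9), ("30B", 30e9)]:
    weights = training_state_gib(n)["fp16_weights"]
    print(f"{label}: fp16 weights {weights:.0f} GiB, full training state {total_gib(n):.0f} GiB")
```

Under these assumptions a 7B model needs roughly 104 GiB of training state, which is why a single 80G card depends on offloading. A 30B model needs about 56 GiB for the fp16 weights alone and roughly 447 GiB in total, so the model has to be split across devices.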

dandelionsllm / pandallm

OOM when training a 7B model on a single A100 #25