InternLM / xtuner

An efficient, flexible and full-featured toolkit for fine-tuning LLM (InternLM2, Llama3, Phi3, Qwen, Mistral, ...)
https://xtuner.readthedocs.io/zh-cn/latest/
Apache License 2.0

How can I do full-parameter fine-tuning of the model with FP16 #161

Closed yuqie closed 1 year ago

yuqie commented 1 year ago

I modified llama2_7b_full_wizardlm_e1_copy.py to use alpaca_dataset and added torch_dtype=torch.float16 to the model loading, as follows:

model = dict(
    type=SupervisedFinetune,
    llm=dict(
        type=AutoModelForCausalLM.from_pretrained,
        pretrained_model_name_or_path=pretrained_model_name_or_path,
        trust_remote_code=True,
        torch_dtype=torch.float16
        )
    )

If I run the script with DeepSpeed it works fine, but if I run it without DeepSpeed I get the following error:

  File "/root/anaconda3/envs/xtuner-env/lib/python3.10/site-packages/mmengine/optim/optimizer/amp_optimizer_wrapper.py", line 136, in step
    self.loss_scaler.unscale_(self.optimizer)
  File "/root/anaconda3/envs/xtuner-env/lib/python3.10/site-packages/torch/cuda/amp/grad_scaler.py", line 307, in unscale_
    optimizer_state["found_inf_per_device"] = self._unscale_grads_(
  File "/root/anaconda3/envs/xtuner-env/lib/python3.10/site-packages/torch/cuda/amp/grad_scaler.py", line 229, in _unscale_grads_
    raise ValueError("Attempting to unscale FP16 gradients.")
ValueError: Attempting to unscale FP16 gradients.

The package versions are as below:

transformers    4.33.0
peft            0.5.0
torch           2.1.0
CUDA Version    12.2
Python          3.10.13

I am using 80G A800 GPUs and llama2-7b. If torch_dtype=torch.float16 is removed, an OOM occurs. Could anyone help me with this problem? Any suggestion would be appreciated.
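
For context, this error is not specific to xtuner: PyTorch's GradScaler raises it whenever the gradients it is asked to unscale are already fp16, which is what happens when the weights themselves are loaded in torch.float16. A minimal repro sketch (not xtuner code; assumes a CUDA device is available):

import torch

# fp16 weights produce fp16 gradients, which GradScaler refuses to unscale.
model = torch.nn.Linear(8, 8).cuda().half()
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
scaler = torch.cuda.amp.GradScaler()

loss = model(torch.randn(4, 8, device='cuda', dtype=torch.float16)).sum()
scaler.scale(loss).backward()
scaler.unscale_(optimizer)   # raises ValueError: Attempting to unscale FP16 gradients.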

LZHgrla commented 1 year ago

@yuqie Hi! ~If you are not using DeepSpeed, change the AMP optimizer to a normal one.~

Full-parameter fine-tuning relies on the fp16/bf16 training optimization of DeepSpeed, so it requires launching with DeepSpeed. If you are using a single card, you can try deepspeed_zero2 or deepspeed_zero2_offload; with multiple cards, you can try deepspeed_zero3 or deepspeed_zero3_offload!
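
For example, the bundled configs can be passed straight to the xtuner launcher (a sketch; the config filename is just the one from this issue, adjust to your setup):

xtuner train llama2_7b_full_wizardlm_e1_copy.py --deepspeed deepspeed_zero2
NPROC_PER_NODE=8 xtuner train llama2_7b_full_wizardlm_e1_copy.py --deepspeed deepspeed_zero3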

Moreover, the default setting of our DeepSpeed JSON config is fp16; you can change it to bf16 for more stable training!

- "fp16": {
+ "bf16": {
    "enabled": true,
    "initial_scale_power": 16
  }
yuqie commented 1 year ago

@yuqie Hi! ~If you are not using DeepSpeed, change the AMP optimizer to a normal one.~

Full-parameter fine-tuning relies on the fp16/bf16 training optimization of DeepSpeed, so it requires launching with DeepSpeed. If you are using a single card, you can try deepspeed_zero2 or deepspeed_zero2_offload; with multiple cards, you can try deepspeed_zero3 or deepspeed_zero3_offload!

Moreover, the default setting of our DeepSpeed JSON config is fp16; you can change it to bf16 for more stable training!

- "fp16": {
+ "bf16": {
    "enabled": true,
    "initial_scale_power": 16
  }

@LZHgrla Thanks for your answer.

When I tried DeepSpeed ZeRO-3 with multiple cards to fine-tune llama2 on alpaca, I encountered the following error. low_cpu_mem_usage is set to False in the config file, and I did not pass a device_map parameter in TrainingArguments or in the model loading. Is it related to the versions of deepspeed, transformers, or peft?

Traceback (most recent call last):
  File "/root/anaconda3/envs/xtuner-env/lib/python3.10/site-packages/xtuner/tools/train.py", line 246, in <module>
    main()
  File "/root/anaconda3/envs/xtuner-env/lib/python3.10/site-packages/xtuner/tools/train.py", line 116, in main
    model = BUILDER.build(cfg.model)
  File "/root/anaconda3/envs/xtuner-env/lib/python3.10/site-packages/mmengine/registry/registry.py", line 570, in build
    return self.build_func(cfg, *args, **kwargs, registry=self)
  File "/root/anaconda3/envs/xtuner-env/lib/python3.10/site-packages/mmengine/registry/build_functions.py", line 121, in build_from_cfg
    obj = obj_cls(**args)  # type: ignore
  File "/root/anaconda3/envs/xtuner-env/lib/python3.10/site-packages/transformers/models/auto/auto_factory.py", line 563, in from_pretrained
    return model_class.from_pretrained(
  File "/root/anaconda3/envs/xtuner-env/lib/python3.10/site-packages/transformers/modeling_utils.py", line 2445, in from_pretrained
    raise ValueError(
ValueError: DeepSpeed Zero-3 is not compatible with `low_cpu_mem_usage=True` or with passing a `device_map`.
LZHgrla commented 1 year ago

@yuqie This is strange. Could you please paste the versions of deepspeed, transformers, and the training command?

yuqie commented 1 year ago

transformers 4.33.0 peft 0.5.0 torch 2.1.0 CUDA Version 12.2 Python 3.10.13

Hi, the versions are:

deepspeed       0.11.1
transformers    4.33.0
peft            0.5.0
torch           2.1.0
CUDA Version    12.2
Python          3.10.13

and I tried the HF framework and mmengine with the following commands:

NPROC_PER_NODE=8 xtuner train llama2_7b_qlora_alpaca_hf.py 
NPROC_PER_NODE=8 xtuner train llama2_7b_qlora_alpaca.py --deepspeed ./zero3.json

llama2_7b_qlora_alpaca_hf.py

# Copyright (c) OpenMMLab. All rights reserved.
import torch
from datasets import load_dataset
from peft import LoraConfig
from transformers import (AutoModelForCausalLM, AutoTokenizer, Trainer,
                          BitsAndBytesConfig, TrainingArguments)

from xtuner.dataset import process_hf_dataset
from xtuner.dataset.map_fns import alpaca_map_fn, template_map_fn_factory
from xtuner.utils import PROMPT_TEMPLATE

framework = 'huggingface'

pretrained_model_name_or_path = './local/llama-2-7b-hf/'
dataset_name_or_path = './local/data/'
max_length = 2048
pack_to_max_length = True
prompt_template = PROMPT_TEMPLATE.alpaca

trainer = Trainer

deepspeed_config='./zero3.json'

training_args = dict(
    type=TrainingArguments,
    do_train=True,
    learning_rate=3e-4,
    weight_decay=0,
    lr_scheduler_type='cosine',
    warmup_steps=100,
    per_device_train_batch_size=8,
    gradient_accumulation_steps=16,
    num_train_epochs=1,
    fp16=True,
    logging_steps=1,
    optim='adamw_torch',
    save_strategy='steps',
    save_steps=1000,
    save_total_limit=2,
    deepspeed =deepspeed_config,
    ddp_find_unused_parameters=False)

tokenizer = dict(
    type=AutoTokenizer.from_pretrained,
    pretrained_model_name_or_path=pretrained_model_name_or_path,
    trust_remote_code=True,
    padding_side='right')

model = dict(
    type=AutoModelForCausalLM.from_pretrained,
    pretrained_model_name_or_path=pretrained_model_name_or_path,
    trust_remote_code=True,
    torch_dtype=torch.float16,
    quantization_config=dict(
        type=BitsAndBytesConfig,
        load_in_4bit=True,
        load_in_8bit=False,
        llm_int8_threshold=6.0,
        llm_int8_has_fp16_weight=False,
        bnb_4bit_compute_dtype=torch.float16,
        bnb_4bit_use_double_quant=True,
        bnb_4bit_quant_type='nf4'))

lora = dict(
    type=LoraConfig,
    r=16,
    lora_alpha=16,
    lora_dropout=0.05,
    target_modules=['gate_proj', 'down_proj', 'up_proj'],
    bias='none',
    task_type='CAUSAL_LM')

train_dataset = dict(
    type=process_hf_dataset,
    dataset=dict(type=load_dataset, path=dataset_name_or_path),
    tokenizer=tokenizer,
    max_length=max_length,
    dataset_map_fn=alpaca_map_fn,
    template_map_fn=dict(
        type=template_map_fn_factory, template=prompt_template),
    remove_unused_columns=True,
    shuffle_before_pack=True,
    pack_to_max_length=pack_to_max_length)

llama2_7b_qlora_alpaca.py

# Copyright (c) OpenMMLab. All rights reserved.
import torch
from bitsandbytes.optim import PagedAdamW32bit
from datasets import load_dataset
from mmengine.dataset import DefaultSampler
from mmengine.hooks import (CheckpointHook, DistSamplerSeedHook, IterTimerHook,
                            LoggerHook, ParamSchedulerHook)
from mmengine.optim import AmpOptimWrapper, CosineAnnealingLR
from peft import LoraConfig
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          BitsAndBytesConfig)

from xtuner.dataset import process_hf_dataset
from xtuner.dataset.collate_fns import default_collate_fn
from xtuner.dataset.map_fns import alpaca_map_fn, template_map_fn_factory
from xtuner.engine import DatasetInfoHook, EvaluateChatHook
from xtuner.model import SupervisedFinetune
from xtuner.utils import PROMPT_TEMPLATE

#######################################################################
#                          PART 1  Settings                           #
#######################################################################
# Model
pretrained_model_name_or_path = './local/llama-2-7b-hf/'

# Data
alpaca_en_path = './local/data/'
prompt_template = PROMPT_TEMPLATE.alpaca
max_length = 2048
pack_to_max_length = True

# Scheduler & Optimizer
batch_size = 8  # per_device
accumulative_counts = 16
dataloader_num_workers = 0
max_epochs = 1
optim_type = PagedAdamW32bit
lr = 2e-4
betas = (0.9, 0.999)
weight_decay = 0
max_norm = 1  # grad clip

# Evaluate the generation performance during the training
evaluation_freq = 50
evaluation_inputs = [
    '请给我介绍五个上海的景点', 'Please tell me five scenic spots in Shanghai'
]

#######################################################################
#                      PART 2  Model & Tokenizer                      #
#######################################################################
tokenizer = dict(
    type=AutoTokenizer.from_pretrained,
    pretrained_model_name_or_path=pretrained_model_name_or_path,
    trust_remote_code=True,
    padding_side='right')

model = dict(
    type=SupervisedFinetune,
    llm=dict(
        type=AutoModelForCausalLM.from_pretrained,
        pretrained_model_name_or_path=pretrained_model_name_or_path,
        trust_remote_code=True,
        torch_dtype=torch.float16,
        quantization_config=dict(
            type=BitsAndBytesConfig,
            load_in_4bit=True,
            load_in_8bit=False,
            llm_int8_threshold=6.0,
            llm_int8_has_fp16_weight=False,
            bnb_4bit_compute_dtype=torch.float16,
            bnb_4bit_use_double_quant=True,
            bnb_4bit_quant_type='nf4')),
    lora=dict(
        type=LoraConfig,
        r=64,
        lora_alpha=16,
        lora_dropout=0.1,
        bias='none',
        task_type='CAUSAL_LM'))

#######################################################################
#                      PART 3  Dataset & Dataloader                   #
#######################################################################
alpaca_en = dict(
    type=process_hf_dataset,
    dataset=dict(type=load_dataset, path=alpaca_en_path),
    tokenizer=tokenizer,
    max_length=max_length,
    dataset_map_fn=alpaca_map_fn,
    template_map_fn=dict(
        type=template_map_fn_factory, template=prompt_template),
    remove_unused_columns=True,
    shuffle_before_pack=True,
    pack_to_max_length=pack_to_max_length)

train_dataloader = dict(
    batch_size=batch_size,
    num_workers=dataloader_num_workers,
    dataset=alpaca_en,
    sampler=dict(type=DefaultSampler, shuffle=True),
    collate_fn=dict(type=default_collate_fn))

#######################################################################
#                    PART 4  Scheduler & Optimizer                    #
#######################################################################
# optimizer
optim_wrapper = dict(
    type=AmpOptimWrapper,
    optimizer=dict(
        type=optim_type, lr=lr, betas=betas, weight_decay=weight_decay),
    clip_grad=dict(max_norm=max_norm, error_if_nonfinite=False),
    accumulative_counts=accumulative_counts,
    loss_scale='dynamic',
    dtype='float16')

# learning policy
# More information: https://github.com/open-mmlab/mmengine/blob/main/docs/en/tutorials/param_scheduler.md  # noqa: E501
param_scheduler = dict(
    type=CosineAnnealingLR,
    eta_min=lr * 0.1,
    by_epoch=True,
    T_max=max_epochs,
    convert_to_iter_based=True)

# train, val, test setting
train_cfg = dict(by_epoch=True, max_epochs=max_epochs, val_interval=1)

#######################################################################
#                           PART 5  Runtime                           #
#######################################################################
# Log the dialogue periodically during the training process, optional
custom_hooks = [
    dict(type=DatasetInfoHook, tokenizer=tokenizer),
    dict(
        type=EvaluateChatHook,
        tokenizer=tokenizer,
        every_n_iters=evaluation_freq,
        evaluation_inputs=evaluation_inputs,
        instruction=prompt_template.INSTRUCTION_START)
]

# configure default hooks
default_hooks = dict(
    # record the time of every iteration.
    timer=dict(type=IterTimerHook),
    # print log every 10 iterations.
    logger=dict(type=LoggerHook, interval=10),
    # enable the parameter scheduler.
    param_scheduler=dict(type=ParamSchedulerHook),
    # save checkpoint per epoch.
    checkpoint=dict(type=CheckpointHook, interval=1),
    # set sampler seed in distributed environment.
    sampler_seed=dict(type=DistSamplerSeedHook),
)

# configure environment
env_cfg = dict(
    # whether to enable cudnn benchmark
    cudnn_benchmark=False,
    # set multi process parameters
    mp_cfg=dict(mp_start_method='fork', opencv_num_threads=0),
    # set distributed parameters
    dist_cfg=dict(backend='nccl'),
)

# set visualizer
visualizer = None

# set log level
log_level = 'INFO'

# load from which checkpoint
load_from = None

# whether to resume training from the loaded checkpoint
resume = False

# Defaults to use random seed and disable `deterministic`
randomness = dict(seed=None, deterministic=False)

zero3.json

{
  "gradient_accumulation_steps": "auto",
  "train_micro_batch_size_per_gpu": "auto",
  "gradient_clipping": "auto",
  "zero_allow_untested_optimizer": true,
  "zero_force_ds_cpu_optimizer": false,
  "zero_optimization": {
    "stage": 3,
    "contiguous_gradients": false,
    "allgather_bucket_size": 3e8,
    "reduce_bucket_size": 3e8,
    "overlap_comm": true,
    "reduce_scatter": true,
    "stage3_gather_16bit_weights_on_model_save": true
  },
  "low_cpu_mem_usage": false,
  "fp16": {
    "enabled": true,
    "initial_scale_power": 16
  }
}
LZHgrla commented 1 year ago

@yuqie It seems that you are fine-tuning with QLoRA, not full-parameter. For QLoRA, DeepSpeed only supports the ZeRO-2 setting. You can try DeepSpeed ZeRO-2.

yuqie commented 1 year ago

@yuqie It seems that you are fine-tuning with QLoRA, not full-parameter. For QLoRA, DeepSpeed only supports the ZeRO-2 setting. You can try DeepSpeed ZeRO-2.

Yes, I tried QLoRA with DeepSpeed ZeRO-3.

As I understand it, QLoRA only supports ZeRO-2, or QLoRA without any DeepSpeed strategy. What about full-parameter or LoRA: does LoRA support other DeepSpeed strategies? And apart from ZeRO-2, ZeRO-3, ZeRO-2 offload, and ZeRO-3 offload, does full-parameter fine-tuning support ZeRO-1?

LZHgrla commented 1 year ago

Full-parameter fine-tuning supports all strategies. With LoRA or QLoRA, since there are frozen parameters, it doesn't support ZeRO-3.
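
To see this frozen/trainable split concretely, PEFT can report it directly (a rough sketch; the model path is the one from the configs above and the LoRA settings are illustrative):

# Hedged sketch: report the trainable vs. frozen parameter split of a LoRA-wrapped model.
# With LoRA/QLoRA only the adapter weights train; the base weights stay frozen.
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

base = AutoModelForCausalLM.from_pretrained('./local/llama-2-7b-hf/')  # path from the configs above
peft_model = get_peft_model(base, LoraConfig(r=64, lora_alpha=16, task_type='CAUSAL_LM'))
peft_model.print_trainable_parameters()
# prints something like: trainable params: 33,554,432 || all params: ~6.7B || trainable%: ~0.5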

yuqie commented 1 year ago

Thanks for your answer!

One more question: I changed the stage from ZeRO-2 to ZeRO-0 (meaning no ZeRO strategy) and ran the QLoRA fine-tuning; the results, including GPU memory and time consumption, are the same as with ZeRO-2. I also tried ZeRO-1, and the results are the same as with ZeRO-2, too.

LZHgrla commented 1 year ago

In our experiments (InternLM-20B + QLoRA) with a single A100 GPU, ZeRO-2 achieves a 25% acceleration and reduces the memory requirement from 29GB to 24GB.

You can paste your configs and commands, and I can test it!

yuqie commented 1 year ago

I get reasonable results if I delete the deepspeed parameter line, but if I pass the ZeRO-1 or ZeRO-0 JSON file to the deepspeed parameter, I do not get reasonable results: the GPU memory and time consumption are the same as with ZeRO-2.

I tried the HF framework using the following command:

NPROC_PER_NODE=8 xtuner train llama2_7b_qlora_alpaca_hf.py 

The script llama2_7b_qlora_alpaca_hf.py:

# Copyright (c) OpenMMLab. All rights reserved.
import torch
from datasets import load_dataset
from peft import LoraConfig
from transformers import (AutoModelForCausalLM, AutoTokenizer, Trainer,
                          BitsAndBytesConfig, TrainingArguments)

from xtuner.dataset import process_hf_dataset
from xtuner.dataset.map_fns import alpaca_map_fn, template_map_fn_factory
from xtuner.utils import PROMPT_TEMPLATE

framework = 'huggingface'

pretrained_model_name_or_path = './local/llama-2-7b-hf/'
dataset_name_or_path = './local/data/'
max_length = 2048
pack_to_max_length = True
prompt_template = PROMPT_TEMPLATE.alpaca

trainer = Trainer

deepspeed_config='./zero0.json'

training_args = dict(
    type=TrainingArguments,
    do_train=True,
    learning_rate=3e-4,
    weight_decay=0,
    lr_scheduler_type='cosine',
    warmup_steps=100,
    per_device_train_batch_size=8,
    gradient_accumulation_steps=16,
    num_train_epochs=1,
    fp16=True,
    logging_steps=1,
    optim='adamw_torch',
    save_strategy='steps',
    save_steps=1000,
    save_total_limit=2,
    # deepspeed=deepspeed_config,  # with this line commented out the results are reasonable; passing './zero0.json' gives the strange result
    ddp_find_unused_parameters=False)

tokenizer = dict(
    type=AutoTokenizer.from_pretrained,
    pretrained_model_name_or_path=pretrained_model_name_or_path,
    trust_remote_code=True,
    padding_side='right')

model = dict(
    type=AutoModelForCausalLM.from_pretrained,
    pretrained_model_name_or_path=pretrained_model_name_or_path,
    trust_remote_code=True,
    torch_dtype=torch.float16,
    quantization_config=dict(
        type=BitsAndBytesConfig,
        load_in_4bit=True,
        load_in_8bit=False,
        llm_int8_threshold=6.0,
        llm_int8_has_fp16_weight=False,
        bnb_4bit_compute_dtype=torch.float16,
        bnb_4bit_use_double_quant=True,
        bnb_4bit_quant_type='nf4'))

lora = dict(
    type=LoraConfig,
    r=16,
    lora_alpha=16,
    lora_dropout=0.05,
    target_modules=['gate_proj', 'down_proj', 'up_proj'],
    bias='none',
    task_type='CAUSAL_LM')

train_dataset = dict(
    type=process_hf_dataset,
    dataset=dict(type=load_dataset, path=dataset_name_or_path),
    tokenizer=tokenizer,
    max_length=max_length,
    dataset_map_fn=alpaca_map_fn,
    template_map_fn=dict(
        type=template_map_fn_factory, template=prompt_template),
    remove_unused_columns=True,
    shuffle_before_pack=True,
    pack_to_max_length=pack_to_max_length)

zero2.json

{
  "gradient_accumulation_steps": "auto",
  "train_micro_batch_size_per_gpu": "auto",
  "gradient_clipping": "auto",
  "zero_allow_untested_optimizer": true,
  "zero_optimization": {
    "stage": 2,
    "contiguous_gradients": false,
    "allgather_bucket_size": 1e8,
    "reduce_bucket_size": 1e8,
    "overlap_comm": true,
    "reduce_scatter": true
  },
  "fp16": {
    "enabled": true,
    "initial_scale_power": 16
  }
}

zero1.json

{
  "gradient_accumulation_steps": "auto",
  "train_micro_batch_size_per_gpu": "auto",
  "gradient_clipping": "auto",
  "zero_allow_untested_optimizer": true,
  "zero_optimization": {
    "stage": 1,
    "contiguous_gradients": false,
    "allgather_bucket_size": 1e8,
    "reduce_bucket_size": 1e8,
    "overlap_comm": true,
    "reduce_scatter": true
  },
  "fp16": {
    "enabled": true,
    "initial_scale_power": 16
  }
}

zero0.json


  "gradient_accumulation_steps": "auto",
  "train_micro_batch_size_per_gpu": "auto",
  "zero_optimization": {
        "stage": 0
    },
  "fp16": {
       "enabled": true,
       "auto_cast": true,
       "initial_scale_power": 16,
       "loss_scale_window": 1000
    }
}
LZHgrla commented 1 year ago

Yes, these results meet the expectation!

In fact, while fine-tuning with LoRA/QLoRA, the acceleration and memory reduction are mainly introduced by DeepSpeed fp16/bf16, not ZeRO.

One reason is that with LoRA/QLoRA there are only a few trainable parameters, so communication between GPUs is not the bottleneck.
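
A rough back-of-the-envelope supports this (all numbers are assumptions: Llama2-7B with 32 layers, hidden size 4096, LoRA r=64 on the default q_proj/v_proj targets):

# Hedged estimate of the trainable-parameter memory that ZeRO would shard under QLoRA.
layers, hidden, r = 32, 4096, 64                       # assumed Llama2-7B dims, LoRA rank from the config above
lora_params = layers * 2 * (hidden * r + r * hidden)   # q_proj + v_proj, A and B matrices: ~33.6M params
optim_state_bytes = lora_params * 12                   # fp32 master copy + Adam m and v
grad_bytes = lora_params * 2                           # fp16 gradients
print(f"LoRA params ~{lora_params / 1e6:.1f}M, "
      f"optimizer state ~{optim_state_bytes / 2**30:.2f} GiB, grads ~{grad_bytes / 2**20:.0f} MiB")
# Sharding well under 1 GiB of optimizer state across GPUs barely changes the per-card
# footprint, so the fp16/bf16 engine, not the ZeRO stage, provides most of the savings.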

yuqie commented 1 year ago

I know that for QLoRA the GPU memory consumption for parameters and gradients is small. It is still strange that the results for "zero0" are so different from running without DeepSpeed.

I think there must be something wrong but I don't know the key.

LZHgrla commented 1 year ago

Without DeepSpeed fp16/bf16, the default setting uses PyTorch AMP, which is less efficient. I think that is where the difference comes from.

yuqie commented 1 year ago

Hi @LZHgrla, I also encounter this error (ValueError: DeepSpeed Zero-3 is not compatible with `low_cpu_mem_usage=True` or with passing a `device_map`) when full-parameter fine-tuning llama2 on the alpaca dataset, so QLoRA is not the reason.

The script I use:

# Copyright (c) OpenMMLab. All rights reserved.
import torch
from datasets import load_dataset
from peft import LoraConfig
from transformers import (AutoModelForCausalLM, AutoTokenizer, Trainer,
                          BitsAndBytesConfig, TrainingArguments)

from xtuner.dataset import process_hf_dataset
from xtuner.dataset.map_fns import alpaca_map_fn, template_map_fn_factory
from xtuner.utils import PROMPT_TEMPLATE

framework = 'huggingface'

pretrained_model_name_or_path = '/llama-2-7b-hf/'
dataset_name_or_path = '/data/'
max_length = 256
pack_to_max_length = True
prompt_template = PROMPT_TEMPLATE.alpaca
deepspeed_config='zero3.json'

trainer = Trainer

training_args = dict(
    type=TrainingArguments,
    do_train=True,
    learning_rate=3e-4,
    weight_decay=0,
    lr_scheduler_type='cosine',
    warmup_steps=100,
    per_device_train_batch_size=8,
    gradient_accumulation_steps=2,
    num_train_epochs=1,
    fp16=True,
    logging_steps=1,
    optim='adamw_torch',
    save_strategy='steps',
    save_steps=1000,
    save_total_limit=2,
    deepspeed =deepspeed_config,
    ddp_find_unused_parameters=False)

tokenizer = dict(
    type=AutoTokenizer.from_pretrained,
    pretrained_model_name_or_path=pretrained_model_name_or_path,
    trust_remote_code=True,
    padding_side='right')

model = dict(
    type=AutoModelForCausalLM.from_pretrained,
    pretrained_model_name_or_path=pretrained_model_name_or_path,
    trust_remote_code=True,
    torch_dtype=torch.float16,)

train_dataset = dict(
    type=process_hf_dataset,
    dataset=dict(type=load_dataset, path=dataset_name_or_path),
    tokenizer=tokenizer,
    max_length=max_length,
    dataset_map_fn=alpaca_map_fn,
    template_map_fn=dict(
        type=template_map_fn_factory, template=prompt_template),
    remove_unused_columns=True,
    shuffle_before_pack=True,
    pack_to_max_length=pack_to_max_length)

zero3.json:

{
  "gradient_accumulation_steps": "auto",
  "train_micro_batch_size_per_gpu": "auto",
  "gradient_clipping": "auto",
  "zero_allow_untested_optimizer": true,
  "zero_force_ds_cpu_optimizer": false,
  "zero_optimization": {
    "stage": 3,
    "contiguous_gradients": false,
    "allgather_bucket_size": 3e8,
    "reduce_bucket_size": 3e8,
    "overlap_comm": true,
    "reduce_scatter": true,
    "stage3_gather_16bit_weights_on_model_save": true
  },
  "low_cpu_mem_usage": false,
  "fp16": {
    "enabled": true,
    "initial_scale_power": 16
  }
}

with the following versions:

deepspeed       0.11.1
transformers    4.33.0
peft            0.5.0
torch           2.1.0
CUDA Version    12.2
Python          3.10.13
LZHgrla commented 1 year ago

@yuqie Hi, I have found the problem!

To easily solve this error, you can remove this line https://github.com/InternLM/xtuner/blob/b1af677639d41df2956e46bd31836cde7cb133bb/xtuner/tools/train.py#L114

and add "train_batch_size": "auto", to your deepspeed config (see https://github.com/huggingface/accelerate/pull/2060).
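
In the same diff style as above, the addition to the custom zero3.json is:

  "gradient_accumulation_steps": "auto",
  "train_micro_batch_size_per_gpu": "auto",
+ "train_batch_size": "auto",
  "gradient_clipping": "auto",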

We will fix it asap!

LZHgrla commented 1 year ago

https://github.com/InternLM/xtuner/pull/164

yuqie commented 1 year ago

@yuqie Hi, I have found the problem!

To easily solve this error, you can remove this line

https://github.com/InternLM/xtuner/blob/b1af677639d41df2956e46bd31836cde7cb133bb/xtuner/tools/train.py#L114

and add "train_batch_size": "auto", to your deepspeed config (see huggingface/accelerate#2060).

We will fix it asap!

Thanks for your reply @LZHgrla. That works! By the way, do you have any test data comparing ZeRO-2 and ZeRO-3? I fine-tuned llama2-7b with full parameters under ZeRO-2 and ZeRO-3, and found the GPU memory consumption decreased by only ~2GB. For a 7B model in FP16 on 8 GPU cards, the per-GPU footprint should decrease by about 7B × 2 Bytes × 7/8 ≈ 12GB when changing the strategy from ZeRO-2 to ZeRO-3.

Following are the DeepSpeed config files for ZeRO-2 and ZeRO-3:

{
  "gradient_accumulation_steps": "auto",
  "train_micro_batch_size_per_gpu": "auto",
  "gradient_clipping": "auto",
  "zero_allow_untested_optimizer": true,
  "zero_optimization": {
    "stage": 2,
    "contiguous_gradients": false,
    "allgather_bucket_size": 1e8,
    "reduce_bucket_size": 1e8,
    "overlap_comm": true,
    "reduce_scatter": true
  },
  "fp16": {
    "enabled": true,
    "initial_scale_power": 16
  }
}
{
  "gradient_accumulation_steps": "auto",
  "train_micro_batch_size_per_gpu": "auto",
  "train_batch_size": "auto",
  "gradient_clipping": "auto",
  "zero_allow_untested_optimizer": true,
  "zero_force_ds_cpu_optimizer": false,
  "zero_optimization": {
    "stage": 3,
    "contiguous_gradients": false,
    "allgather_bucket_size": 1e8,
    "reduce_bucket_size": 1e8,
    "overlap_comm": true,
    "reduce_scatter": true,
    "stage3_gather_16bit_weights_on_model_save": true
  },
  "low_cpu_mem_usage": false,
  "fp16": {
    "enabled": true,
    "initial_scale_power": 16
  }
}
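
For reference, the arithmetic behind the ~12GB expectation, spelled out (a rough sketch that only counts the fp16 parameter copy and ignores activations, optimizer state, and communication buckets):

# Hedged estimate of per-GPU fp16 parameter memory under ZeRO-2 vs. ZeRO-3.
params, bytes_per_param, n_gpus = 7e9, 2, 8   # 7B model, fp16 weights, 8 cards
full_copy = params * bytes_per_param          # ZeRO-2: every GPU keeps all fp16 params
sharded = full_copy / n_gpus                  # ZeRO-3: fp16 params partitioned across GPUs
print(f"per-GPU fp16 params: {full_copy / 2**30:.1f} GiB -> {sharded / 2**30:.1f} GiB, "
      f"expected drop ~{(full_copy - sharded) / 2**30:.1f} GiB")
# 13.0 GiB -> 1.6 GiB, i.e. roughly the ~12GB drop estimated above, versus the ~2GB observed.
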
LZHgrla commented 1 year ago

https://www.deepspeed.ai/tutorials/zero/#zero-overview

Stage 1: The optimizer states (e.g., for Adam optimizer, 32-bit weights, and the first, and second moment estimates) are partitioned across the processes, so that each process updates only its partition.

Stage 2: The reduced 32-bit gradients for updating the model weights are also partitioned such that each process retains only the gradients corresponding to its portion of the optimizer states.

Stage 3: The 16-bit model parameters are partitioned across the processes. ZeRO-3 will automatically collect and partition them during the forward and backward passes.

@yuqie I think this small reduction in memory may be because ZeRO-3 automatically configures the sharding strategy. But in fact, I currently have no idea how to evaluate it.

I also see the same small reduction (~2GB) during full-parameter fine-tuning of Llama2-7B with ZeRO-2 and ZeRO-3. But I find that when fine-tuning Llama2-7B with LoRA, the memory reduction becomes larger: 19GB --> 11GB. Maybe we need to delve into DeepSpeed's paper, blog, and code to figure out the specific reasons.