Closed yuqie closed 1 year ago
@yuqie Hi! ~If without deepspeed, you should change the AMP optimizer to the normal one:~
Full-parameter fine-tuning relies on DeepSpeed's fp16/bf16 training optimization, so it must be launched with DeepSpeed.
On a single card, you can try deepspeed_zero2 or deepspeed_zero2_offload; on multiple cards, you can try deepspeed_zero3 or deepspeed_zero3_offload.
Moreover, our DeepSpeed JSON config defaults to fp16; you can change it to bf16 for more stable training:
- "fp16": {
+ "bf16": {
"enabled": true,
"initial_scale_power": 16
}
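The same change can be applied programmatically. This is a minimal sketch, assuming the config is a plain DeepSpeed JSON dict like the ones in this thread; the helper name switch_fp16_to_bf16 is hypothetical and is not part of xtuner or DeepSpeed. It keeps the section body unchanged, as the diff does.

```python
import copy
import json


def switch_fp16_to_bf16(ds_config: dict) -> dict:
    """Return a copy of a DeepSpeed config with the "fp16" section renamed to "bf16".

    Hypothetical helper mirroring the diff above; the section body is kept as-is.
    """
    new_config = copy.deepcopy(ds_config)
    if "fp16" in new_config:
        new_config["bf16"] = new_config.pop("fp16")
    return new_config


# Small example config (a subset of the configs posted in this thread).
example = {
    "zero_optimization": {"stage": 2},
    "fp16": {"enabled": True, "initial_scale_power": 16},
}
converted = switch_fp16_to_bf16(example)
print(json.dumps(converted, indent=2))
```

The input dict is left untouched, so the fp16 variant stays available for comparison runs.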
@LZHgrla Thanks for your answer.
When I tried DeepSpeed ZeRO-3 on multiple cards to fine-tune Llama 2 on Alpaca, I encountered the following error. low_cpu_mem_usage is set to false in the config file, and I didn't pass a device_map parameter in TrainingArguments or when loading the model. Is it related to the version of deepspeed, transformers, or peft?
Traceback (most recent call last):
File "/root/anaconda3/envs/xtuner-env/lib/python3.10/site-packages/xtuner/tools/train.py", line 246, in <module>
main()
File "/root/anaconda3/envs/xtuner-env/lib/python3.10/site-packages/xtuner/tools/train.py", line 116, in main
model = BUILDER.build(cfg.model)
File "/root/anaconda3/envs/xtuner-env/lib/python3.10/site-packages/mmengine/registry/registry.py", line 570, in build
return self.build_func(cfg, *args, **kwargs, registry=self)
File "/root/anaconda3/envs/xtuner-env/lib/python3.10/site-packages/mmengine/registry/build_functions.py", line 121, in build_from_cfg
obj = obj_cls(**args) # type: ignore
File "/root/anaconda3/envs/xtuner-env/lib/python3.10/site-packages/transformers/models/auto/auto_factory.py", line 563, in from_pretrained
return model_class.from_pretrained(
File "/root/anaconda3/envs/xtuner-env/lib/python3.10/site-packages/transformers/modeling_utils.py", line 2445, in from_pretrained
raise ValueError(
ValueError: DeepSpeed Zero-3 is not compatible with `low_cpu_mem_usage=True` or with passing a `device_map`.
@yuqie
This is strange. Could you please paste the versions of deepspeed and transformers, and the training command?
Hi, the versions are:
deepspeed 0.11.1
transformers 4.33.0
peft 0.5.0
torch 2.1.0
CUDA Version 12.2
Python 3.10.13
I tried both the HF framework and MMEngine, with the following commands:
NPROC_PER_NODE=8 xtuner train llama2_7b_qlora_alpaca_hf.py
NPROC_PER_NODE=8 xtuner train llama2_7b_qlora_alpaca.py --deepspeed ./zero3.json
llama2_7b_qlora_alpaca_hf.py
# Copyright (c) OpenMMLab. All rights reserved.
import torch
from datasets import load_dataset
from peft import LoraConfig
from transformers import (AutoModelForCausalLM, AutoTokenizer, Trainer,
BitsAndBytesConfig, TrainingArguments)
from xtuner.dataset import process_hf_dataset
from xtuner.dataset.map_fns import alpaca_map_fn, template_map_fn_factory
from xtuner.utils import PROMPT_TEMPLATE
framework = 'huggingface'
pretrained_model_name_or_path = './local/llama-2-7b-hf/'
dataset_name_or_path = './local/data/'
max_length = 2048
pack_to_max_length = True
prompt_template = PROMPT_TEMPLATE.alpaca
trainer = Trainer
deepspeed_config='./zero3.json'
training_args = dict(
type=TrainingArguments,
do_train=True,
learning_rate=3e-4,
weight_decay=0,
lr_scheduler_type='cosine',
warmup_steps=100,
per_device_train_batch_size=8,
gradient_accumulation_steps=16,
num_train_epochs=1,
fp16=True,
logging_steps=1,
optim='adamw_torch',
save_strategy='steps',
save_steps=1000,
save_total_limit=2,
deepspeed =deepspeed_config,
ddp_find_unused_parameters=False)
tokenizer = dict(
type=AutoTokenizer.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
padding_side='right')
model = dict(
type=AutoModelForCausalLM.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
torch_dtype=torch.float16,
quantization_config=dict(
type=BitsAndBytesConfig,
load_in_4bit=True,
load_in_8bit=False,
llm_int8_threshold=6.0,
llm_int8_has_fp16_weight=False,
bnb_4bit_compute_dtype=torch.float16,
bnb_4bit_use_double_quant=True,
bnb_4bit_quant_type='nf4'))
lora = dict(
type=LoraConfig,
r=16,
lora_alpha=16,
lora_dropout=0.05,
target_modules=['gate_proj', 'down_proj', 'up_proj'],
bias='none',
task_type='CAUSAL_LM')
train_dataset = dict(
type=process_hf_dataset,
dataset=dict(type=load_dataset, path=dataset_name_or_path),
tokenizer=tokenizer,
max_length=max_length,
dataset_map_fn=alpaca_map_fn,
template_map_fn=dict(
type=template_map_fn_factory, template=prompt_template),
remove_unused_columns=True,
shuffle_before_pack=True,
pack_to_max_length=pack_to_max_length)
llama2_7b_qlora_alpaca.py
# Copyright (c) OpenMMLab. All rights reserved.
import torch
from bitsandbytes.optim import PagedAdamW32bit
from datasets import load_dataset
from mmengine.dataset import DefaultSampler
from mmengine.hooks import (CheckpointHook, DistSamplerSeedHook, IterTimerHook,
LoggerHook, ParamSchedulerHook)
from mmengine.optim import AmpOptimWrapper, CosineAnnealingLR
from peft import LoraConfig
from transformers import (AutoModelForCausalLM, AutoTokenizer,
BitsAndBytesConfig)
from xtuner.dataset import process_hf_dataset
from xtuner.dataset.collate_fns import default_collate_fn
from xtuner.dataset.map_fns import alpaca_map_fn, template_map_fn_factory
from xtuner.engine import DatasetInfoHook, EvaluateChatHook
from xtuner.model import SupervisedFinetune
from xtuner.utils import PROMPT_TEMPLATE
#######################################################################
# PART 1 Settings #
#######################################################################
# Model
pretrained_model_name_or_path = './local/llama-2-7b-hf/'
# Data
alpaca_en_path = './local/data/'
prompt_template = PROMPT_TEMPLATE.alpaca
max_length = 2048
pack_to_max_length = True
# Scheduler & Optimizer
batch_size = 8 # per_device
accumulative_counts = 16
dataloader_num_workers = 0
max_epochs = 1
optim_type = PagedAdamW32bit
lr = 2e-4
betas = (0.9, 0.999)
weight_decay = 0
max_norm = 1 # grad clip
# Evaluate the generation performance during the training
evaluation_freq = 50
evaluation_inputs = [
'请给我介绍五个上海的景点', 'Please tell me five scenic spots in Shanghai'
]
#######################################################################
# PART 2 Model & Tokenizer #
#######################################################################
tokenizer = dict(
type=AutoTokenizer.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
padding_side='right')
model = dict(
type=SupervisedFinetune,
llm=dict(
type=AutoModelForCausalLM.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
torch_dtype=torch.float16,
quantization_config=dict(
type=BitsAndBytesConfig,
load_in_4bit=True,
load_in_8bit=False,
llm_int8_threshold=6.0,
llm_int8_has_fp16_weight=False,
bnb_4bit_compute_dtype=torch.float16,
bnb_4bit_use_double_quant=True,
bnb_4bit_quant_type='nf4')),
lora=dict(
type=LoraConfig,
r=64,
lora_alpha=16,
lora_dropout=0.1,
bias='none',
task_type='CAUSAL_LM'))
#######################################################################
# PART 3 Dataset & Dataloader #
#######################################################################
alpaca_en = dict(
type=process_hf_dataset,
dataset=dict(type=load_dataset, path=alpaca_en_path),
tokenizer=tokenizer,
max_length=max_length,
dataset_map_fn=alpaca_map_fn,
template_map_fn=dict(
type=template_map_fn_factory, template=prompt_template),
remove_unused_columns=True,
shuffle_before_pack=True,
pack_to_max_length=pack_to_max_length)
train_dataloader = dict(
batch_size=batch_size,
num_workers=dataloader_num_workers,
dataset=alpaca_en,
sampler=dict(type=DefaultSampler, shuffle=True),
collate_fn=dict(type=default_collate_fn))
#######################################################################
# PART 4 Scheduler & Optimizer #
#######################################################################
# optimizer
optim_wrapper = dict(
type=AmpOptimWrapper,
optimizer=dict(
type=optim_type, lr=lr, betas=betas, weight_decay=weight_decay),
clip_grad=dict(max_norm=max_norm, error_if_nonfinite=False),
accumulative_counts=accumulative_counts,
loss_scale='dynamic',
dtype='float16')
# learning policy
# More information: https://github.com/open-mmlab/mmengine/blob/main/docs/en/tutorials/param_scheduler.md # noqa: E501
param_scheduler = dict(
type=CosineAnnealingLR,
eta_min=lr * 0.1,
by_epoch=True,
T_max=max_epochs,
convert_to_iter_based=True)
# train, val, test setting
train_cfg = dict(by_epoch=True, max_epochs=max_epochs, val_interval=1)
#######################################################################
# PART 5 Runtime #
#######################################################################
# Log the dialogue periodically during the training process, optional
custom_hooks = [
dict(type=DatasetInfoHook, tokenizer=tokenizer),
dict(
type=EvaluateChatHook,
tokenizer=tokenizer,
every_n_iters=evaluation_freq,
evaluation_inputs=evaluation_inputs,
instruction=prompt_template.INSTRUCTION_START)
]
# configure default hooks
default_hooks = dict(
# record the time of every iteration.
timer=dict(type=IterTimerHook),
# print log every 10 iterations.
logger=dict(type=LoggerHook, interval=10),
# enable the parameter scheduler.
param_scheduler=dict(type=ParamSchedulerHook),
# save checkpoint per epoch.
checkpoint=dict(type=CheckpointHook, interval=1),
# set sampler seed in distributed environment.
sampler_seed=dict(type=DistSamplerSeedHook),
)
# configure environment
env_cfg = dict(
# whether to enable cudnn benchmark
cudnn_benchmark=False,
# set multi process parameters
mp_cfg=dict(mp_start_method='fork', opencv_num_threads=0),
# set distributed parameters
dist_cfg=dict(backend='nccl'),
)
# set visualizer
visualizer = None
# set log level
log_level = 'INFO'
# load from which checkpoint
load_from = None
# whether to resume training from the loaded checkpoint
resume = False
# Defaults to use random seed and disable `deterministic`
randomness = dict(seed=None, deterministic=False)
zero3.json
{
"gradient_accumulation_steps": "auto",
"train_micro_batch_size_per_gpu": "auto",
"gradient_clipping": "auto",
"zero_allow_untested_optimizer": true,
"zero_force_ds_cpu_optimizer": false,
"zero_optimization": {
"stage": 3,
"contiguous_gradients": false,
"allgather_bucket_size": 3e8,
"reduce_bucket_size": 3e8,
"overlap_comm": true,
"reduce_scatter": true,
"stage3_gather_16bit_weights_on_model_save": true
},
"low_cpu_mem_usage": false,
"fp16": {
"enabled": true,
"initial_scale_power": 16
}
}
@yuqie It seems that you are fine-tuning with QLoRA rather than doing full-parameter fine-tuning. For QLoRA, DeepSpeed only supports the ZeRO-2 setting. You can try DeepSpeed ZeRO-2.
Yes, I tried QLoRA with DeepSpeed ZeRO-3.
As I understand it, QLoRA only supports ZeRO-2, or running without any DeepSpeed strategy. What about full-parameter fine-tuning or LoRA: does LoRA support other DeepSpeed strategies? And apart from ZeRO-2, ZeRO-3, ZeRO-2 offload, and ZeRO-3 offload, does full-parameter fine-tuning support ZeRO-1?
Full-parameter fine-tuning supports all strategies. LoRA and QLoRA have frozen parameters, so they do not support ZeRO-3.
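The compatibility described in this answer can be written down as a small lookup. This is only a sketch of what the thread states at the time of writing, not an xtuner or DeepSpeed API; the names SUPPORTED_ZERO_STAGES and is_supported are hypothetical, and stages 0 and 1 for LoRA/QLoRA are included because they are run later in this thread.

```python
# Summary of the claims in this thread (not an official compatibility table).
SUPPORTED_ZERO_STAGES = {
    "full": {0, 1, 2, 3},   # full-parameter fine-tuning: all strategies
    "lora": {0, 1, 2},      # frozen parameters: no ZeRO-3
    "qlora": {0, 1, 2},     # frozen parameters: no ZeRO-3
}


def is_supported(method: str, zero_stage: int) -> bool:
    """Return True if the thread says `method` works with the given ZeRO stage."""
    if method not in SUPPORTED_ZERO_STAGES:
        raise ValueError(f"unknown fine-tuning method: {method!r}")
    return zero_stage in SUPPORTED_ZERO_STAGES[method]


for method in ("full", "lora", "qlora"):
    stages = sorted(SUPPORTED_ZERO_STAGES[method])
    print(f"{method}: ZeRO stages {stages}")
```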
Thanks for your answer!
One more question: I changed the stage from ZeRO-2 to ZeRO-0 (that is, no DeepSpeed ZeRO strategy) and ran QLoRA fine-tuning, and the results, including GPU memory and time consumption, were the same as with ZeRO-2. I also tried ZeRO-1, and the results were the same as with ZeRO-2, too.
In our experiments (InternLM-20B + QLoRA), with a single A100 GPU, zero-2 can achieve 25% acceleration and reduce the memory requirements from 29GB to 24GB.
You can paste your configs and commands, and I can test it!
I get reasonable results if I delete the line with the deepspeed parameter. But if I pass the ZeRO-1 or ZeRO-0 JSON file to the deepspeed parameter, I cannot obtain reasonable results: the GPU memory and time consumption are the same as with ZeRO-2.
I tried the HF framework using the following command:
NPROC_PER_NODE=8 xtuner train llama2_7b_qlora_alpaca_hf.py
The llama2_7b_qlora_alpaca_hf.py script:
# Copyright (c) OpenMMLab. All rights reserved.
import torch
from datasets import load_dataset
from peft import LoraConfig
from transformers import (AutoModelForCausalLM, AutoTokenizer, Trainer,
BitsAndBytesConfig, TrainingArguments)
from xtuner.dataset import process_hf_dataset
from xtuner.dataset.map_fns import alpaca_map_fn, template_map_fn_factory
from xtuner.utils import PROMPT_TEMPLATE
framework = 'huggingface'
pretrained_model_name_or_path = './local/llama-2-7b-hf/'
dataset_name_or_path = './local/data/'
max_length = 2048
pack_to_max_length = True
prompt_template = PROMPT_TEMPLATE.alpaca
trainer = Trainer
deepspeed_config='./zero0.json'
training_args = dict(
type=TrainingArguments,
do_train=True,
learning_rate=3e-4,
weight_decay=0,
lr_scheduler_type='cosine',
warmup_steps=100,
per_device_train_batch_size=8,
gradient_accumulation_steps=16,
num_train_epochs=1,
fp16=True,
logging_steps=1,
optim='adamw_torch',
save_strategy='steps',
save_steps=1000,
save_total_limit=2,
# deepspeed=deepspeed_config,  # with this line commented out, the results are reasonable; passing './zero0.json' gives a strange result
ddp_find_unused_parameters=False)
tokenizer = dict(
type=AutoTokenizer.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
padding_side='right')
model = dict(
type=AutoModelForCausalLM.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
torch_dtype=torch.float16,
quantization_config=dict(
type=BitsAndBytesConfig,
load_in_4bit=True,
load_in_8bit=False,
llm_int8_threshold=6.0,
llm_int8_has_fp16_weight=False,
bnb_4bit_compute_dtype=torch.float16,
bnb_4bit_use_double_quant=True,
bnb_4bit_quant_type='nf4'))
lora = dict(
type=LoraConfig,
r=16,
lora_alpha=16,
lora_dropout=0.05,
target_modules=['gate_proj', 'down_proj', 'up_proj'],
bias='none',
task_type='CAUSAL_LM')
train_dataset = dict(
type=process_hf_dataset,
dataset=dict(type=load_dataset, path=dataset_name_or_path),
tokenizer=tokenizer,
max_length=max_length,
dataset_map_fn=alpaca_map_fn,
template_map_fn=dict(
type=template_map_fn_factory, template=prompt_template),
remove_unused_columns=True,
shuffle_before_pack=True,
pack_to_max_length=pack_to_max_length)
zero2.json
{
"gradient_accumulation_steps": "auto",
"train_micro_batch_size_per_gpu": "auto",
"gradient_clipping": "auto",
"zero_allow_untested_optimizer": true,
"zero_optimization": {
"stage": 2,
"contiguous_gradients": false,
"allgather_bucket_size": 1e8,
"reduce_bucket_size": 1e8,
"overlap_comm": true,
"reduce_scatter": true
},
"fp16": {
"enabled": true,
"initial_scale_power": 16
}
}
zero1.json
{
"gradient_accumulation_steps": "auto",
"train_micro_batch_size_per_gpu": "auto",
"gradient_clipping": "auto",
"zero_allow_untested_optimizer": true,
"zero_optimization": {
"stage": 1,
"contiguous_gradients": false,
"allgather_bucket_size": 1e8,
"reduce_bucket_size": 1e8,
"overlap_comm": true,
"reduce_scatter": true
},
"fp16": {
"enabled": true,
"initial_scale_power": 16
}
}
zero0.json
{
"gradient_accumulation_steps": "auto",
"train_micro_batch_size_per_gpu": "auto",
"zero_optimization": {
"stage": 0
},
"fp16": {
"enabled": true,
"auto_cast": true,
"initial_scale_power": 16,
"loss_scale_window": 1000
}
}
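When comparing stages, it can help to generate the configs from one base so that the ZeRO stage is the only variable. This is a minimal sketch under that assumption; the helper make_zero_config is hypothetical, and the shared fields are copied from zero1.json and zero2.json above (the zero0.json posted in this thread uses a slightly different fp16 section).

```python
import copy
import json

# Fields shared by every variant (taken from zero1.json / zero2.json above).
BASE_CONFIG = {
    "gradient_accumulation_steps": "auto",
    "train_micro_batch_size_per_gpu": "auto",
    "gradient_clipping": "auto",
    "zero_allow_untested_optimizer": True,
    "fp16": {"enabled": True, "initial_scale_power": 16},
}

# ZeRO options used for stages 1 and 2 in this thread.
ZERO_OPTIONS = {
    "contiguous_gradients": False,
    "allgather_bucket_size": 1e8,
    "reduce_bucket_size": 1e8,
    "overlap_comm": True,
    "reduce_scatter": True,
}


def make_zero_config(stage: int) -> dict:
    """Build a DeepSpeed config dict that differs from its siblings only in the ZeRO section."""
    if stage not in (0, 1, 2):
        raise ValueError("this sketch only covers ZeRO stages 0, 1 and 2")
    config = copy.deepcopy(BASE_CONFIG)
    zero_section = {"stage": stage}
    if stage > 0:
        zero_section.update(ZERO_OPTIONS)
    config["zero_optimization"] = zero_section
    return config


configs = {stage: make_zero_config(stage) for stage in (0, 1, 2)}
for stage, config in configs.items():
    text = json.dumps(config, indent=2)
    json.loads(text)  # round-trip check: fails loudly on malformed JSON
    print(f"zero{stage}.json")
    print(text)
```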
Yes, these results are as expected!
In fact, when fine-tuning with LoRA/QLoRA, the speedup and memory reduction come mainly from DeepSpeed fp16/bf16, not from ZeRO.
One reason is that with LoRA/QLoRA there are only a few trainable parameters, so communication between GPUs is not a bottleneck.
I know that for QLoRA the GPU memory consumed by parameters and gradients is small. It is still weird that the results for "zero0" differ so much from the run without DeepSpeed.
I think there must be something wrong, but I can't pinpoint it.
Without DeepSpeed fp16/bf16, the default is PyTorch AMP, which is less efficient. I think that is where the difference comes from.
Hi @LZHgrla, I also encounter this error with full-parameter fine-tuning of Llama 2 on the Alpaca dataset, so QLoRA is not the cause.
The script I use:
# Copyright (c) OpenMMLab. All rights reserved.
import torch
from datasets import load_dataset
from peft import LoraConfig
from transformers import (AutoModelForCausalLM, AutoTokenizer, Trainer,
BitsAndBytesConfig, TrainingArguments)
from xtuner.dataset import process_hf_dataset
from xtuner.dataset.map_fns import alpaca_map_fn, template_map_fn_factory
from xtuner.utils import PROMPT_TEMPLATE
framework = 'huggingface'
pretrained_model_name_or_path = '/llama-2-7b-hf/'
dataset_name_or_path = '/data/'
max_length = 256
pack_to_max_length = True
prompt_template = PROMPT_TEMPLATE.alpaca
deepspeed_config='zero3.json'
trainer = Trainer
training_args = dict(
type=TrainingArguments,
do_train=True,
learning_rate=3e-4,
weight_decay=0,
lr_scheduler_type='cosine',
warmup_steps=100,
per_device_train_batch_size=8,
gradient_accumulation_steps=2,
num_train_epochs=1,
fp16=True,
logging_steps=1,
optim='adamw_torch',
save_strategy='steps',
save_steps=1000,
save_total_limit=2,
deepspeed =deepspeed_config,
ddp_find_unused_parameters=False)
tokenizer = dict(
type=AutoTokenizer.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
padding_side='right')
model = dict(
type=AutoModelForCausalLM.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
torch_dtype=torch.float16,)
train_dataset = dict(
type=process_hf_dataset,
dataset=dict(type=load_dataset, path=dataset_name_or_path),
tokenizer=tokenizer,
max_length=max_length,
dataset_map_fn=alpaca_map_fn,
template_map_fn=dict(
type=template_map_fn_factory, template=prompt_template),
remove_unused_columns=True,
shuffle_before_pack=True,
pack_to_max_length=pack_to_max_length)
zero3.json:
{
"gradient_accumulation_steps": "auto",
"train_micro_batch_size_per_gpu": "auto",
"gradient_clipping": "auto",
"zero_allow_untested_optimizer": true,
"zero_force_ds_cpu_optimizer": false,
"zero_optimization": {
"stage": 3,
"contiguous_gradients": false,
"allgather_bucket_size": 3e8,
"reduce_bucket_size": 3e8,
"overlap_comm": true,
"reduce_scatter": true,
"stage3_gather_16bit_weights_on_model_save": true
},
"low_cpu_mem_usage": false,
"fp16": {
"enabled": true,
"initial_scale_power": 16
}
}
with the following versions:
deepspeed 0.11.1
transformers 4.33.0
peft 0.5.0
torch 2.1.0
CUDA Version 12.2
Python 3.10.13
@yuqie Hi, I have found the problem!
As a quick fix, you can remove this line: https://github.com/InternLM/xtuner/blob/b1af677639d41df2956e46bd31836cde7cb133bb/xtuner/tools/train.py#L114
and add "train_batch_size": "auto", to your DeepSpeed config (see https://github.com/huggingface/accelerate/pull/2060).
We will fix it asap!
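The config half of this workaround can be scripted. This is a minimal sketch, assuming the DeepSpeed config is a JSON file on disk; the helper add_train_batch_size_auto is hypothetical, and the demo writes to a temporary directory instead of a real zero3.json.

```python
import json
import tempfile
from pathlib import Path


def add_train_batch_size_auto(config_path: Path) -> dict:
    """Add "train_batch_size": "auto" to a DeepSpeed JSON config file if it is missing."""
    config = json.loads(config_path.read_text())
    config.setdefault("train_batch_size", "auto")  # keep any value already set
    config_path.write_text(json.dumps(config, indent=2) + "\n")
    return config


# Demo on a throwaway file standing in for zero3.json.
with tempfile.TemporaryDirectory() as tmp_dir:
    path = Path(tmp_dir) / "zero3.json"
    path.write_text(json.dumps({"zero_optimization": {"stage": 3}}))
    patched = add_train_batch_size_auto(path)
    reloaded = json.loads(path.read_text())

print(patched)
```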
Thanks for your reply, @LZHgrla. That works! By the way, do you have any test data comparing ZeRO-2 and ZeRO-3? I fine-tuned full-parameter Llama2-7B with ZeRO-2 and ZeRO-3 and found that GPU memory consumption decreased by only ~2GB. For a 7B model in FP16 on 8 GPU cards, I would expect the per-GPU footprint to decrease by 7 × 7B × 2 Bytes / 8 ≈ 12GB when switching from ZeRO-2 to ZeRO-3.
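The estimate above can be checked with a few lines of arithmetic. This sketch assumes "7B" means 7e9 parameters, 2 bytes per FP16 parameter, and 1 GB = 1e9 bytes; it only counts the FP16 parameter copy that ZeRO-3 partitions, nothing else.

```python
num_params = 7e9        # assumption: "7B" = 7e9 parameters
bytes_per_param = 2     # FP16
num_gpus = 8

# ZeRO-2 keeps a full FP16 copy of the parameters on every GPU.
full_copy_gb = num_params * bytes_per_param / 1e9
# ZeRO-3 partitions that copy, so each GPU holds 1/num_gpus of it.
sharded_gb = full_copy_gb / num_gpus
expected_saving_gb = full_copy_gb - sharded_gb

print(f"full copy per GPU:   {full_copy_gb:.2f} GB")
print(f"sharded per GPU:     {sharded_gb:.2f} GB")
print(f"expected saving/GPU: {expected_saving_gb:.2f} GB")
```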
The DeepSpeed config files for ZeRO-2 and ZeRO-3 are as follows:
{
"gradient_accumulation_steps": "auto",
"train_micro_batch_size_per_gpu": "auto",
"gradient_clipping": "auto",
"zero_allow_untested_optimizer": true,
"zero_optimization": {
"stage": 2,
"contiguous_gradients": false,
"allgather_bucket_size": 1e8,
"reduce_bucket_size": 1e8,
"overlap_comm": true,
"reduce_scatter": true
},
"fp16": {
"enabled": true,
"initial_scale_power": 16
}
}
{
"gradient_accumulation_steps": "auto",
"train_micro_batch_size_per_gpu": "auto",
"train_batch_size": "auto",
"gradient_clipping": "auto",
"zero_allow_untested_optimizer": true,
"zero_force_ds_cpu_optimizer": false,
"zero_optimization": {
"stage": 3,
"contiguous_gradients": false,
"allgather_bucket_size": 1e8,
"reduce_bucket_size": 1e8,
"overlap_comm": true,
"reduce_scatter": true,
"stage3_gather_16bit_weights_on_model_save": true
},
"low_cpu_mem_usage": false,
"fp16": {
"enabled": true,
"initial_scale_power": 16
}
}
https://www.deepspeed.ai/tutorials/zero/#zero-overview
Stage 1: The optimizer states (e.g., for Adam optimizer, 32-bit weights, and the first, and second moment estimates) are partitioned across the processes, so that each process updates only its partition.
Stage 2: The reduced 32-bit gradients for updating the model weights are also partitioned such that each process retains only the gradients corresponding to its portion of the optimizer states.
Stage 3: The 16-bit model parameters are partitioned across the processes. ZeRO-3 will automatically collect and partition them during the forward and backward passes.
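A rough per-stage estimate of model-state memory follows from the stage descriptions. This sketch uses the accounting from the ZeRO paper for mixed-precision Adam as an assumption: 2 bytes per parameter for FP16 weights, 2 bytes for FP16 gradients, and 12 bytes for FP32 optimizer states. It ignores activations, temporary buffers, and fragmentation, so measured GPU memory will differ; the function name is hypothetical.

```python
def zero_model_state_gb(num_params: float, num_gpus: int, stage: int) -> float:
    """Estimate per-GPU model-state memory (GB, 1e9 bytes) for a given ZeRO stage."""
    if stage not in (0, 1, 2, 3):
        raise ValueError("ZeRO stage must be 0, 1, 2 or 3")
    params = 2.0 * num_params   # FP16 weights
    grads = 2.0 * num_params    # FP16 gradients
    optim = 12.0 * num_params   # FP32 weights + Adam first and second moments
    if stage >= 1:
        optim /= num_gpus       # stage 1: partition optimizer states
    if stage >= 2:
        grads /= num_gpus       # stage 2: also partition gradients
    if stage >= 3:
        params /= num_gpus      # stage 3: also partition parameters
    return (params + grads + optim) / 1e9


for stage in range(4):
    estimate = zero_model_state_gb(7e9, 8, stage)
    print(f"ZeRO-{stage}: {estimate:.2f} GB of model states per GPU")
```

Only model states are covered here; for a 2048-token packed batch, activation memory can be a large share of what the GPU actually reports.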
@yuqie I think this small reduction in memory may be because ZeRO-3 automatically configures the sharding strategy. But in fact, I currently have no idea how to evaluate it.
I also see the small reduction (~2GB) during full-parameter fine-tuning of Llama2-7B with ZeRO-2 versus ZeRO-3. But when fine-tuning Llama2-7B with LoRA, the memory reduction becomes large: 19GB --> 11GB. Maybe we need to delve into DeepSpeed's paper, blog, and code to figure out the specific reasons.
I modified llama2_7b_full_wizardlm_e1_copy.py to use alpaca_dataset and added the parameter torch_dtype=torch.float16 to model loading, as follows:
If I run the script with DeepSpeed it works, while if I run it without DeepSpeed I get the error:
The versions are as below:
I use an 80G A800 and llama2-7b. If torch_dtype=torch.float16 is deleted, OOM happens. Could anyone help me with this problem? Any suggestion would be appreciated.