stringing opened this issue 1 month ago
Please provide a YAML file to help us reproduce the issue, thanks. Also, please try to set cfg.eval.count_flops to False. When it is enabled, Torch's garbage collection may not run in time, which can result in OOM. If you want to get an exact GPU memory reading, please call torch.cuda.empty_cache() first.
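For example, you can check the real usage between rounds with something like this (just a rough sketch, not part of FederatedScope):

import gc
import torch


def log_gpu_memory(tag, device=0):
    # Run Python GC and release cached CUDA blocks before measuring, so the
    # number reflects live tensors rather than PyTorch's caching allocator.
    gc.collect()
    torch.cuda.empty_cache()
    allocated = torch.cuda.memory_allocated(device) / 1024 ** 3
    reserved = torch.cuda.memory_reserved(device) / 1024 ** 3
    print(f'[{tag}] allocated={allocated:.2f} GiB, reserved={reserved:.2f} GiB')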
Thank you for your response. Here is my YAML configuration:
use_gpu: True
device: 0
early_stop:
  patience: 10
federate:
  mode: standalone
  client_num: 100
  sample_client_num: 5
  total_round_num: 10
  save_to: "/root/autodl-tmp/model/finetuned_codellama13b/codellama13b.ckpt"
  share_local_model: True
  online_aggr: False
data:
  root: /root/autodl-tmp/data/TutorCode
  type: 'TutorCode.json@llm'
  splits: [0.98, 0.01, 0.01]
  splitter: 'iid'
dataloader:
  batch_size: 1
llm:
  tok_len: 2048
  chat:
    max_len: 2048
  adapter:
    use: True
    # Quantized LoRA hyperparameters
    args: [ { 'adapter_package': 'peft', 'adapter_method': 'qlora', 'r': 32, 'lora_alpha': 32, 'lora_dropout': 0.05, 'load_in_4bit': True, 'bnb_4bit_quant_type': 'nf4', 'bnb_4bit_compute_dtype': True, 'bnb_4bit_use_double_quant': True, 'module_int4': True } ]
    mv_to_cpu: True
model:
  type: 'codellama/CodeLlama-13b-Instruct-hf@huggingface_llm'
train:
  local_update_steps: 30
  batch_or_epoch: batch
  is_enable_half: False
  optimizer:
    type: 'AdamW'
    lr: 0.0001
    weight_decay: 0.00
criterion:
  type: CrossEntropyLoss
trainer:
  type: llmtrainer
eval:
  freq: 20
  metrics: ['loss']
  count_flops: False
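(For the record, I launch training in the usual way, i.e. python federatedscope/main.py --cfg my_qlora_codellama.yaml, where the YAML file name is just whatever I saved the config above as.)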
These are the federatedscope/llm/model/model_builder.py and adapter_builder.py files that I modified to enable quantized LoRA. First, model_builder.py:
from federatedscope.llm.model.adapter_builder import AdapterModel


def get_model_from_huggingface(model_name, config):
    """
    Load a causal language model from the HuggingFace transformers library.

    Args:
        model_name (str): The name of the pre-trained model to load.
        config (Config): The configuration object that contains the model
            parameters.

    Returns:
        AutoModelForCausalLM: A causal language model object.
    """
    from transformers import AutoModelForCausalLM

    kwargs = {}
    if len(config.llm.cache.model):
        kwargs['cache_dir'] = config.llm.cache.model

    args = config.llm.adapter.args[0]
    if args.get('adapter_method', 'lora') in ['qlora', 'qlora-pissa']:
        from transformers import BitsAndBytesConfig
        import torch

        # Read the compute-dtype flag once before popping it, so the same
        # value is used for both the quantization config and torch_dtype.
        use_bf16 = args.pop('bnb_4bit_compute_dtype', True)
        bnb_config = BitsAndBytesConfig(
            load_in_4bit=args.pop('load_in_4bit', True),
            bnb_4bit_quant_type=args.pop('bnb_4bit_quant_type', 'nf4'),
            bnb_4bit_compute_dtype=torch.bfloat16 if use_bf16 else torch.float32,
            bnb_4bit_use_double_quant=args.pop('bnb_4bit_use_double_quant',
                                               True),
        )
        kwargs['quantization_config'] = bnb_config
        if use_bf16:
            kwargs['torch_dtype'] = torch.bfloat16
        kwargs['device_map'] = f'cuda:{config.device}'

    return AutoModelForCausalLM.from_pretrained(model_name, **kwargs)
def get_model_from_modelscope(model_name, config):
    """
    Load a causal language model from the ModelScope models library.

    Args:
        model_name (str): The name of the pre-trained model to load.
        config (Config): The configuration object that contains the model
            parameters.

    Returns:
        Model: A causal language model object.
    """
    from modelscope import AutoModelForCausalLM

    kwargs = {}
    if len(config.llm.cache.model):
        kwargs['cache_dir'] = config.llm.cache.model

    return AutoModelForCausalLM.from_pretrained(model_name, **kwargs)


def get_llm(config):
    """
    Get a causal language model based on the configuration.

    Args:
        config (Config): The configuration object that contains the model
            parameters.

    Returns:
        AdapterModel: A causal language model object with optional adapter
            layers.
    """
    from federatedscope.llm.dataloader import get_tokenizer

    model_config = config.model
    model_name, model_hub = model_config.type.split('@')
    if model_hub == 'huggingface_llm':
        model = get_model_from_huggingface(model_name=model_name,
                                           config=config)
    elif model_hub == 'modelscope_llm':
        model = get_model_from_modelscope(model_name=model_name,
                                          config=config)
    else:
        raise NotImplementedError(f'Not support LLM {model_name} in'
                                  f' {model_hub}.')

    # Resize LLM model based on settings
    tokenizer, num_new_tokens = \
        get_tokenizer(model_name, config.data.root, config.llm.tok_len,
                      model_hub)
    model.resize_token_embeddings(len(tokenizer))
    if num_new_tokens > 0:
        input_embeddings = model.get_input_embeddings().weight.data
        output_embeddings = model.get_output_embeddings().weight.data

        input_embeddings_avg = input_embeddings[:-num_new_tokens].mean(
            dim=0, keepdim=True)
        output_embeddings_avg = output_embeddings[:-num_new_tokens].mean(
            dim=0, keepdim=True)

        input_embeddings[-num_new_tokens:] = input_embeddings_avg
        output_embeddings[-num_new_tokens:] = output_embeddings_avg

    args = config.llm.adapter.args[0] if len(
        config.llm.adapter.args[0]) > 0 else {}
    model = AdapterModel(model, use_adapter=config.llm.adapter.use, **args)

    return model
And this is the modified adapter_builder.py:

import torch
import torch.nn as nn
from collections import OrderedDict

from federatedscope.llm.MyUtils.qlora_util import find_all_linear_names


def enable_adapter(model, package, adapter, **kwargs):
    """
    Enables an adapter for a given model and package.

    Args:
        model: A pre-trained model from the HuggingFace Transformers library.
        package: A string indicating the name of the package that provides
            the adapter. Currently, only 'peft' and 'adapterhub' are
            supported.
        adapter: A string indicating the name of the adapter to enable. The
            available adapters depend on the package.
        **kwargs: Additional keyword arguments that are passed to the
            adapter configuration.

    Returns:
        A model object that has the adapter enabled.

    Raises:
        NotImplementedError: If the package or the adapter is not supported.
    """
    adapter = adapter.lower()
    if package == 'peft':
        """
        PEFT: https://github.com/huggingface/peft
        Support methods:
            LoRA
            Prefix Tuning
            P-Tuning
            Prompt Tuning
            AdaLoRA
        """
        from peft import get_peft_model, TaskType
        if adapter == 'lora':
            from peft import LoraConfig
            peft_config = LoraConfig(task_type=TaskType.CAUSAL_LM, **kwargs)
            model = get_peft_model(model, peft_config)
        elif adapter == 'prefix':
            from peft import PrefixTuningConfig
            peft_config = PrefixTuningConfig(task_type=TaskType.CAUSAL_LM,
                                             **kwargs)
            model = get_peft_model(model, peft_config)
        elif adapter == 'prompt':
            from peft import PromptTuningConfig
            peft_config = PromptTuningConfig(task_type=TaskType.CAUSAL_LM,
                                             **kwargs)
            model = get_peft_model(model, peft_config)
        elif adapter == 'p-tuning':
            from peft import PromptEncoderConfig
            peft_config = PromptEncoderConfig(task_type=TaskType.CAUSAL_LM,
                                              **kwargs)
            model = get_peft_model(model, peft_config)
        elif adapter == 'qlora':
            from peft import prepare_model_for_kbit_training, LoraConfig
            # Prepare the 4-bit quantized model for training, then attach
            # LoRA modules to all linear layers found by the helper.
            model = prepare_model_for_kbit_training(model)
            peft_config = LoraConfig(
                r=kwargs.pop('r', 32),
                lora_alpha=kwargs.pop('lora_alpha', 16),
                target_modules=find_all_linear_names(
                    model, int4=kwargs.pop('module_int4', True)),
                lora_dropout=kwargs.pop('lora_dropout', 0.05),
                bias="none",
                task_type=TaskType.CAUSAL_LM,
            )
            model = get_peft_model(model, peft_config)
        elif adapter == 'qlora-pissa':
            from peft import LoraConfig, PeftModel, \
                prepare_model_for_kbit_training
            # Load a pre-initialized PiSSA adapter and keep the embeddings
            # and LM head frozen.
            model = prepare_model_for_kbit_training(model)
            model = PeftModel.from_pretrained(
                model,
                kwargs.pop('adapter_path', ''),
                subfolder=kwargs.pop('subfolder', 'pissa_init'),
                is_trainable=True)
            for name, params in model.named_parameters():
                if "embed_tokens" in name or "lm_head" in name:
                    params.requires_grad = False
        else:
            raise NotImplementedError
        model.print_trainable_parameters()
    elif package == 'adapterhub':
        """
        AdapterHub: https://docs.adapterhub.ml/model_overview.html
        Support methods:
            Bottleneck Adapters
            Prefix Tuning
            LoRA
            Compacter
            Adapter Fusion
            Invertible Adapters
            Parallel block
        """
        # TODO: After supporting adapterhub, we will move the following
        #  parameters into the yaml file for users' convenience
        if adapter == 'lora':
            from transformers.adapters import LoRAConfig

            config = LoRAConfig(r=8, alpha=16)
            model.add_adapter("lora_adapter", config=config)
            model.train_adapter(['lora_adapter'])
        elif adapter == 'bottleneck':
            from transformers.adapters import AdapterConfig

            config = AdapterConfig(mh_adapter=True,
                                   output_adapter=True,
                                   reduction_factor=16,
                                   non_linearity="relu")
            model.add_adapter("bottleneck_adapter", config=config)
            model.train_adapter(['bottleneck_adapter'])
        elif adapter == 'lang':
            from transformers.adapters import PfeifferInvConfig

            config = PfeifferInvConfig()
            model.add_adapter("lang_adapter", config=config)
            model.train_adapter(['lang_adapter'])
        elif adapter == 'prefix':
            from transformers.adapters import PrefixTuningConfig

            config = PrefixTuningConfig(flat=False, prefix_length=30)
            model.add_adapter("prefix_tuning", config=config)
            model.train_adapter(['prefix_tuning'])
        elif adapter == 'compacter':
            from transformers.adapters import CompacterConfig

            config = CompacterConfig()
            model.add_adapter("dummy", config=config)
            model.train_adapter(['dummy'])
        elif adapter == 'ia_3':
            from transformers.adapters import IA3Config

            config = IA3Config()
            model.add_adapter("ia3_adapter", config=config)
            model.train_adapter(['ia3_adapter'])
        elif adapter == 'union':
            from transformers.adapters import AdapterConfig, ConfigUnion

            # TODO: configure these args in cfg
            config = ConfigUnion(
                AdapterConfig(mh_adapter=True,
                              output_adapter=False,
                              reduction_factor=16,
                              non_linearity="relu"),
                AdapterConfig(mh_adapter=False,
                              output_adapter=True,
                              reduction_factor=2,
                              non_linearity="relu"),
            )
            model.add_adapter("union_adapter", config=config)
            model.train_adapter(['union_adapter'])
        elif adapter == 'mam':
            from transformers.adapters import \
                ConfigUnion, ParallelConfig, PrefixTuningConfig

            config = ConfigUnion(
                PrefixTuningConfig(bottleneck_size=800),
                ParallelConfig(),
            )
            model.add_adapter("mam_adapter", config=config)
            model.train_adapter(['mam_adapter'])
        else:
            raise NameError(
                f"There is no adapter named {adapter} in {package}")
    else:
        raise NotImplementedError
    return model
class AdapterModel(nn.Module):
    """
    A wrapper class for a model that can use adapters for fine-tuning.

    This class inherits from torch.nn.Module and implements a wrapper for a
    model that can optionally use adapters for fine-tuning. Adapters are small
    modules that can be inserted between the layers of a pretrained model and
    trained on a specific task, while keeping the original parameters frozen.
    This class can use different adapter packages and methods, such as PEFT
    and LoRA. It also provides methods for saving and loading the model state
    dict, as well as generating text using the model.

    Attributes:
        model: A torch.nn.Module object that represents the original or
            adapted model.
    """
    def __init__(self, model, use_adapter=False, *args, **kwargs):
        """
        Initializes the wrapper with the given model and arguments.

        Args:
            model: A torch.nn.Module object that represents the original
                model.
            use_adapter: A boolean indicating whether to use adapters for
                fine-tuning. Default is False.
            *args: Additional positional arguments to pass to the adapter
                package or method.
            **kwargs: Additional keyword arguments to pass to the adapter
                package or method. These may include adapter_package,
                adapter_method, etc.
        """
        super().__init__()

        self.model = None
        if use_adapter:
            adapter_package = kwargs.pop('adapter_package', 'peft')
            adapter_method = kwargs.pop('adapter_method', 'lora')

            self.model = enable_adapter(model, adapter_package,
                                        adapter_method, **kwargs)
        else:
            self.model = model

    def forward(self, *args, **kwargs):
        """
        Calls the forward method of the wrapped model.

        Args:
            *args: Positional arguments to pass to the model's forward method.
            **kwargs: Keyword arguments to pass to the model's forward method.

        Returns:
            The output of the model's forward method.
        """
        return self.model.forward(*args, **kwargs)

    def generate(self, *args, **kwargs):
        """
        Calls the generate method of the wrapped model.

        Args:
            *args: Positional arguments to pass to the model's generate
                method.
            **kwargs: Keyword arguments to pass to the model's generate
                method.

        Returns:
            The output of the model's generate method.
        """
        try:
            res = self.model.generate(*args, **kwargs)
        except RuntimeError as e:
            # When evaluating in HELM, half precision may cause a
            # RuntimeError; removing `do_sample` works around it.
            if 'do_sample' in kwargs.keys():
                del kwargs['do_sample']
                res = self.model.generate(*args, **kwargs)
            else:
                raise RuntimeError(e)
        return res

    def state_dict(self, return_trainable=True, *args, **kwargs):
        """
        Returns the state dict of the wrapped model.

        Args:
            return_trainable: A boolean indicating whether to return only the
                trainable parameters of the model. Default is True.
            *args: Additional positional arguments to pass to the model's
                state_dict method.
            **kwargs: Additional keyword arguments to pass to the model's
                state_dict method.

        Returns:
            A dictionary containing the state dict of the model. If
            return_trainable is True, only the parameters that require grad
            are included. Otherwise, all parameters are included.
        """
        if return_trainable:
            return self.get_trainable_state_dict()
        else:
            return self.model.state_dict(*args, **kwargs)

    def load_state_dict(self, state_dict, strict=False):
        """
        Loads the state dict into the wrapped model.

        Args:
            state_dict: A dictionary containing the state dict to load into
                the model.
            strict: A boolean indicating whether to strictly enforce that the
                keys in state_dict match the keys returned by this module's
                state_dict() function. Default is False.
        """
        return self.model.load_state_dict(state_dict, strict=False)

    def get_trainable_state_dict(self):
        """
        Returns only the trainable parameters of the wrapped model.

        This method can be used to get only the parameters that require grad,
        such as adapters or task-specific layers.

        Returns:
            A dictionary containing the state dict of the trainable parameters
            of the model.
        """
        grad_params = []
        for name, param in self.model.named_parameters():
            if param.requires_grad:
                grad_params.append(name)
        model_state_dict = self.model.state_dict()
        new_state_dict = OrderedDict()
        for k, v in model_state_dict.items():
            if k in grad_params:
                new_state_dict[k] = v
        return new_state_dict

    def save_model(self, path, state=0):
        """
        Saves the model state dict and the current round to a file.

        Args:
            path: A string representing the file path to save the model to.
            state: An integer representing the current round of training or
                evaluation. Default is 0.
        """
        ckpt = {'cur_round': state, 'model': self.model.state_dict()}
        torch.save(ckpt, path)

    # TODO: Fix `__getattr__`
    # def __getattr__(self, item):
    #     return getattr(self.model, item)

    def save_pretrained(self, path):
        """
        Saves the pretrained model to a directory.

        Args:
            path: A string representing the directory path to save the model
                to.
        """
        self.model.save_pretrained(path)
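For reference, find_all_linear_names in my federatedscope/llm/MyUtils/qlora_util.py is just the usual QLoRA helper that collects the names of the (4-bit) linear layers to use as LoRA target modules, roughly like this (a simplified sketch; the exact filtering in my file may differ):

import bitsandbytes as bnb
import torch.nn as nn


def find_all_linear_names(model, int4=True):
    # Collect the short names of all linear layers (4-bit ones if int4=True)
    # so they can be passed to LoraConfig as target_modules.
    linear_cls = bnb.nn.Linear4bit if int4 else nn.Linear
    lora_module_names = set()
    for name, module in model.named_modules():
        if isinstance(module, linear_cls):
            # 'model.layers.0.self_attn.q_proj' -> 'q_proj'
            lora_module_names.add(name.split('.')[-1])
    # lm_head is usually excluded from the LoRA targets.
    lora_module_names.discard('lm_head')
    return list(lora_module_names)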
By the way, the "save_to" attribute does not seem to really save the model when an adapter is used: the save_model function in federatedscope/core/aggregators/aggregator.py only saves the raw weights:
def save_model(self, path, cur_round=-1):
    assert self.model is not None

    ckpt = {'cur_round': cur_round, 'model': self.model.state_dict()}
    torch.save(ckpt, path)
Probably it is better to use self.model.save_pretrained in this case, since it saves the fine-tuned adapter weights together with the configuration files.
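For example, something along these lines (just a sketch of the idea, not a tested patch; it assumes the aggregated model is the AdapterModel wrapper shown above, which exposes save_pretrained):

import os
import torch


def save_model(self, path, cur_round=-1):
    assert self.model is not None
    if hasattr(self.model, 'save_pretrained'):
        # Save the fine-tuned adapter weights plus the PEFT config files
        # into a directory next to the checkpoint path.
        self.model.save_pretrained(os.path.splitext(path)[0] + '_adapter')
    # Keep the original checkpoint with the current round for resuming.
    ckpt = {'cur_round': cur_round, 'model': self.model.state_dict()}
    torch.save(ckpt, path)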
Thank you very much!
Sorry, I just accidentally clicked the close issue button...
I encountered the same problem. On the dolly dataset, the GPU memory usage increased by about 2.5 GB each time a client trained, so my 24 GB 4090 machine could only support about 8 clients and could not even complete a single round.
Algorithm 1 in the appendix of your article indicates that, at the beginning of each training round, the client computes the latest model by accumulating the (seed, gradient) values. However, in the code you provided, the model update is performed by the server after the (seed, gradient) values of all clients are collected. When each client is about to train, server.model is deepcopied and passed to the client, which then executes "self.model = pulled_model" and "self.model.to(self.device)". Is this the reason for the excessive GPU memory usage? Is the code version you provided incorrect?
However, when I set the number of clients to 1 and the sampling rate to 1, the memory usage does not fluctuate much regardless of how many rounds of training I perform.
Yes, especially for larger models. I've tried torch.cuda.empty_cache() and gc.collect(); they helped somewhat but did not really solve the problem. Like you mentioned, it might have something to do with the training process. Please let me know if you figure out how to fix it. Thanks! ^^
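What I tried looks roughly like this, called at the end of each client's local training round (the hook placement and the ctx.model attribute are my own guesses about where it belongs, not an official FederatedScope API):

import gc
import torch


def release_client_gpu_memory(trainer):
    # Push the client's local copy of the model back to the CPU once its
    # local update for this round is done, then drop cached CUDA blocks.
    trainer.ctx.model.to('cpu')
    gc.collect()
    torch.cuda.empty_cache()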
I even think this is related to PyTorch's memory release mechanism, or maybe to the machine. If you find a good solution, please share it, thank you.
By chance, I ran the code on another 4-GPU 4090 machine and the GPU memory problem disappeared: the stale memory was reclaimed in time and the model trained normally. So this problem seems to have nothing to do with the code; it may be related to hardware settings, driver versions, etc. Good luck.
That's great. But I got this problem on a 4090 as well. Which driver version are you using? I'll give it a try :)
The GPU memory usage keeps increasing after each round while fine-tuning an LLM with an adapter, and the increment after each round is approximately the same. I suspect it is because new clients join in each round and bring new model parameters. I have already set share_local_model and llm.adapter.mv_to_cpu to True, which should move the adapter to the CPU after each round, so why does the GPU memory still increase? I'd appreciate it if anyone could help me with this issue. Thanks in advance!
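(For reference, this is the small check I run right after aggregation to see what is actually still resident on the GPU; it is only a diagnostic sketch, not part of FederatedScope:)

from collections import Counter

import torch


def summarize_param_devices(model):
    # Count parameters per device to verify whether the adapter weights were
    # really moved back to the CPU after the round finished.
    print(dict(Counter(str(p.device) for p in model.parameters())))
    print(f'allocated: {torch.cuda.memory_allocated() / 1024 ** 3:.2f} GiB, '
          f'reserved: {torch.cuda.memory_reserved() / 1024 ** 3:.2f} GiB')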