albertoperdomo2 opened this issue 1 month ago
I think the first thing that is confusing me is that, if you are keeping everything else constant and only increasing the number of GPUs, we do not expect memory consumption to increase. Can you post the arguments you are using for the experiments with different numbers of GPUs?
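As a back-of-envelope illustration of that expectation (assumptions: FSDP-style full sharding of fp16 base weights, roughly 7B parameters; activations, gradients, and LoRA adapters are ignored, so the numbers are illustrative only):

```python
# Back-of-envelope: per-GPU memory held by the *base weights* when they are
# fully sharded across ranks (assumptions: ~7.2B params in fp16, full sharding;
# activations, gradients, and optimizer state are ignored here).
def sharded_weight_gib(num_params: float, bytes_per_param: float, num_gpus: int) -> float:
    return num_params * bytes_per_param / num_gpus / 1024**3

for n in (1, 2, 4, 8):
    print(f"{n} GPU(s) -> ~{sharded_weight_gib(7.2e9, 2, n):.1f} GiB of weights per GPU")
```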
@fabianlim these are the configs we used for all the tests. The only variable across runs was the number of GPUs (see the sketch after this first config for how that was varied).

QLoRA configuration (GPTQ-quantized model):
```json
{
  "training_data_path": "/mnt/output/dataset.json",
  "model_name_or_path": "/mnt/storage/model/mistral-7b-v0.3-gptq",
  "response_template": " ### Label:",
  "output_dir": "/mnt/output/fine-tuning",
  "save_model_dir": "/mnt/output/save-model-dir",
  "accelerate_launch_args": {
    "num_processes": 1,
    "num_machines": 1,
    "mixed_precision": "no",
    "dynamo_backend": "no",
    "downcast_bf16": "no",
    "main_training_function": "main",
    "rdzv_backend": "static",
    "same_network": true,
    "tpu_use_sudo": false
  },
  "num_train_epochs": 1,
  "per_device_train_batch_size": 4,
  "per_device_eval_batch_size": 4,
  "gradient_accumulation_steps": 4,
  "eval_strategy": "no",
  "save_strategy": "no",
  "learning_rate": 0.00001,
  "weight_decay": 0,
  "lr_scheduler_type": "cosine",
  "max_seq_length": 1024,
  "include_tokens_per_second": true,
  "dataset_text_field": "output",
  "use_flash_attn": true,
  "auto_gptq": [
    "triton_v2"
  ],
  "fp16": true,
  "gradient_checkpointing": true,
  "lora_alpha": 16,
  "max_steps": -1,
  "packing": false,
  "peft_method": "lora",
  "r": 4,
  "target_modules": [
    "all-linear"
  ],
  "torch_dtype": "float16",
  "warmup_ratio": 0.03
}
```
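To make that concrete, here is a hypothetical helper (not part of fms-hf-tuning; the file names and GPU counts are examples) that derives one config per run from the base config above, varying only `accelerate_launch_args.num_processes`:

```python
# Hypothetical helper: generate one config per GPU count from the base config,
# changing only accelerate_launch_args.num_processes.
import copy
import json

def make_configs(base_config: dict, gpu_counts=(1, 2, 4, 8)) -> dict:
    configs = {}
    for n in gpu_counts:
        cfg = copy.deepcopy(base_config)
        cfg["accelerate_launch_args"]["num_processes"] = n
        configs[n] = cfg
    return configs

if __name__ == "__main__":
    # "qlora_base_config.json" stands in for the JSON shown above.
    with open("qlora_base_config.json") as f:
        base = json.load(f)
    for n, cfg in make_configs(base).items():
        with open(f"qlora_config_{n}gpu.json", "w") as f:
            json.dump(cfg, f, indent=2)
```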
LoRA configuration (full-precision Mistral-7B-v0.3); again, the only variable was the number of GPUs:
```json
{
  "training_data_path": "/mnt/output/dataset.json",
  "model_name_or_path": "/mnt/storage/model/mistral-7b-v0.3",
  "response_template": " ### Label:",
  "output_dir": "/mnt/output/fine-tuning",
  "save_model_dir": "/mnt/output/save-model-dir",
  "accelerate_launch_args": {
    "num_processes": 1,
    "num_machines": 1,
    "mixed_precision": "no",
    "dynamo_backend": "no",
    "downcast_bf16": "no",
    "main_training_function": "main",
    "rdzv_backend": "static",
    "same_network": true,
    "tpu_use_sudo": false
  },
  "num_train_epochs": 1,
  "per_device_train_batch_size": 4,
  "per_device_eval_batch_size": 4,
  "gradient_accumulation_steps": 4,
  "eval_strategy": "no",
  "save_strategy": "no",
  "learning_rate": 0.00001,
  "weight_decay": 0,
  "lr_scheduler_type": "cosine",
  "max_seq_length": 1024,
  "include_tokens_per_second": true,
  "dataset_text_field": "output",
  "use_flash_attn": true,
  "gradient_checkpointing": true,
  "lora_alpha": 16,
  "max_steps": -1,
  "packing": false,
  "peft_method": "lora",
  "r": 4,
  "target_modules": [
    "all-linear"
  ],
  "warmup_ratio": 0.03
}
```
@anhuong do you know why the FSDP sharding strategy is not specified in accelerate_launch_args? Does it fall back to a default?
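For reference, a minimal check against accelerate itself (field names vary across accelerate versions, so treat this as a sketch rather than the fms-hf-tuning launch path): when no sharding strategy is specified, the FSDP plugin falls back to FULL_SHARD.

```python
# Sketch: inspect accelerate's FSDP plugin defaults.
from accelerate.utils import FullyShardedDataParallelPlugin

plugin = FullyShardedDataParallelPlugin()
# With nothing specified (and no FSDP_SHARDING_STRATEGY env var set),
# this prints ShardingStrategy.FULL_SHARD.
print(plugin.sharding_strategy)
```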
@albertoperdomo2 if you look at our benches, our settings are quite similar to yours, but you can see that when num_gpus goes up from 1 to 2, the memory consumption decreases.
@fabianlim we have seen this behavior mainly with this particular model pair. We are planning to test other equivalent models, but I wonder whether this particular model itself might be the issue. Do you have results for mistralai/Mistral-7B-v0.3 / mistral-7b-v0.3-gptq?
@albertoperdomo2 no, I'm sorry.
Describe the bug
When validating the fms-hf-tuning v2.0.1 image, we ran our workloads across different GPU counts to review the improvements associated with it. One thing we tried was fine-tuning with LoRA + a full-precision model (in this case mistralai/Mistral-7B-v0.3) and QLoRA + the quantized model (in this case mistral-7b-v0.3-gptq) with the same settings in order to compare the results, and we found that with 8 GPUs the QLoRA GPU memory usage was greater than the LoRA equivalent.
Platform
RHOAI 2.12
Expected behavior
When running LoRA fine-tuning (with a full-precision model) and QLoRA fine-tuning (with the same model but quantized) under the same settings, GPU memory usage is expected to always be lower for QLoRA, since the model parameters are stored at a lower precision.
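For reference, a minimal sketch of how per-rank peak GPU memory could be captured at the end of each run so the LoRA and QLoRA numbers can be compared directly (this helper is hypothetical, not part of fms-hf-tuning):

```python
# Hypothetical helper: report peak GPU memory for the local rank at the end of
# a run, so LoRA and QLoRA runs can be compared on the same basis.
import torch
import torch.distributed as dist

def report_peak_gpu_memory() -> None:
    if not torch.cuda.is_available():
        return
    rank = dist.get_rank() if dist.is_initialized() else 0
    device = torch.cuda.current_device()
    allocated_gib = torch.cuda.max_memory_allocated(device) / 1024**3
    reserved_gib = torch.cuda.max_memory_reserved(device) / 1024**3
    print(f"[rank {rank}] peak allocated: {allocated_gib:.2f} GiB, "
          f"peak reserved: {reserved_gib:.2f} GiB")

if __name__ == "__main__":
    report_peak_gpu_memory()
```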