foundation-model-stack / fms-hf-tuning

🚀 Collection of tuning recipes with HuggingFace SFTTrainer and PyTorch FSDP.
Apache License 2.0

Inconsistent GPU memory usage of QLoRA (vs LoRA) with different numbers of GPUs #373

Open albertoperdomo2 opened 1 month ago

albertoperdomo2 commented 1 month ago

Describe the bug

While validating the fms-hf-tuning v2.0.1 image, we ran our workloads across different GPU counts to review the improvements associated with it. One of the things we tried was fine-tuning with LoRA on a full-precision model (in this case mistralai/Mistral-7B-v0.3) and with QLoRA on the quantized equivalent (mistral-7b-v0.3-gptq), using otherwise identical settings, and we found that with 8 GPUs the QLoRA run used more GPU memory than the equivalent LoRA run.

[Figure: mistral_merged_gpu_total_memory_usage_max]
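
For reference, this is roughly how the peak per-GPU memory could be collected around each run (an illustrative helper based on torch.cuda.max_memory_allocated, not the exact harness we used):

```python
import torch

def peak_gpu_memory_gib() -> float:
    """Max GPU memory (GiB) allocated by this process on any visible device.

    torch.cuda.max_memory_allocated() only tracks PyTorch tensor allocations,
    so each FSDP rank reports its own device; fragmentation and non-PyTorch
    allocations (e.g. NCCL buffers) are not included.
    """
    peaks = [
        torch.cuda.max_memory_allocated(i) / 1024**3
        for i in range(torch.cuda.device_count())
    ]
    return max(peaks) if peaks else 0.0

# Typical usage around a training run:
#   torch.cuda.reset_peak_memory_stats()
#   trainer.train()
#   print(f"peak GPU memory: {peak_gpu_memory_gib():.2f} GiB")
```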

Platform

RHOAI 2.12

Expected behavior

When running LoRA fine-tuning (with a full-precision model) and QLoRA fine-tuning (with the same model, quantized), the GPU memory usage is expected to always be lower for QLoRA, since the base model parameters are stored at a lower precision.
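
As a rough back-of-envelope illustration of that expectation (weights only; activations, LoRA adapter gradients, optimizer state, and CUDA context are ignored, and the ~7.25B parameter count is approximate):

```python
params = 7.25e9  # approximate parameter count of Mistral-7B-v0.3

fp16_weights_gib = params * 2 / 1024**3     # float16: 2 bytes per parameter
gptq4_weights_gib = params * 0.5 / 1024**3  # 4-bit GPTQ: ~0.5 bytes per parameter, ignoring scales/zero-points

print(f"fp16 base weights : ~{fp16_weights_gib:.1f} GiB")   # ~13.5 GiB
print(f"4-bit GPTQ weights: ~{gptq4_weights_gib:.1f} GiB")  # ~3.4 GiB
```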

fabianlim commented 1 month ago

I think the first thing that is confusing me is that, if you keep everything else constant and only increase the number of GPUs, we do not expect memory consumption to increase. Could you post the arguments you are using to run the experiments with the different numbers of GPUs?

albertoperdomo2 commented 1 month ago

@fabianlim these are the configs that we used for all the tests:

QLoRA config

The only variable here was the number of GPUs.

{
  "training_data_path": "/mnt/output/dataset.json",
  "model_name_or_path": "/mnt/storage/model/mistral-7b-v0.3-gptq",
  "response_template": " ### Label:",
  "output_dir": "/mnt/output/fine-tuning",
  "save_model_dir": "/mnt/output/save-model-dir",
  "accelerate_launch_args": {
    "num_processes": 1,
    "num_machines": 1,
    "mixed_precision": "no",
    "dynamo_backend": "no",
    "downcast_bf16": "no",
    "main_training_function": "main",
    "rdzv_backend": "static",
    "same_network": true,
    "tpu_use_sudo": false
  },
  "num_train_epochs": 1,
  "per_device_train_batch_size": 4,
  "per_device_eval_batch_size": 4,
  "gradient_accumulation_steps": 4,
  "eval_strategy": "no",
  "save_strategy": "no",
  "learning_rate": 0.00001,
  "weight_decay": 0,
  "lr_scheduler_type": "cosine",
  "max_seq_length": 1024,
  "include_tokens_per_second": true,
  "dataset_text_field": "output",
  "use_flash_attn": true,
  "auto_gptq": [
    "triton_v2"
  ],
  "fp16": true,
  "gradient_checkpointing": true,
  "lora_alpha": 16,
  "max_steps": -1,
  "packing": false,
  "peft_method": "lora",
  "r": 4,
  "target_modules": [
    "all-linear"
  ],
  "torch_dtype": "float16",
  "warmup_ratio": 0.03
}

LoRA config

The only variable here was the number of GPUs.

{
  "training_data_path": "/mnt/output/dataset.json",
  "model_name_or_path": "/mnt/storage/model/mistral-7b-v0.3",
  "response_template": " ### Label:",
  "output_dir": "/mnt/output/fine-tuning",
  "save_model_dir": "/mnt/output/save-model-dir",
  "accelerate_launch_args": {
    "num_processes": 1,
    "num_machines": 1,
    "mixed_precision": "no",
    "dynamo_backend": "no",
    "downcast_bf16": "no",
    "main_training_function": "main",
    "rdzv_backend": "static",
    "same_network": true,
    "tpu_use_sudo": false
  },
  "num_train_epochs": 1,
  "per_device_train_batch_size": 4,
  "per_device_eval_batch_size": 4,
  "gradient_accumulation_steps": 4,
  "eval_strategy": "no",
  "save_strategy": "no",
  "learning_rate": 0.00001,
  "weight_decay": 0,
  "lr_scheduler_type": "cosine",
  "max_seq_length": 1024,
  "include_tokens_per_second": true,
  "dataset_text_field": "output",
  "use_flash_attn": true,
  "gradient_checkpointing": true,
  "lora_alpha": 16,
  "max_steps": -1,
  "packing": false,
  "peft_method": "lora",
  "r": 4,
  "target_modules": [
    "all-linear"
  ],
  "warmup_ratio": 0.03
}
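
For clarity, the adapter and model settings above should roughly correspond to the following transformers/peft calls (a sketch of my understanding, not the actual code path inside fms-hf-tuning; the model paths are the local ones from the configs):

```python
import torch
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

# Same adapter in both runs: r=4, alpha=16, attached to all linear layers.
lora_config = LoraConfig(
    r=4,
    lora_alpha=16,
    target_modules="all-linear",
    task_type="CAUSAL_LM",
)

# LoRA run: full-precision base model loaded in float16 (use_flash_attn=true).
lora_base = AutoModelForCausalLM.from_pretrained(
    "/mnt/storage/model/mistral-7b-v0.3",
    torch_dtype=torch.float16,
    attn_implementation="flash_attention_2",
)

# QLoRA run: the GPTQ checkpoint carries its own quantization config, so the
# base weights stay 4-bit while the LoRA adapters train in float16.
qlora_base = AutoModelForCausalLM.from_pretrained(
    "/mnt/storage/model/mistral-7b-v0.3-gptq",
    torch_dtype=torch.float16,
    attn_implementation="flash_attention_2",
)

lora_model = get_peft_model(lora_base, lora_config)
qlora_model = get_peft_model(qlora_base, lora_config)
```
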
fabianlim commented 1 month ago

@anhuong do you know why the FSDP sharding strategy is not specified in accelerate_launch_args? Does it fall back to a default?
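
To be explicit about what I mean, with accelerate's Python API one could pin the strategy roughly like this (a sketch; I don't know offhand how accelerate_launch_args maps onto it in this image):

```python
from accelerate import Accelerator, FullyShardedDataParallelPlugin
from torch.distributed.fsdp import ShardingStrategy

# Pin the strategy explicitly instead of relying on accelerate's fallback
# (which, if I recall correctly, is FULL_SHARD when nothing is configured).
fsdp_plugin = FullyShardedDataParallelPlugin(
    sharding_strategy=ShardingStrategy.FULL_SHARD,  # shard params, grads and optimizer state across ranks
)
accelerator = Accelerator(fsdp_plugin=fsdp_plugin)
```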

@albertoperdomo2 if you look at our benches, our settings are quite similar to yours, but you can see that when num_gpus goes up from 1 to 2, the memory consumption decreases:

https://github.com/foundation-model-stack/fms-acceleration/blob/main/scripts/benchmarks/refs/a100_80gb.csv#L102-L103
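
The intuition: with FSDP fully sharding the base weights, the per-GPU share of parameter memory should scale roughly as 1/num_gpus (rough sketch, weights only, activations and adapter state excluded):

```python
params = 7.25e9                # approximate parameter count of Mistral-7B-v0.3
fp16_weight_bytes = params * 2

for num_gpus in (1, 2, 4, 8):
    per_gpu_gib = fp16_weight_bytes / num_gpus / 1024**3
    print(f"{num_gpus} GPU(s) -> ~{per_gpu_gib:.1f} GiB of base weights per device")

# 1 GPU(s) -> ~13.5 GiB
# 2 GPU(s) -> ~6.8 GiB
# 4 GPU(s) -> ~3.4 GiB
# 8 GPU(s) -> ~1.7 GiB
```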

albertoperdomo2 commented 1 month ago

@fabianlim we have seen this behavior mainly with this particular model pair. We are planning to test other equivalent models, but I wonder whether this particular model itself might be the issue. Do you have results for mistralai/Mistral-7B-v0.3 / mistral-7b-v0.3-gptq?

fabianlim commented 1 month ago

@albertoperdomo2 No, I'm sorry.