VinAIResearch / PhoGPT

PhoGPT: Generative Pre-training for Vietnamese (2023)

OutOfMemoryError when Fine-tuning model PhoGPT4B with llm-foundry #28

Closed. VX-Anh closed this issue 4 months ago.

VX-Anh commented 4 months ago

Hello. Thank you very much for your work. I have carefully reviewed and followed the guidelines provided for fine-tuning the PhoGPT-4B model. Nonetheless, despite specifying a maximum sequence length of 2048 and a global training batch size of 1, I am running into out-of-memory errors. My GPU is an RTX 4090 with 24 GB of memory. Do you have any ideas on how to solve this problem? [screenshot of the OOM error]

Wishing you all the best!

datquocnguyen commented 4 months ago

You might want to adjust the fsdp_config values, e.g. activation_checkpointing, in the YAML file. @VX-Anh

datquocnguyen commented 4 months ago

From: https://github.com/mosaicml/llm-foundry/blob/main/TUTORIAL.md#im-running-into-an-out-of-memory-oom-error-what-do-i-do

If OOMs persist with device_train_microbatch_size: 1 and device_eval_batch_size: 1, you may need to use activation checkpointing fsdp_config.activation_checkpointing: true (if you are not already) and, as a last resort, activation CPU offloading fsdp_config.activation_cpu_offload: true.
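Putting these together, the memory-saving settings in the YAML would look roughly like the minimal sketch below (only the relevant excerpt; the keys are the same ones used in the full config further down):

device_train_microbatch_size: 1
device_eval_batch_size: 1

fsdp_config:
  sharding_strategy: FULL_SHARD
  activation_checkpointing: true    # recompute activations during the backward pass instead of storing them
  activation_cpu_offload: true      # last resort: move checkpointed activations to CPU RAM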

And llm-foundry also supports LoRA/PEFT fine-tuning: https://github.com/mosaicml/llm-foundry/blob/main/TUTORIAL.md#can-i-finetune-using-peft--lora

VX-Anh commented 4 months ago

Hello @datquocnguyen. Thanks for your advice. However, I still get the OOM error after trying all of these methods, including enabling activation_checkpointing and activation_cpu_offload and using LoRA.

My config file fine-tuning-phogpt.yaml:

max_seq_len: 2048 # Or 4096.
global_seed: 1337

# Run Name
run_name: fine-tuning-phogpt-4b

# Model
model:
  name: hf_causal_lm
  pretrained: true
  pretrained_model_name_or_path: vinai/PhoGPT-4B
  config_overrides:
    max_seq_len: ${max_seq_len}
    attn_config:
      attn_impl: flash
      alibi: true
      prefix_lm: false
      attn_uses_sequence_id: false
  peft_config:
    r: 16
    peft_type: LORA
    task_type: CAUSAL_LM
    lora_alpha: 32
    lora_dropout: 0.05
    target_modules:
      - Wqkv

# Tokenizer
tokenizer:
  name: vinai/PhoGPT-4B
  kwargs:
    model_max_length: ${max_seq_len}

# Dataloaders
train_loader:
  name: finetuning
  dataset:
    hf_name: /home/PhoGPT/sample_instruction_following_dataset
    split: train
    shuffle: true
    max_seq_len: ${max_seq_len}
    shuffle_seed: ${global_seed}
    decoder_only_format: true
    allow_pad_trimming: false
  drop_last: true
  num_workers: 8

# Optimization
scheduler:
  name: cosine_with_warmup # Or using: linear_decay_with_warmup
  t_warmup: 200ba # To be adjusted, for example: 1/20 the total number of training steps
  alpha_f: 0.1

optimizer:
  name: decoupled_lionw
  lr: 5e-5 # To be adjusted
  betas:
  - 0.9
  - 0.98
  weight_decay: 1e-7

algorithms:
  gradient_clipping:
    clipping_type: norm
    clipping_threshold: 1.0

max_duration: 3ep
eval_interval: 1
eval_first: false
eval_subset_num_batches: -1
global_train_batch_size: 1 # To be adjusted, for example: 16 * the number of GPUs

# System
seed: ${global_seed}
device_eval_batch_size: 1
device_train_microbatch_size: 1
precision: amp_bf16

# FSDP
fsdp_config:
  sharding_strategy: FULL_SHARD
  mixed_precision: PURE
  activation_checkpointing: true
  activation_checkpointing_reentrant: false
  activation_cpu_offload: true
  limit_all_gathers: true
  verbose: false

# Logging
progress_bar: false
log_to_console: true
console_log_interval: 10ba

callbacks:
  speed_monitor:
    window_size: 10
  lr_monitor: {}
  memory_monitor: {}
  runtime_estimator: {}

# Checkpoint to local filesystem or remote object store
save_interval: 1ep
save_num_checkpoints_to_keep: 3  # Important, this cleans up checkpoints saved to DISK
save_folder: /home/PhoGPT/checkpoints

datquocnguyen commented 4 months ago

It's weird. The model takes 7GB of GPU memory when loaded with float16. It should work fine with 24GB of memory when doing (full-weight/LoRA) fine-tuning.

UserWarning: gpu_flop count not found for nvidia geforce rtx 4090 with precision=amp_bf16 so MFU cannot be calculated and reported. gpu_flops_available can be manually overridden by setting gpu_flops_available in SpeedMonitor or nvidia geforce rtx 4090 can be added to GPU_AVAILABLE_FLOPS in composer/callbacks/speed_monitor.py
  self.gpu_flops_available = get_gpu_flops_available(state)

=> You may want to change the value of precision in the YAML file, trying bf16, fp16, float16, or bfloat16 instead of amp_bf16. @VX-Anh
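As a side note, the SpeedMonitor warning above only affects MFU reporting and is unrelated to the OOM error. If you want it to go away, one option is to pass the 4090's peak throughput to the speed_monitor callback. This is a hedged sketch, assuming the callback kwargs in the YAML are forwarded to composer's SpeedMonitor constructor (as window_size already is) and using NVIDIA's advertised dense BF16 tensor-core figure for the RTX 4090, which you should verify against the spec sheet:

callbacks:
  speed_monitor:
    window_size: 10
    gpu_flops_available: 82.6e12  # assumed ~82.6 TFLOPS dense BF16 for an RTX 4090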

VX-Anh commented 4 months ago

Thank you for your advice @datquocnguyen. I tried it, but it did not work. When specifying bf16, fp16, float16, or bfloat16, I received a ValueError with the message: ValueError: Value fp16 not found in Precision. With amp_fp16, the result is the same as with amp_bf16. According to the composer trainer, only amp_bf16 and amp_fp16 are supported for training on GPU:

precision (Precision | str, optional): Numerical precision to use for training. One of ``fp32``, ``amp_bf16`` or ``amp_fp16`` (recommended). (default: ``Precision.FP32`` if training on CPU; ``Precision.AMP_FP16`` if training on GPU)

Based on this code from composer, it seems that GPU flops are only listed for devices such as the A100, H100, V100, T4, and TRN1.

datquocnguyen commented 4 months ago

I see. Thanks for pointing that out. If composer does not support the 4090 GPU, you might want to use other fine-tuning frameworks.