fe1ixxu / ALMA

State-of-the-art LLM-based translation models.

Pretraining inquiry. #32

Closed gyupro closed 8 months ago

gyupro commented 9 months ago

Hi, thanks for sharing your great work.

I am working on pretraining LLaMA-2-7B-HF with OSCAR data, but it gives me an OOM error (using 2x A100 80GB GPUs). Setting offload_optimizer_device to cpu results in an estimated training time of about 10,000 hours (on the same 2x A100 80GB GPUs). I was wondering how much memory is needed to pretrain LLaMA-2-7B-HF with your method?

My settings were:

mono_ft:

OUTPUT_DIR=${1:-"./llama2-7b-oscar-ft"}
export HF_DATASETS_CACHE=".cache/datasets/"

port=$(( RANDOM % (50000 - 30000 + 1 ) + 30000 ))
accelerate launch --main_process_port ${port} --config_file configs/deepspeed_train_config_bf16.yaml \
    run_llmmt.py \
    --model_name_or_path meta-llama/Llama-2-7b-hf \
    --oscar_data_path oscar-corpus/OSCAR-2301 \
    --oscar_data_lang en,es,ja,ko \
    --interleave_probs "0.3,0.2,0.2,0.3" \
    --streaming \
    --max_steps 600000 \
    --bf16 \
    --do_train \
    --low_cpu_mem_usage \
    --learning_rate 2e-5 \
    --weight_decay 0.01 \
    --gradient_accumulation_steps 1 \
    --lr_scheduler_type cosine \
    --warmup_ratio 0.01 \
    --ignore_pad_token_for_loss \
    --ignore_prompt_token_for_loss \
    --per_device_train_batch_size 1 \
    --per_device_eval_batch_size 1 \
    --save_strategy steps \
    --save_steps 2000 \
    --save_total_limit 1 \
    --logging_strategy steps \
    --logging_steps 1 \
    --output_dir ${OUTPUT_DIR} \
    --max_new_tokens 256 \
    --max_source_length 256 \
    --seed 42 \
    --overwrite_output_dir \
    --report_to none

deepspeed_train_config_bf16.yaml:

compute_environment: LOCAL_MACHINE
deepspeed_config:
  gradient_accumulation_steps: 1
  gradient_clipping: 1.0
  offload_optimizer_device: none
  offload_param_device: cpu
  zero3_init_flag: false
  zero_stage: 2
distributed_type: DEEPSPEED
downcast_bf16: 'no'
machine_rank: 1
main_training_function: main
mixed_precision: bf16
num_machines: 1
num_processes: 2
rdzv_backend: static
same_network: true
tpu_env: []
tpu_use_cluster: false
tpu_use_sudo: false
use_cpu: false

fe1ixxu commented 8 months ago

Hi, thanks for your interest!

Your 2x A100 80GB GPUs should be enough to full-weight fine-tune a 7B model.

  1. One thing you can try is reinstalling the environment with the same package versions used to train ALMA: https://github.com/fe1ixxu/ALMA/blob/a3cc7877752779346312bb07798172eadc83d692/install_alma.sh

  2. The DeepSpeed config I provide here uses ZeRO stage 2, which does not shard the model parameters. You may want to try DeepSpeed ZeRO-3 or FSDP instead. An example FSDP config you can use is shown below (a ZeRO-3 sketch follows it):

compute_environment: LOCAL_MACHINE
debug: false
distributed_type: FSDP
downcast_bf16: 'no'
fsdp_config:
  fsdp_auto_wrap_policy: TRANSFORMER_BASED_WRAP
  fsdp_backward_prefetch: BACKWARD_PRE
  fsdp_cpu_ram_efficient_loading: true
  fsdp_forward_prefetch: true
  fsdp_offload_params: false
  fsdp_sharding_strategy: FULL_SHARD
  fsdp_state_dict_type: SHARDED_STATE_DICT
  fsdp_sync_module_states: true
  fsdp_use_orig_params: true
machine_rank: 0
main_training_function: main
mixed_precision: bf16
num_machines: 1
num_processes: 16
rdzv_backend: static
same_network: true
tpu_env: []
tpu_use_cluster: false
tpu_use_sudo: false
use_cpu: false
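
If you prefer to stay with DeepSpeed, a ZeRO-3 accelerate config could look roughly like the sketch below. It is adapted from your ZeRO-2 config above rather than taken from this repo, so treat the exact values (e.g. num_processes: 2 for your two GPUs, parameters kept on GPU) as assumptions to adjust:

compute_environment: LOCAL_MACHINE
debug: false
deepspeed_config:
  # stage 3 shards parameters, gradients, and optimizer states across GPUs
  gradient_accumulation_steps: 1
  gradient_clipping: 1.0
  offload_optimizer_device: none
  # set offload_param_device: cpu as well if you still hit OOM
  offload_param_device: none
  zero3_init_flag: true
  zero3_save_16bit_model: true
  zero_stage: 3
distributed_type: DEEPSPEED
downcast_bf16: 'no'
machine_rank: 0
main_training_function: main
mixed_precision: bf16
num_machines: 1
num_processes: 2
rdzv_backend: static
same_network: true
tpu_env: []
tpu_use_cluster: false
tpu_use_sudo: false
use_cpu: false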
gyupro commented 8 months ago

Thanks for the info. I will look into this issue.

gyupro commented 8 months ago

@fe1ixxu I found the bug. I was trying to use gemma-7b, and it turns out it requires more memory to train. I think it's because of the larger vocab size? Not sure, but llama2-7b-hf works well.
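
For what it's worth, a quick way to compare the two models' vocabulary sizes (assuming the Hub IDs meta-llama/Llama-2-7b-hf and google/gemma-7b; both repos are gated, so an access token is needed) is a short Python check like the sketch below. A larger vocab means larger embedding and lm_head matrices, and therefore larger gradients and optimizer states during training.

from transformers import AutoConfig

# Compare vocab and hidden sizes; a bigger vocab means a bigger embedding/lm_head,
# hence more memory for weights, gradients, and optimizer states.
for name in ["meta-llama/Llama-2-7b-hf", "google/gemma-7b"]:
    cfg = AutoConfig.from_pretrained(name)
    print(f"{name}: vocab_size={cfg.vocab_size}, hidden_size={cfg.hidden_size}")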