Hi, thanks for your interest!
A100s should be enough to full-weight fine-tune 7B models.
One thing you can try is reinstalling an older environment that matches the one used for training ALMA: https://github.com/fe1ixxu/ALMA/blob/a3cc7877752779346312bb07798172eadc83d692/install_alma.sh
The DeepSpeed config I set here uses ZeRO stage 2, which does not shard the model parameters (only optimizer states and gradients). You may want to try DeepSpeed ZeRO-3 or FSDP. An example FSDP config you can use is:
compute_environment: LOCAL_MACHINE
debug: false
distributed_type: FSDP
downcast_bf16: 'no'
fsdp_config:
  fsdp_auto_wrap_policy: TRANSFORMER_BASED_WRAP
  fsdp_backward_prefetch: BACKWARD_PRE
  fsdp_cpu_ram_efficient_loading: true
  fsdp_forward_prefetch: true
  fsdp_offload_params: false
  fsdp_sharding_strategy: FULL_SHARD
  fsdp_state_dict_type: SHARDED_STATE_DICT
  fsdp_sync_module_states: true
  fsdp_use_orig_params: true
machine_rank: 0
main_training_function: main
mixed_precision: bf16
num_machines: 1
num_processes: 16
rdzv_backend: static
same_network: true
tpu_env: []
tpu_use_cluster: false
tpu_use_sudo: false
use_cpu: false
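If you would rather stay with DeepSpeed than switch to FSDP, a minimal ZeRO-3 sketch of the accelerate config, adapted from your ZeRO-2 config, could look like the following (the offload_* and zero3_save_16bit_model values are assumptions you may need to tune for your 2xA100 setup):
compute_environment: LOCAL_MACHINE
debug: false
deepspeed_config:
  gradient_accumulation_steps: 1
  gradient_clipping: 1.0
  offload_optimizer_device: none
  offload_param_device: none
  zero3_init_flag: true
  zero3_save_16bit_model: true
  zero_stage: 3
distributed_type: DEEPSPEED
downcast_bf16: 'no'
machine_rank: 0
main_training_function: main
mixed_precision: bf16
num_machines: 1
num_processes: 2
rdzv_backend: static
same_network: true
tpu_env: []
tpu_use_cluster: false
tpu_use_sudo: false
use_cpu: false
With num_processes: 2, ZeRO-3 shards parameters, gradients, and optimizer states across both A100s, which is usually what lets full-weight 7B training fit without CPU offload.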
Thanks for your info. I will look into this issue.
@fe1ixxu I found the bug. I was trying to use gemma-7b, and it turns out it requires more memory to train. I think it is because of the vocab size? Not sure, but llama2-7b-hf works well.
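For reference, a rough back-of-the-envelope sketch of why the vocab size matters. This is a hypothetical comparison assuming the published model configs (32K vocab / 4096 hidden with an untied LM head for llama2-7b-hf, versus 256K vocab / 3072 hidden with tied embeddings for gemma-7b); real training memory is several times larger once gradients and Adam states are counted:
# Count only the embedding / LM-head parameters, which is where the
# vocab-size difference between the two models shows up.
def embed_params(vocab_size, hidden_size, tied_head):
    # input embedding matrix, plus a separate output head if it is not tied
    return vocab_size * hidden_size * (1 if tied_head else 2)

llama2_7b = embed_params(32_000, 4096, tied_head=False)   # ~0.26B parameters
gemma_7b = embed_params(256_000, 3072, tied_head=True)    # ~0.79B parameters

print(f"llama2-7b-hf embeddings: {llama2_7b / 1e9:.2f}B params")
print(f"gemma-7b embeddings:     {gemma_7b / 1e9:.2f}B params")
Even with tied embeddings, gemma-7b carries roughly three times the embedding parameters of llama2-7b-hf, each needing gradients and optimizer states during full-weight training.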
Hi, thanks for sharing your great work.
I am working on pretraining LLaMA-2-7B-HF with OSCAR data, but I get an OOM error on 2xA100 80GB GPUs. Setting offload_optimizer_device to cpu avoids the OOM, but the estimated training time becomes about 10,000 hours. I was wondering how much memory is needed to pretrain LLaMA-2-7B-HF using your method?
My settings were:
mono_ft:
OUTPUT_DIR=${1:-"./llama2-7b-oscar-ft"}
export HF_DATASETS_CACHE=".cache/datasets/"
port=$(( RANDOM % (50000 - 30000 + 1) + 30000 ))
accelerate launch --main_process_port ${port} --config_file configs/deepspeed_train_config_bf16.yaml \
    run_llmmt.py \
    --model_name_or_path meta-llama/Llama-2-7b-hf \
    --oscar_data_path oscar-corpus/OSCAR-2301 \
    --oscar_data_lang en,es,ja,ko \
    --interleave_probs "0.3,0.2,0.2,0.3" \
    --streaming \
    --max_steps 600000 \
    --bf16 \
    --do_train \
    --low_cpu_mem_usage \
    --learning_rate 2e-5 \
    --weight_decay 0.01 \
    --gradient_accumulation_steps 1 \
    --lr_scheduler_type cosine \
    --warmup_ratio 0.01 \
    --ignore_pad_token_for_loss \
    --ignore_prompt_token_for_loss \
    --per_device_train_batch_size 1 \
    --per_device_eval_batch_size 1 \
    --save_strategy steps \
    --save_steps 2000 \
    --save_total_limit 1 \
    --logging_strategy steps \
    --logging_steps 1 \
    --output_dir ${OUTPUT_DIR} \
    --max_new_tokens 256 \
    --max_source_length 256 \
    --seed 42 \
    --overwrite_output_dir \
    --report_to none
deepspeed_train_config_bf16:
compute_environment: LOCAL_MACHINE
deepspeed_config:
  gradient_accumulation_steps: 1
  gradient_clipping: 1.0
  offload_optimizer_device: none
  offload_param_device: cpu
  zero3_init_flag: false
  zero_stage: 2
distributed_type: DEEPSPEED
downcast_bf16: 'no'
machine_rank: 1
main_training_function: main
mixed_precision: bf16
num_machines: 1
num_processes: 2
rdzv_backend: static
same_network: true
tpu_env: []
tpu_use_cluster: false
tpu_use_sudo: false
use_cpu: false