Open FanWan opened 5 months ago
Hello, I used your method to expand the context length from 4K to 8K and trained Llama2-13B, but got really bad performance.
The following is my training script, which is similar to yours:
export WANDB_MODE=disabled
export CUDA_VISIBLE_DEVICES=2,3,4,5,6

data_dir=/home/minio/gpu-model-jc/gpu-model-jc/jchluo/FastChat/data/fc_agent/dataset-0131/
train_data=${data_dir}/train.fc.0131.json
test_data=${data_dir}/eval.fc.0131.json
base_model_path=/home/minio/gpu-model-jc/gpu-model-jc/llama2/Llama-2-13b-hf
model_name=llama2-13b-agent_0201

torchrun --nproc_per_node=5 --master_port=20001 toolbench/train/train_mem.py \
    --model_name_or_path ${base_model_path} \
    --data_path ${train_data} \
    --eval_data_path ${test_data} \
    --conv_template tool-llama \
    --bf16 True \
    --output_dir output/${model_name} \
    --num_train_epochs 3 \
    --per_device_train_batch_size 1 \
    --per_device_eval_batch_size 2 \
    --gradient_accumulation_steps 4 \
    --evaluation_strategy "epoch" \
    --prediction_loss_only \
    --save_strategy "epoch" \
    --save_total_limit 8 \
    --learning_rate 5e-5 \
    --weight_decay 0. \
    --warmup_ratio 0.04 \
    --lr_scheduler_type "cosine" \
    --logging_steps 1 \
    --fsdp "full_shard auto_wrap" \
    --fsdp_transformer_layer_cls_to_wrap 'LlamaDecoderLayer' \
    --tf32 True \
    --source_model_max_length 4096 \
    --model_max_length 8192 \
    --gradient_checkpointing True \
    --lazy_preprocess True \
    --report_to none > ./logs/${model_name}.log 2>&1 &
So I wonder whether there is a problem with the method of expanding the context length?
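For reference, here is a rough sketch of the quantities these flags imply. It assumes the context-extension method is linear position interpolation, where position indices are divided by the ratio of target to source length; that is an assumption, not something the script itself confirms. The function names are hypothetical and are not part of the training code.

```python
# Minimal sketch, assuming linear position interpolation (an assumption).
# All function names are hypothetical; the values come from the script above.

def interpolation_ratio(source_len: int, target_len: int) -> float:
    """Factor by which position indices would be compressed."""
    return target_len / source_len


def scaled_positions(seq_len: int, ratio: float) -> list[float]:
    """Position indices after dividing by the interpolation ratio."""
    return [i / ratio for i in range(seq_len)]


def effective_batch_size(num_gpus: int, per_device_batch: int, grad_accum: int) -> int:
    """Number of sequences contributing to each optimizer step."""
    return num_gpus * per_device_batch * grad_accum


# --source_model_max_length 4096, --model_max_length 8192
ratio = interpolation_ratio(4096, 8192)
positions = scaled_positions(8192, ratio)

# --nproc_per_node=5, --per_device_train_batch_size 1, --gradient_accumulation_steps 4
batch = effective_batch_size(5, 1, 4)

print(f"interpolation ratio: {ratio}")
print(f"largest scaled position: {max(positions)}")
print(f"effective batch size: {batch}")
```

Under that assumption, every scaled position stays below the original 4096 limit. The same ratio would have to be applied at inference time as well, so a mismatch there is one thing worth ruling out.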