Vanessa-Taing opened 1 week ago
Can you run the original training command? It seems you changed the model name.
Thank you for the speedy reply. I changed the training command to:
CUDA_VISIBLE_DEVICES=0 python train.py \
--model_name_or_path akjindal53244/Llama-3.1-Storm-8B \
--output_dir ./exp/llama3_storm8b_baseline \
--do_train \
--dataset detect_yesno \
--num_train_epochs 1 \
--learning_rate 2e-5 \
--drop_neg_ratio -1 \
--train_file ./train.jsonl \
--eval_file ./dev.jsonl \
--bf16 True \
--tf32 False \
--use_flashatt_2 False \
--use_peft True \
--per_device_train_batch_size 1 \
--per_device_eval_batch_size 1 \
--gradient_accumulation_steps 16 \
--model_max_length 4096 \
--logging_steps 1 \
--run_name llama3_storm8b_baseline \
--lr_scheduler_type 'cosine' \
--warmup_ratio 0.1 \
--save_steps 10000 \
--save_total_limit 2 \
--overwrite_output_dir \
--eval_strategy steps \
--eval_steps 80 \
--lora_r 8 \
--lora_alpha 32 \
--lora_dropout 0.05 \
--target_modules "q_proj,v_proj,k_proj,gate_proj,up_proj,down_proj" \
--load_in_8bit True
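As a sanity check on whether the command above should fit in 24 GiB, here is a rough back-of-envelope estimate of the static training memory (all parameter counts below are assumptions for illustration, not measurements):

```python
# Rough memory estimate for the command above: an ~8B-parameter model loaded
# in 8-bit (load_in_8bit) with LoRA adapters trained on top. All counts are
# approximations for illustration only.

def estimate_training_memory_gib(
    n_params: float = 8e9,        # approximate Llama-3.1 8B parameter count
    bytes_per_weight: float = 1,  # 8-bit quantized frozen base weights
    lora_params: float = 40e6,    # rough adapter size for r=8 over the 6 target module types
) -> float:
    base = n_params * bytes_per_weight  # frozen quantized weights
    adapters = lora_params * 2          # trainable adapters in bf16
    grads = lora_params * 2             # gradients exist only for the adapters
    adam_states = lora_params * 4 * 2   # fp32 Adam first and second moments
    total_bytes = base + adapters + grads + adam_states
    return total_bytes / 2**30

print(f"~{estimate_training_memory_gib():.1f} GiB before activations and CUDA overhead")
```

By this rough count the static footprint is under 8 GiB, which suggests it is the activations at model_max_length 4096 (no gradient checkpointing in the command) plus allocator fragmentation that push a 24 GiB card over the edge.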
The code runs, with wandb showing the progress, but the process soon terminates:
torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 23.99 GiB of which 0 bytes is free.
I think this is not a code problem but rather my device's limitations. I am using an NVIDIA GeForce RTX 4090, FYI.
P.S. I changed the model name because I wanted to try that specific model on the RAGTruth training. Is that a valid approach?
Thanks!
One 4090 may be a problem for training. We trained our model with 4 A100 80G GPUs.
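If retrying on a single 24 GiB card anyway, a commonly effective direction is to shrink activation memory. A hypothetical variant of the command (the extra flags are assumptions; they only work if the repo's train.py forwards standard Hugging Face TrainingArguments):

```shell
# Hypothetical memory-saving variant; flag support depends on train.py.
export PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True  # reduce fragmentation
CUDA_VISIBLE_DEVICES=0 python train.py \
  --model_name_or_path akjindal53244/Llama-3.1-Storm-8B \
  --model_max_length 2048 \
  --gradient_checkpointing True \
  --per_device_train_batch_size 1 \
  --gradient_accumulation_steps 16 \
  ...
```

Halving the sequence length roughly halves activation memory, and gradient checkpointing trades recomputation for a large further reduction, at the cost of slower steps.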
Objective: To train and evaluate a model on the RAGTruth dataset
Settings:
OS: Ubuntu (WSL)
Python: 3.12.4
NVIDIA Driver Version: 536.23
CUDA Version: 12.2
Replication steps:
python prepare_dataset.py
Model downloaded.
Error log:
Thank you for the great work; I would really appreciate it if you could help with the above issue.