ParticleMedia / RAGTruth

GitHub repository for "RAGTruth: A Hallucination Corpus for Developing Trustworthy Retrieval-Augmented Language Models"
https://arxiv.org/abs/2401.00396
MIT License

[BUG] ValueError(f"Could not find the transformer layer class {layer_class} in the model.") #9

Open · Vanessa-Taing opened this issue 1 week ago

Vanessa-Taing commented 1 week ago

Objective: To train and evaluate a model on RAGTruth dataset

Settings: OS: Ubuntu (WSL); Python: 3.12.4; NVIDIA Driver Version: 536.23; CUDA Version: 12.2

Replication steps:

  1. Git clone
  2. Run python prepare_dataset.py:
(14942, 12)
Summary     300
Data2txt    300
QA          295
Name: task_type, dtype: int64
C:\Users\CSOC\Documents\lrp4rag\RAGTruth\baseline\prepare_dataset.py:61: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  dev['fold']=-1
(2675, 12)
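
As the warning suggests, the chained assignment can be replaced by taking an explicit copy of the slice first. A minimal sketch with stand-in data, not the exact code from prepare_dataset.py:

    import pandas as pd

    # Stand-in data; the real prepare_dataset.py builds `dev` differently.
    # This only illustrates the SettingWithCopyWarning above and one fix.
    df = pd.DataFrame({"task_type": ["Summary", "QA", "Data2txt"], "fold": [0, 0, 0]})

    dev = df[df["task_type"] == "QA"]
    dev["fold"] = -1                           # assignment on a slice -> warning

    dev = df[df["task_type"] == "QA"].copy()   # take an explicit copy instead
    dev["fold"] = -1                           # no warning, `df` stays untouched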
  3. Run the training command:
    CUDA_VISIBLE_DEVICES=0 torchrun --nnodes 1 --nproc_per_node 4 train.py \
    --model_name_or_path akjindal53244/Llama-3.1-Storm-8B \
    --output_dir ./exp/baseline \
    --do_train \
    --dataset detect_yesno \
    --num_train_epochs 1 \
    --learning_rate 2e-5 \
    --drop_neg_ratio -1 \
    --train_file ./train.jsonl \
    --eval_file ./dev.jsonl \
    --bf16 True \
    --tf32 False \
    --use_flashatt_2 False \
    --per_device_train_batch_size 8 \
    --per_device_eval_batch_size 8 \
    --gradient_accumulation_steps 1 \
    --model_max_length 4096 \
    --ddp_find_unused_parameters False \
    --logging_steps 1 \
    --run_name baseline \
    --lr_scheduler_type 'cosine' \
    --warmup_ratio 0.1 \
    --save_steps 10000 \
    --save_total_limit 2 \
    --overwrite_output_dir \
    --eval_strategy steps \
    --eval_steps 80 \
    --fsdp "shard_grad_op auto_wrap" \
    --fsdp_config ./configs/fsdp.json

Model downloaded:

  warnings.warn("urllib3 ({}) or chardet ({}) doesn't match a supported "
tokenizer_config.json: 100%|█████████████████████████████████████████████████| 51.0k/51.0k [00:00<00:00, 285kB/s]
tokenizer.json: 100%|███████████████████████████████████████████████████████| 9.09M/9.09M [00:01<00:00, 4.91MB/s]
special_tokens_map.json: 100%|██████████████████████████████████████████████████| 296/296 [00:00<00:00, 3.71MB/s]
model-00004-of-00004.safetensors: 100%|█████████████████████████████████████| 1.19G/1.19G [01:17<00:00, 15.3MB/s]
Downloading shards: 100%|█████████████████████████████████████████████████████████| 4/4 [12:41<00:00, 190.30s/it]
Loading checkpoint shards: 100%|███████████████████████████████████████████████████| 4/4 [00:00<00:00,  8.65it/s]
generation_config.json: 100%|███████████████████████████████████████████████████| 185/185 [00:00<00:00, 2.56MB/s]

Error log:

[rank0]:     raise ValueError(f"Could not find the transformer layer class {layer_class} in the model.")
[rank0]: ValueError: Could not find the transformer layer class L in the model.
E0905 11:00:13.696000 139999731015680 torch/distributed/elastic/multiprocessing/api.py:833] failed (exitcode: 1) 
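
For context, one likely cause of the "transformer layer class L" message is that the FSDP auto-wrap policy received the layer-class name as a bare string and iterated it character by character, so it went looking for a class literally named "L". Since ./configs/fsdp.json is not reproduced in this issue, this is only a guess, but for a Llama-family checkpoint the block to wrap is LlamaDecoderLayer and the value is usually expected as a list. A hypothetical config, written from Python:

    import json

    # Hypothetical contents for ./configs/fsdp.json; the repo's actual file is
    # not shown here, and the exact key name depends on the transformers
    # version (older releases used "fsdp_transformer_layer_cls_to_wrap").
    # Passing the class name inside a list avoids it being iterated character
    # by character, which would make FSDP look for a class named "L".
    fsdp_config = {
        "transformer_layer_cls_to_wrap": ["LlamaDecoderLayer"],
    }

    with open("./configs/fsdp.json", "w") as f:
        json.dump(fsdp_config, f, indent=2)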

Thank you for the great work; I would really appreciate it if you could help with the above issue.

thuwyh commented 1 week ago

Can you run the original training command? It seems you changed the model name.

Vanessa-Taing commented 1 week ago

Thank you for the speedy reply. I changed the training command to:

CUDA_VISIBLE_DEVICES=0 python train.py \
--model_name_or_path akjindal53244/Llama-3.1-Storm-8B \
--output_dir ./exp/llama3_storm8b_baseline \
--do_train \
--dataset detect_yesno \
--num_train_epochs 1 \
--learning_rate 2e-5 \
--drop_neg_ratio -1 \
--train_file ./train.jsonl \
--eval_file ./dev.jsonl \
--bf16 True \
--tf32 False \
--use_flashatt_2 False \
--use_peft True \
--per_device_train_batch_size 1 \
--per_device_eval_batch_size 1 \
--gradient_accumulation_steps 16 \
--model_max_length 4096 \
--logging_steps 1 \
--run_name llama3_storm8b_baseline \
--lr_scheduler_type 'cosine' \
--warmup_ratio 0.1 \
--save_steps 10000 \
--save_total_limit 2 \
--overwrite_output_dir \
--eval_strategy steps \
--eval_steps 80 \
--lora_r 8 \
--lora_alpha 32 \
--lora_dropout 0.05 \
--target_modules "q_proj,v_proj,k_proj,gate_proj,up_proj,down_proj" \
--load_in_8bit True
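
For reference, those LoRA flags roughly correspond to a peft LoraConfig like the one below. This is an assumption about how train.py maps its CLI arguments, not the repo's actual code:

    from peft import LoraConfig

    # Rough peft equivalent of the LoRA flags above (the repo's train.py
    # may construct this differently).
    lora_config = LoraConfig(
        r=8,
        lora_alpha=32,
        lora_dropout=0.05,
        target_modules=["q_proj", "v_proj", "k_proj", "gate_proj", "up_proj", "down_proj"],
        task_type="CAUSAL_LM",
    )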

The code runs, with wandb showing the progress, but the process soon terminates with:

torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 23.99 GiB of which 0 bytes is free.

I think this is not a code problem but rather a limitation of my device? I am using an NVIDIA GeForce RTX 4090, FYI.

P.S. I changed the model name because I wanted to try RAGTruth training with that specific model. Is that a valid approach?

Thanks!

thuwyh commented 1 week ago

A single 4090 may be a problem for training. We trained our model with 4 A100 80GB GPUs.
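
For anyone trying to squeeze an 8B model onto a single 24 GB card, a common workaround outside of this repo's train.py is 4-bit (QLoRA-style) loading. A minimal sketch with transformers and peft, not the RAGTruth training script:

    import torch
    from transformers import AutoModelForCausalLM, BitsAndBytesConfig
    from peft import prepare_model_for_kbit_training

    # Generic 4-bit loading to reduce memory for LoRA fine-tuning on a 24 GB GPU.
    # This is NOT the repo's train.py; it only illustrates the technique.
    bnb_config = BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_quant_type="nf4",
        bnb_4bit_compute_dtype=torch.bfloat16,
        bnb_4bit_use_double_quant=True,
    )

    model = AutoModelForCausalLM.from_pretrained(
        "akjindal53244/Llama-3.1-Storm-8B",
        quantization_config=bnb_config,
        device_map="auto",
    )

    # Casts norms to fp32, enables input grads, and turns on gradient
    # checkpointing by default, in preparation for k-bit training.
    model = prepare_model_for_kbit_training(model)

The prepared model can then be wrapped with peft.get_peft_model and a LoraConfig like the one sketched earlier in this thread.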