randy-ac closed this issue 1 month ago
Post your config. Maybe @l-k-11235 can help; she tested this morning and it was fine.
Thanks for the reply. Here are my configs. I've already checked with Lina, but we were not able to identify the issue.
seed: 1234
share_vocab: true
save_data: "/nas-labs/LM/randy_LLM_exp/domain_classification_eole/llama3_eole/llama3-8b-finetune"
src_vocab: "${EOLE_MODEL_DIR}/llama3-8b/vocab.txt"
# size
src_vocab_size: 128256
tgt_vocab_size: 128256
overwrite: true
report_every: 10
n_sample: 0
tensorboard: true
tensorboard_log_dir: /nas-labs/LM/randy_LLM_exp/domain_classification_eole/llama3_eole/llama3-8b-finetune/logs/
transforms: [insert_mask_before_placeholder, onmt_tokenize, filtertoolong]
transforms_configs:
    insert_mask_before_placeholder:
        response_patterns: ["⦅newline⦆⦅newline⦆### Response : ⦅newline⦆"]
    onmt_tokenize:
        src_subword_type: bpe
        src_subword_model: "${EOLE_MODEL_DIR}/llama3-8b/bpe.model"
        tgt_subword_type: bpe
        tgt_subword_model: "${EOLE_MODEL_DIR}/llama3-8b/bpe.model"
        gpt2_pretok: true
    filtertoolong:
        src_seq_length: 2048
        tgt_seq_length: 2048
data:
    new_synth_dataset:
        path_src: "/nas-labs/LM/randy_LLM_exp/new_synthetic_dataset/domain_subdomain_dataset/synthetic-dataset-with-roles_train.shuffle"
        weight: 1
    valid:
        path_src: "/nas-labs/LM/randy_LLM_exp/new_synthetic_dataset/domain_subdomain_dataset/synthetic-dataset-with-roles_dev.shuffle"
skip_empty_level: silent # silently ignore empty lines in the data
training:
    world_size: 2
    gpu_ranks: [0, 1]
    parallel_mode: "tensor_parallel"
    zero_out_prompt_loss: true
    train_steps: 20000
    valid_steps: 200
    dropout_steps: [0]
    dropout: [0.0]
    attention_dropout: [0.0]
    # Batching
    bucket_size: 10
    num_workers: 1
    batch_type: "sents"
    batch_size: 1
    valid_batch_size: 1
    batch_size_multiple: 1
    # Optimization
    model_dtype: "fp16"
    apex_opt_level: ""
    optim: "fusedadam"
    learning_rate: 2e-05
    warmup_steps: 100
    decay_method: "none"
    #learning_rate_decay: 0.98
    #start_decay_steps: 100
    #decay_steps: 10
    adam_beta2: 0.998
    accum_count: [16] #[8]
    accum_steps: [0]
    max_grad_norm: 0
    label_smoothing: 0.0
    param_init: 0
    param_init_glorot: true
    normalization: "tokens"
    # folders
    train_from: "${EOLE_MODEL_DIR}/llama3-8b"
    model_path: "/nas-labs/LM/randy_LLM_exp/domain_classification_eole/llama3_eole/llama3-8b-finetune"
    keep_checkpoint: 30
    save_checkpoint_steps: 500
    # 4/8bit
    quant_layers: ['w_1', 'w_2', 'w_3', 'linear_values', 'linear_query', 'linear_keys', 'final_linear']
    quant_type: "bnb_NF4"
    # LoRa
    lora_layers: ['linear_values', 'linear_query', 'linear_keys', 'final_linear']
    lora_rank: 16 #5 #2
    lora_dropout: 0.05
    lora_alpha: 32
    lora_embedding: false
You need to rename w_1, w_2 and w_3 in quant_layers.
Also git pull; the last fix was just pushed.
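For anyone checking their own config against this advice, here is a minimal sketch that lists which layer lists still contain the old `w_1`, `w_2`, `w_3` names. The helper `find_legacy_layers` is hypothetical (it is not part of eole), and the config is passed as a plain dict rather than loaded from YAML. The replacement names are not given in this thread, so the sketch only reports the stale entries and does not rewrite them.

```python
# Legacy feed-forward layer names mentioned in the reply above.
LEGACY_NAMES = {"w_1", "w_2", "w_3"}


def find_legacy_layers(config, keys=("quant_layers", "lora_layers")):
    """Return {key: [legacy names]} for each layer list that still uses w_1/w_2/w_3.

    Hypothetical helper, not an eole API. Looks under "training" when that
    section exists, otherwise at the top level of the dict.
    """
    training = config.get("training", config)
    hits = {}
    for key in keys:
        stale = [name for name in (training.get(key) or []) if name in LEGACY_NAMES]
        if stale:
            hits[key] = stale
    return hits


# Layer lists copied from the config posted in this thread.
config = {
    "training": {
        "quant_layers": ["w_1", "w_2", "w_3", "linear_values",
                         "linear_query", "linear_keys", "final_linear"],
        "lora_layers": ["linear_values", "linear_query",
                        "linear_keys", "final_linear"],
    }
}

print(find_legacy_layers(config))  # {'quant_layers': ['w_1', 'w_2', 'w_3']}
```

With the config above, only `quant_layers` is flagged; `lora_layers` does not reference the renamed layers.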
Hello,
I am trying to finetune a llama3-8b model on 2 gpus but I keep getting the following error:
I got this error both on commit 4954c124099c7fa6d9629ab40475e35476b8e10c and on commit 7077ddf3f1f64bf0c481078e03d68f05612f5d04. I also tried to run this on two different pairs of gpus but the result did not change.
Yesterday I launched the exact same fine-tuning and it ran fine (apart from the tensor parallel model issue that was fixed in the meantime).
Do you have any hint as to why this could be happening?
Thanks