eole-nlp / eole

Open language modeling toolkit based on PyTorch
https://eole-nlp.github.io/eole
MIT License

Fine-tuning fails with error AssertionError: An error in model's partition and checkpoint's slice was detected #31

Closed by randy-ac 1 month ago

randy-ac commented 1 month ago

Hello,

I am trying to fine-tune a llama3-8b model on 2 GPUs, but I keep getting the following error:

Traceback (most recent call last):
  File "/nas-labs/LM/randy_LLM_exp/domain_classification_eole/eole/eole/utils/distributed.py", line 179, in spawned_train
    process_fn(config, device_id=device_id)
  File "/nas-labs/LM/randy_LLM_exp/domain_classification_eole/eole/eole/train_single.py", line 169, in main
    model, _, _ = get_model_class(config.model).from_config(
  File "/nas-labs/LM/randy_LLM_exp/domain_classification_eole/eole/eole/models/model.py", line 495, in from_config
    model.training_logic(running_config, vocabs, checkpoint, device_id)
  File "/nas-labs/LM/randy_LLM_exp/domain_classification_eole/eole/eole/models/model.py", line 288, in training_logic
    self.load_checkpoint(
  File "/nas-labs/LM/randy_LLM_exp/domain_classification_eole/eole/eole/models/model.py", line 248, in load_checkpoint
    self.load_safe_state_dict(
  File "/nas-labs/LM/randy_LLM_exp/domain_classification_eole/eole/eole/models/model.py", line 706, in load_safe_state_dict
    self._load_param(
  File "/nas-labs/LM/randy_LLM_exp/domain_classification_eole/eole/eole/models/model.py", line 572, in _load_param
    param.data.size()
AssertionError: An error in model's partition and checkpoint's slice was detected

Process SpawnProcess-2:
Traceback (most recent call last):
  File "/usr/lib/python3.10/multiprocessing/process.py", line 314, in _bootstrap
    self.run()
  File "/usr/lib/python3.10/multiprocessing/process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
KeyboardInterrupt
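
For context, the assertion that fires in `_load_param` is essentially a shape check: with `tensor_parallel`, each rank keeps only a partition of every parameter, and loading fails when the slice cut from the full checkpoint tensor does not match that partition's shape. A minimal sketch of that kind of check (an illustration, not eole's actual code; `load_param_slice` is a hypothetical helper):

```python
# Illustrative sketch only, not eole's implementation: the checkpoint tensor is
# split the same way the model was partitioned, and this rank's slice must match
# the local parameter's shape before it can be copied in.
import torch

def load_param_slice(param: torch.nn.Parameter, ckpt_tensor: torch.Tensor,
                     world_size: int, rank: int, dim: int = 0) -> None:
    # split the full checkpoint tensor across ranks along the partition dimension
    ckpt_slice = torch.chunk(ckpt_tensor, world_size, dim=dim)[rank]
    # the error reported above means a comparison like this failed for some parameter
    assert param.data.size() == ckpt_slice.size(), (
        "An error in model's partition and checkpoint's slice was detected"
    )
    param.data.copy_(ckpt_slice)
```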

I got this error on both commit 4954c124099c7fa6d9629ab40475e35476b8e10c and commit 7077ddf3f1f64bf0c481078e03d68f05612f5d04. I also tried running this on two different pairs of GPUs, but the result did not change.

Yesterday I launched the exact same fine-tuning and it ran fine (apart from the tensor-parallel model issue that was fixed in the meantime).

Do you have any hint as to why this could be happening?

Thanks

vince62s commented 1 month ago

Post your config, but maybe @l-k-11235 can help; she tested this morning and it was fine.

randy-ac commented 1 month ago

Thanks for the reply. Here are my configs. I've already checked with Lina, but we were not able to identify the issue.

# General settings
seed: 1234
share_vocab: true
save_data: "/nas-labs/LM/randy_LLM_exp/domain_classification_eole/llama3_eole/llama3-8b-finetune"
src_vocab: "${EOLE_MODEL_DIR}/llama3-8b/vocab.txt"
# size
src_vocab_size: 128256
tgt_vocab_size: 128256

overwrite: true

report_every: 10

n_sample: 0

tensorboard: true
tensorboard_log_dir: /nas-labs/LM/randy_LLM_exp/domain_classification_eole/llama3_eole/llama3-8b-finetune/logs/

# transforms config

transforms: [insert_mask_before_placeholder, onmt_tokenize, filtertoolong]

transforms_configs:
  insert_mask_before_placeholder:
    response_patterns: ["⦅newline⦆⦅newline⦆### Response : ⦅newline⦆"]
  onmt_tokenize:
    src_subword_type: bpe
    src_subword_model: "${EOLE_MODEL_DIR}/llama3-8b/bpe.model"
    tgt_subword_type: bpe
    tgt_subword_model: "${EOLE_MODEL_DIR}/llama3-8b/bpe.model"
    gpt2_pretok: true
  filtertoolong:
    src_seq_length: 2048
    tgt_seq_length: 2048

# datasets

data:
  new_synth_dataset:
    path_src: "/nas-labs/LM/randy_LLM_exp/new_synthetic_dataset/domain_subdomain_dataset/synthetic-dataset-with-roles_train.shuffle"
    weight: 1
  valid:
    path_src: "/nas-labs/LM/randy_LLM_exp/new_synthetic_dataset/domain_subdomain_dataset/synthetic-dataset-with-roles_dev.shuffle"

skip_empty_level: silent # silently ignore empty lines in the data

training:

# GPU dispatching

world_size: 2
gpu_ranks: [0, 1]

parallel_mode: "tensor_parallel"
zero_out_prompt_loss: true

train_steps: 20000
valid_steps: 200

dropout_steps: [0]
dropout: [0.0]
attention_dropout: [0.0]
# Batching
bucket_size: 10
num_workers: 1
batch_type: "sents"
batch_size: 1
valid_batch_size: 1
batch_size_multiple: 1

# Optimization
model_dtype: "fp16"
apex_opt_level: ""
optim: "fusedadam"
learning_rate: 2e-05
warmup_steps: 100
decay_method: "none"
#learning_rate_decay: 0.98
#start_decay_steps: 100
#decay_steps: 10
adam_beta2: 0.998
accum_count: [16] #[8]
accum_steps: [0]
max_grad_norm: 0
label_smoothing: 0.0
param_init: 0
param_init_glorot: true
normalization: "tokens"

# folders
train_from: "${EOLE_MODEL_DIR}/llama3-8b"
model_path: "/nas-labs/LM/randy_LLM_exp/domain_classification_eole/llama3_eole/llama3-8b-finetune"
keep_checkpoint: 30
save_checkpoint_steps: 500

# 4/8bit
quant_layers: ['w_1', 'w_2', 'w_3', 'linear_values', 'linear_query', 'linear_keys', 'final_linear']
quant_type: "bnb_NF4"

# LoRa
lora_layers: ['linear_values', 'linear_query', 'linear_keys', 'final_linear']
lora_rank: 16 #5 #2
lora_dropout: 0.05
lora_alpha: 32
lora_embedding: false
vince62s commented 1 month ago

You need to rename w_1, w_2 and w_3.
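
For illustration, a sketch of what the updated quantization section could look like; the replacement names used here (`gate_up_proj`, `down_proj`, `up_proj`) are an assumption about eole's current MLP layer naming, not confirmed in this thread, so check them against your installed version:

```yaml
# Hypothetical example: the legacy w_1 / w_2 / w_3 entries replaced by the current
# MLP layer names (the names below are an assumption to verify against your eole version)
quant_layers: ['gate_up_proj', 'down_proj', 'up_proj', 'linear_values', 'linear_query', 'linear_keys', 'final_linear']
quant_type: "bnb_NF4"
```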

vince62s commented 1 month ago

Also git pull; the last fix was just pushed.