axolotl-ai-cloud / axolotl

https://axolotl-ai-cloud.github.io/axolotl/
Apache License 2.0

Llama3 SFT training error #1570

Closed: amitagh closed this issue 4 months ago

amitagh commented 4 months ago

Please check that this issue hasn't been reported before.

Expected Behavior

Llama3 8B Instruct (chat) model SFT training with LoRA is failing. SFT training with an Alpaca-format dataset should work.

Current behaviour

The following error is seen:

  0%|          | 0/284 [00:00<?, ?it/s]
[2024-04-26 14:26:03,687] [INFO] [axolotl.utils.samplers.multipack._len_est:184] [PID:32911] [RANK:0] packing_efficiency_estimate: 0.98 total_num_tokens per device: 7085522
You're using a PreTrainedTokenizerFast tokenizer. Please note that with a fast tokenizer, using the __call__ method is faster than using a method to encode the text followed by a call to the pad method to get a padded encoding.
Traceback (most recent call last):
  File "/usr/lib/python3.10/runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/usr/lib/python3.10/runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "/mnt/resource_nvme/venv/src/axolotl/src/axolotl/cli/train.py", line 59, in <module>
    fire.Fire(do_cli)
  File "/mnt/resource_nvme/venv/lib/python3.10/site-packages/fire/core.py", line 143, in Fire
    component_trace = _Fire(component, args, parsed_flag_args, context, name)
  File "/mnt/resource_nvme/venv/lib/python3.10/site-packages/fire/core.py", line 477, in _Fire
    component, remaining_args = _CallAndUpdateTrace(
  File "/mnt/resource_nvme/venv/lib/python3.10/site-packages/fire/core.py", line 693, in _CallAndUpdateTrace
    component = fn(*varargs, **kwargs)
  File "/mnt/resource_nvme/venv/src/axolotl/src/axolotl/cli/train.py", line 35, in do_cli
    return do_train(parsed_cfg, parsed_cli_args)
  File "/mnt/resource_nvme/venv/src/axolotl/src/axolotl/cli/train.py", line 55, in do_train
    return train(cfg=cfg, cli_args=cli_args, dataset_meta=dataset_meta)
  File "/mnt/resource_nvme/venv/src/axolotl/src/axolotl/train.py", line 163, in train
    trainer.train(resume_from_checkpoint=resume_from_checkpoint)
  File "/mnt/resource_nvme/venv/lib/python3.10/site-packages/transformers/trainer.py", line 1837, in train
    return inner_training_loop(
  File "/mnt/resource_nvme/venv/lib/python3.10/site-packages/transformers/trainer.py", line 2143, in _inner_training_loop
    for step, inputs in enumerate(epoch_iterator):
  File "/mnt/resource_nvme/venv/lib/python3.10/site-packages/accelerate/data_loader.py", line 452, in __iter__
    current_batch = next(dataloader_iter)
  File "/mnt/resource_nvme/venv/lib/python3.10/site-packages/torch/utils/data/dataloader.py", line 630, in __next__
    data = self._next_data()
  File "/mnt/resource_nvme/venv/lib/python3.10/site-packages/torch/utils/data/dataloader.py", line 674, in _next_data
    data = self._dataset_fetcher.fetch(index)  # may raise StopIteration
  File "/mnt/resource_nvme/venv/src/axolotl/src/axolotl/monkeypatch/data/batch_dataset_fetcher.py", line 32, in fetch
    return self.collate_fn(data)
  File "/mnt/resource_nvme/venv/src/axolotl/src/axolotl/utils/collators.py", line 154, in __call__
    return super().__call__(out_features, return_tensors=return_tensors)
  File "/mnt/resource_nvme/venv/src/axolotl/src/axolotl/utils/collators.py", line 106, in __call__
    features = self.tokenizer.pad(
  File "/mnt/resource_nvme/venv/lib/python3.10/site-packages/transformers/tokenization_utils_base.py", line 3315, in pad
    padding_strategy, _, max_length, _ = self._get_padding_truncation_strategies(
  File "/mnt/resource_nvme/venv/lib/python3.10/site-packages/transformers/tokenization_utils_base.py", line 2763, in _get_padding_truncation_strategies
    raise ValueError(
ValueError: Asking to pad but the tokenizer does not have a padding token. Please select a token to use as pad_token (tokenizer.pad_token = tokenizer.eos_token e.g.) or add a new pad token via tokenizer.add_special_tokens({'pad_token': '[PAD]'}).

  0%|          | 0/284 [00:01<?, ?it/s]
Traceback (most recent call last):
  File "/mnt/resource_nvme/venv/bin/accelerate", line 8, in <module>
    sys.exit(main())
  File "/mnt/resource_nvme/venv/lib/python3.10/site-packages/accelerate/commands/accelerate_cli.py", line 46, in main
    args.func(args)
  File "/mnt/resource_nvme/venv/lib/python3.10/site-packages/accelerate/commands/launch.py", line 1057, in launch_command
    simple_launcher(args)
  File "/mnt/resource_nvme/venv/lib/python3.10/site-packages/accelerate/commands/launch.py", line 673, in simple_launcher
    raise subprocess.CalledProcessError(returncode=process.returncode, cmd=cmd)
subprocess.CalledProcessError: Command '['/mnt/resource_nvme/venv/bin/python3.10', '-m', 'axolotl.cli.train', 'ax_cfg.yml']' returned non-zero exit status 1
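
For reference, the root cause can be confirmed outside axolotl with a few lines of transformers code (a minimal sketch; it assumes the gated meta-llama/Meta-Llama-3-8B-Instruct checkpoint is accessible):

from transformers import AutoTokenizer

# Load the same tokenizer the config points at.
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B-Instruct")

# The Llama3 Instruct tokenizer ships without a pad token defined.
print(tokenizer.pad_token)  # None for this checkpoint

# Padding a batch therefore raises the same ValueError seen in the traceback.
try:
    tokenizer.pad([tokenizer("hello"), tokenizer("a longer example sentence")], padding=True)
except ValueError as err:
    print(err)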

Steps to reproduce

SFT-train the Llama3 8B Instruct model on an Alpaca-format dataset with the config below (the run is launched through accelerate, which invokes python -m axolotl.cli.train ax_cfg.yml, as shown in the traceback). Training fails with the error above. The dataset was preprocessed before training.

Config yaml

base_model: meta-llama/Meta-Llama-3-8B-Instruct
model_type: LlamaForCausalLM
tokenizer_type: AutoTokenizer

load_in_8bit: false
load_in_4bit: false
strict: false

datasets:
  - path: ./mar_alpaca_dataset.json
    type: alpaca
    ds_type: json
dataset_prepared_path:
dataset_processes: 16
val_set_size: 0
output_dir: ./lora-out

adapter: lora
lora_model_dir:

gpu_memory_limit: 76

sequence_len: 1048
sample_packing: true
pad_to_sequence_len: true

lora_r: 256
lora_alpha: 256
lora_dropout: 0.05
lora_target_modules: 
  - q_proj
  - v_proj
  - k_proj
  - o_proj
  - gate_proj
  - down_proj
  - up_proj
#lora_modules_to_save:
#  - embed_tokens
#  - lm_head
lora_target_linear: true
lora_fan_in_fan_out:

save_safetensors: True

gradient_accumulation_steps: 2
micro_batch_size: 12
num_epochs: 1
optimizer: adamw_torch
lr_scheduler: cosine
learning_rate: 0.0002
warmup_steps: 20
save_steps: 5000

train_on_inputs: false
group_by_length: false
bf16: auto
fp16:
tf32: false

gradient_checkpointing: true
early_stopping_patience:
resume_from_checkpoint:
local_rank:
logging_steps: 200
xformers_attention:
flash_attention: True

evals_per_epoch: 1
eval_table_size:
eval_max_new_tokens: 128
eval_sample_packing: False
debug:
deepspeed:
weight_decay: 0.0
fsdp:
fsdp_config:
special_tokens:

Possible solution

This error shouldn't happen; some additional change is probably needed to support SFT for the Llama3 chat (Instruct) model.

Which Operating Systems are you using?

Python Version

3.10

axolotl branch-commit

latest

Acknowledgements

winglian commented 4 months ago

At the bottom of your YAML, you'll want to set the special tokens as:

special_tokens:
  pad_token: <|end_of_text|>
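
For anyone landing here later, that YAML setting amounts to roughly the following at the tokenizer level (a minimal sketch of the effect, not axolotl's internal code path):

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B-Instruct")

# <|end_of_text|> is already in the Llama3 vocabulary, so this just points
# pad_token at an existing token id instead of adding a new one.
tokenizer.pad_token = "<|end_of_text|>"
print(tokenizer.pad_token, tokenizer.pad_token_id)

# Padding a batch now succeeds instead of raising the ValueError above.
batch = tokenizer(["hello", "a longer example sentence"], padding=True, return_tensors="pt")
print(batch["input_ids"].shape)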
amitagh commented 4 months ago

Thanks. This would also need lora_modules_to_save to be added, correct?
