axolotl-ai-cloud / axolotl

Preprocess failure for llama3 instruct prompt #1614

ryj0902 closed this issue 4 months ago

ryj0902 commented 5 months ago

Expected Behavior

python -m axolotl.cli.preprocess test.yaml --debug should succeed, as shown below (different dataset, run before #1553 was merged):

...
[2024-04-30 04:24:24,115] [DEBUG] [axolotl.normalize_config:79] [PID:67864] [RANK:0] bf16 support detected, enabling for this configuration.                                                                            
[2024-04-30 04:24:24,560] [INFO] [axolotl.normalize_config:182] [PID:67864] [RANK:0] GPU memory usage baseline: 0.000GB (+0.609GB misc)                                                                                 
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.                                                                                                   
[2024-04-30 04:24:25,611] [DEBUG] [axolotl.load_tokenizer:279] [PID:67864] [RANK:0] EOS: 128001 / <|end_of_text|>                                                                                                       
[2024-04-30 04:24:25,611] [DEBUG] [axolotl.load_tokenizer:280] [PID:67864] [RANK:0] BOS: 128000 / <|begin_of_text|>                                                                                                     
[2024-04-30 04:24:25,611] [DEBUG] [axolotl.load_tokenizer:281] [PID:67864] [RANK:0] PAD: 128001 / <|end_of_text|>                                                                                                       
[2024-04-30 04:24:25,612] [DEBUG] [axolotl.load_tokenizer:282] [PID:67864] [RANK:0] UNK: None / None                                                                                                                    
[2024-04-30 04:24:25,612] [INFO] [axolotl.load_tokenizer:293] [PID:67864] [RANK:0] No Chat template selected. Consider adding a chat template for easier inference.                                                     
[2024-04-30 04:24:25,612] [INFO] [axolotl.load_tokenized_prepared_datasets:183] [PID:67864] [RANK:0] Unable to find prepared dataset in /home/llm/data/last_run_prepared/d453405904283e23b947ef88b2e1e328               
[2024-04-30 04:24:25,612] [INFO] [axolotl.load_tokenized_prepared_datasets:184] [PID:67864] [RANK:0] Loading raw datasets...                                                                                            
[2024-04-30 04:24:26,497] [INFO] [axolotl.load_tokenized_prepared_datasets:410] [PID:67864] [RANK:0] merging datasets                                                                                                   
[2024-04-30 04:24:26,503] [DEBUG] [axolotl.log:61] [PID:67864] [RANK:0] min_input_len: 134                                                                                                                              
[2024-04-30 04:24:26,503] [DEBUG] [axolotl.log:61] [PID:67864] [RANK:0] max_input_len: 134                                                                                                                              
Dropping Long Sequences (num_proc=96): 100%|██████████████████████████████████████| 100/100 [00:00<00:00, 182.17 examples/s]
Add position_id column (Sample Packing) (num_proc=96): 100%|█████████████████████████████████| 100/100 [00:00<00:00, 155.16 examples/s]
[2024-04-30 04:24:29,969] [INFO] [axolotl.load_tokenized_prepared_datasets:423] [PID:67864] [RANK:0] Saving merged prepared dataset to disk... /home/llm/data/last_run_prepared/d453405904283e23b947ef88b2e1e328        
Saving the dataset (1/1 shards): 100%|███████████████████████████████████████| 100/100 [00:00<00:00, 3510.52 examples/s]
[2024-04-30 04:24:30,026] [DEBUG] [axolotl.log:61] [PID:67864] [RANK:0] total_num_tokens: 12_730                                                                                                                        
[2024-04-30 04:24:30,028] [DEBUG] [axolotl.log:61] [PID:67864] [RANK:0] total_supervised_tokens: 380
...

Current behaviour

It fails with the error messages below:

...
[2024-05-14 05:16:58,815] [WARNING] [axolotl.utils.config.models.input.hint_lora_8bit:924] [PID:40540] [RANK:0] We recommend setting `load_in_8bit: true` for LORA finetuning
[2024-05-14 05:16:58,815] [DEBUG] [axolotl.normalize_config:79] [PID:40540] [RANK:0] bf16 support detected, enabling for this configuration.
[2024-05-14 05:16:59,230] [INFO] [axolotl.normalize_config:182] [PID:40540] [RANK:0] GPU memory usage baseline: 0.000GB (+0.047GB misc)
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
[2024-05-14 05:17:00,142] [DEBUG] [axolotl.load_tokenizer:280] [PID:40540] [RANK:0] EOS: 128009 / <|eot_id|>
[2024-05-14 05:17:00,142] [DEBUG] [axolotl.load_tokenizer:281] [PID:40540] [RANK:0] BOS: 128000 / <|begin_of_text|>
[2024-05-14 05:17:00,142] [DEBUG] [axolotl.load_tokenizer:282] [PID:40540] [RANK:0] PAD: 128001 / <|end_of_text|>
[2024-05-14 05:17:00,142] [DEBUG] [axolotl.load_tokenizer:283] [PID:40540] [RANK:0] UNK: None / None
[2024-05-14 05:17:00,142] [INFO] [axolotl.load_tokenizer:294] [PID:40540] [RANK:0] No Chat template selected. Consider adding a chat template for easier inference.
[2024-05-14 05:17:00,143] [INFO] [axolotl.load_tokenized_prepared_datasets:183] [PID:40540] [RANK:0] Unable to find prepared dataset in /home/work/.test/data/last_run_prepared/d58d05328886fb932ef4b2db9de5724d
[2024-05-14 05:17:00,143] [INFO] [axolotl.load_tokenized_prepared_datasets:184] [PID:40540] [RANK:0] Loading raw datasets...
[2024-05-14 05:17:00,792] [ERROR] [axolotl.get_dataset_wrapper:674] [PID:40540] [RANK:0] unhandled prompt tokenization strategy: sharegpt. 
Traceback (most recent call last):
  File "/usr/lib/python3.10/runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/usr/lib/python3.10/runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "/home/work/.test/axolotl/src/axolotl/cli/preprocess.py", line 82, in <module>
    fire.Fire(do_cli)
  File "/usr/local/lib/python3.10/dist-packages/fire/core.py", line 138, in Fire
    component_trace = _Fire(component, args, parsed_flag_args, context, name)
  File "/usr/local/lib/python3.10/dist-packages/fire/core.py", line 463, in _Fire
    component, remaining_args = _CallAndUpdateTrace(
  File "/usr/local/lib/python3.10/dist-packages/fire/core.py", line 672, in _CallAndUpdateTrace
    component = fn(*varargs, **kwargs)
  File "/home/work/.test/axolotl/src/axolotl/cli/preprocess.py", line 72, in do_cli
    load_datasets(cfg=parsed_cfg, cli_args=parsed_cli_args)
  File "/home/work/.test/axolotl/src/axolotl/cli/__init__.py", line 403, in load_datasets
    train_dataset, eval_dataset, total_num_steps, prompters = prepare_dataset(
  File "/home/work/.test/axolotl/src/axolotl/utils/data/sft.py", line 66, in prepare_dataset
    train_dataset, eval_dataset, prompters = load_prepare_datasets(
  File "/home/work/.test/axolotl/src/axolotl/utils/data/sft.py", line 460, in load_prepare_datasets
    dataset, prompters = load_tokenized_prepared_datasets(
  File "/home/work/.test/axolotl/src/axolotl/utils/data/sft.py", line 399, in load_tokenized_prepared_datasets
    dataset_wrapper, dataset_prompter = get_dataset_wrapper(
  File "/home/work/.test/axolotl/src/axolotl/utils/data/sft.py", line 677, in get_dataset_wrapper
    raise ValueError(
ValueError: unhandled prompt tokenization strategy: sharegpt

Interestingly, the training command runs fine without any errors: accelerate launch -m axolotl.cli.train configs/test.yaml
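
For what it's worth, the sharegpt strategy resolves the conversation template by name, which (as far as I can tell) comes from FastChat's template registry that axolotl's register_llama3_template() populates. A quick way to check whether the template ever got registered, assuming fastchat is importable:

# Check whether a "llama3" conversation template is present in FastChat's
# registry. This assumes axolotl registers it there via
# register_llama3_template(); fastchat itself does not define "llama3".
from fastchat.conversation import conv_templates

print("llama3" in conv_templates)  # False unless a CLI entrypoint registered it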

Steps to reproduce

I don't think the data itself matters, but here is some example data (one line of val.jsonl):

{"conversations": [{"from": "system", "value": "You are a helpful AI assistant."}, {"from": "user", "value": "Question: A junior orthopaedic surgery resident is completing a carpal tunnel repair with the department chairman as the attending physician. During the case, the resident inadvertently cuts a flexor tendon. The tendon is repaired without complication. The attending tells the resident that the patient will do fine, and there is no need to report this minor complication that will not harm the patient, as he does not want to make the patient worry unnecessarily. He tells the resident to leave this complication out of the operative report. Which of the following is the correct next action for the resident to take?\nA. Disclose the error to the patient and put it in the operative report\nB. Tell the attending that he cannot fail to disclose this mistake\nC. Report the physician to the ethics committee\nD. Refuse to dictate the operative report\n"}, {"from": "assistant", "value": "Answer: B. Tell the attending that he cannot fail to disclose this mistake"}]}

Run python -m axolotl.cli.preprocess test.yaml --debug (a sketch for generating matching data files follows below).
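
For a self-contained reproduction, a small script like this (a hypothetical helper; the record is abbreviated from the sample above, and the paths match the datasets section of the config) can generate the train/val files:

# Hypothetical helper that writes the abbreviated sample record to the
# train/val paths referenced in the config below.
import json
from pathlib import Path

record = {
    "conversations": [
        {"from": "system", "value": "You are a helpful AI assistant."},
        {"from": "user", "value": "Question: ..."},
        {"from": "assistant", "value": "Answer: ..."},
    ]
}

out_dir = Path("/home/work/.test/data/pubmed")
out_dir.mkdir(parents=True, exist_ok=True)
for name in ("train.jsonl", "val.jsonl"):
    with open(out_dir / name, "w", encoding="utf-8") as fh:
        for _ in range(100):  # the expected-behavior logs show 100 examples
            fh.write(json.dumps(record) + "\n")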

Config yaml

base_model: meta-llama/Meta-Llama-3-8B-Instruct
model_type: LlamaForCausalLM
tokenizer_type: AutoTokenizer

load_in_8bit: false
load_in_4bit: false
strict: false

data_seed: 49
seed: 49

datasets:
  - path: /home/work/.test/data/pubmed/train.jsonl
    type: sharegpt
    conversation: llama3
    train_on_split: train

  - path: /home/work/.test/data/pubmed/val.jsonl
    type: sharegpt
    conversation: llama3
    train_on_split: validation
dataset_prepared_path: /home/work/.test/data/last_run_prepared
output_dir: ./out

sequence_len: 4096
sample_packing: true
pad_to_sequence_len: true

adapter: lora
lora_model_dir:
lora_r: 32
lora_alpha: 16
lora_dropout: 0.05
lora_target_linear: true
lora_fan_in_fan_out:

gradient_accumulation_steps: 8
micro_batch_size: 4
num_epochs: 1
optimizer: adamw_torch
lr_scheduler: cosine
learning_rate: 2e-5

train_on_inputs: false
group_by_length: false
bf16: auto
fp16:
tf32: false

gradient_checkpointing: true
early_stopping_patience:
resume_from_checkpoint:
logging_steps:
xformers_attention:
flash_attention: true

warmup_steps: 100
evals_per_epoch: 1
eval_table_size:
saves_per_epoch: 1
debug:
deepspeed:
weight_decay: 0.0
special_tokens:
  pad_token: <|end_of_text|>

Possible solution

No response

Python Version

3.10.12

axolotl branch-commit

main/2147cf68 Llama3 dpo (#1610)

ryj0902 commented 4 months ago

Solved by adding chat_template: llama3 to the config file.

Previously, even if chat_template was not declared in the config, register_llama3_template() was still called via the else branch. After the code was modified (see the diff below), the setting became mandatory.

diff --git a/src/axolotl/cli/preprocess.py b/src/axolotl/cli/preprocess.py
index a95427d..e7b3596 100644
--- a/src/axolotl/cli/preprocess.py
+++ b/src/axolotl/cli/preprocess.py
@@ -39,21 +39,22 @@ def do_cli(config: Union[Path, str] = Path("examples/"), **kwargs):
         return_remaining_strings=True
     )

-    if parsed_cfg.chat_template == "chatml" and parsed_cfg.default_system_message:
-        LOG.info(
-            f"ChatML set. Adding default system message: {parsed_cfg.default_system_message}"
-        )
-        register_chatml_template(parsed_cfg.default_system_message)
-    else:
-        register_chatml_template()
-
-    if parsed_cfg.chat_template == "llama3" and parsed_cfg.default_system_message:
-        LOG.info(
-            f"LLaMA-3 set. Adding default system message: {parsed_cfg.default_system_message}"
-        )
-        register_llama3_template(parsed_cfg.default_system_message)
-    else:
-        register_llama3_template()
+    if parsed_cfg.chat_template == "chatml":
+        if parsed_cfg.default_system_message:
+            LOG.info(
+                f"ChatML set. Adding default system message: {parsed_cfg.default_system_message}"
+            )
+            register_chatml_template(parsed_cfg.default_system_message)
+        else:
+            register_chatml_template()
+    elif parsed_cfg.chat_template == "llama3":
+        if parsed_cfg.default_system_message:
+            LOG.info(
+                f"LLaMA-3 set. Adding default system message: {parsed_cfg.default_system_message}"
+            )
+            register_llama3_template(parsed_cfg.default_system_message)
+        else:
+            register_llama3_template()
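
To make the behavioral difference concrete, here is a minimal, runnable sketch of the two flows (only the llama3 half is shown, and register_llama3_template is a stub, not axolotl's real function):

# Minimal sketch contrasting the old and new registration flows from the
# diff above; register_llama3_template here is a stand-in for axolotl's.
def register_llama3_template(system_message=None):
    print(f"registered llama3 template (system_message={system_message!r})")

def old_flow(chat_template, default_system_message=None):
    # Old: the else branch runs for ANY chat_template value, so the llama3
    # template was registered even when chat_template was unset.
    if chat_template == "llama3" and default_system_message:
        register_llama3_template(default_system_message)
    else:
        register_llama3_template()

def new_flow(chat_template, default_system_message=None):
    # New: registration happens only when chat_template == "llama3", which
    # is why the config must now declare it explicitly.
    if chat_template == "llama3":
        if default_system_message:
            register_llama3_template(default_system_message)
        else:
            register_llama3_template()

old_flow(None)  # still registers the template
new_flow(None)  # registers nothing -> sharegpt/llama3 preprocessing fails
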
Hasan-Demez commented 1 day ago

mine looks like this

if parsed_cfg.chat_template == "chatml":
    if parsed_cfg.default_system_message:
        LOG.info(
            f"ChatML set. Adding default system message: {parsed_cfg.default_system_message}"
        )
        register_chatml_template(parsed_cfg.default_system_message)
    else:
        register_chatml_template()
elif parsed_cfg.chat_template == "llama3":
    if parsed_cfg.default_system_message:
        LOG.info(
            f"LLaMA-3 set. Adding default system message: {parsed_cfg.default_system_message}"
        )
        register_llama3_template(parsed_cfg.default_system_message)
    else:
        register_llama3_template()

they are basically the same

Hasan-Demez commented 1 day ago

nvm i changed it to yours and it worked