axolotl-ai-cloud / axolotl
https://axolotl-ai-cloud.github.io/axolotl/

# dataset type sharegpt no longer works #1649

Open · thepowerfuldeez opened this issue 3 months ago

thepowerfuldeez commented 3 months ago

Please check that this issue hasn't been reported before.

### Expected Behavior

dataset type sharegpt should work

### Current behaviour

I'm trying to use the OpenHermes sharegpt dataset in my config:

```yaml
datasets:
  - path: teknium/OpenHermes-2.5
    type: sharegpt
```

but I get this error:

File "/home/george/axolotl/src/axolotl/cli/init.py", line 403, in load_datasets train_dataset, eval_dataset, total_num_steps, prompters = prepare_dataset( ^^^^^^^^^^^^^^^^ File "/home/george/axolotl/src/axolotl/utils/data/sft.py", line 66, in prepare_dataset train_dataset, eval_dataset, prompters = load_prepare_datasets( ^^^^^^^^^^^^^^^^^^^^^^ File "/home/george/axolotl/src/axolotl/utils/data/sft.py", line 460, in load_prepare_datasets dataset, prompters = load_tokenized_prepared_datasets( ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/home/george/axolotl/src/axolotl/utils/data/sft.py", line 399, in load_tokenized_prepared_datasets dataset_wrapper, dataset_prompter = get_dataset_wrapper( ^^^^^^^^^^^^^^^^^^^^ File "/home/george/axolotl/src/axolotl/utils/data/sft.py", line 677, in get_dataset_wrapper raise ValueError( ValueError: unhandled prompt tokenization strategy: sharegpt


I rolled back to the commit with hash 7018576 and it works there.

### Steps to reproduce

Change the config to use the sharegpt dataset type:

```yaml
chat_template: chatml
datasets:
  - path: teknium/OpenHermes-2.5
    type: sharegpt
```

Run preprocess:

```
accelerate launch -m axolotl.cli.preprocess CONFIG
```


### Config yaml

```yaml
base_model: meta-llama/Meta-Llama-3-8B-Instruct
model_type: LlamaForCausalLM
tokenizer_type: AutoTokenizer

load_in_8bit: false
load_in_4bit: true
strict: false

chat_template: chatml
datasets:
  - path: teknium/OpenHermes-2.5
    type: sharegpt
dataset_prepared_path: last_run_prepared
val_set_size: 0.03
output_dir: ./checkpoints/qlora-llama3-openhermes-loraplus

adapter: qlora
lora_model_dir:

sequence_len: 8192
sample_packing: true
pad_to_sequence_len: true

loraplus_lr_ratio: 2e3 # loraplus learning rate ratio lr_B / lr_A. Recommended value is 2^4.
loraplus_lr_embedding: 1e-6 #  loraplus learning rate for lora embedding layers. Default value is 1e-6.
lora_r: 32
lora_alpha: 16
lora_dropout: 0.05
lora_target_modules:
lora_target_linear: true
lora_fan_in_fan_out:

wandb_project: axolotl
wandb_entity: thepowerfuldeez
wandb_watch:
wandb_name: llama3_qlora_openhermes_loraplus
wandb_log_model:

gradient_accumulation_steps: 8
micro_batch_size: 1
num_epochs: 1
optimizer: adamw_torch
lr_scheduler: cosine
learning_rate: 0.0001

train_on_inputs: false
group_by_length: false
bf16: auto
fp16:
tf32: false

gradient_checkpointing: true
gradient_checkpointing_kwargs:
  use_reentrant: true
early_stopping_patience:
resume_from_checkpoint:
local_rank:
logging_steps: 1
xformers_attention:
flash_attention: true

warmup_steps: 10
evals_per_epoch: 4
eval_table_size:
saves_per_epoch: 1
debug:
deepspeed:
weight_decay: 0.0
fsdp:
  - full_shard
  - auto_wrap
fsdp_config:
  fsdp_limit_all_gathers: true
  fsdp_sync_module_states: true
  fsdp_offload_params: true
  fsdp_use_orig_params: false
  fsdp_cpu_ram_efficient_loading: true
  fsdp_auto_wrap_policy: TRANSFORMER_BASED_WRAP
  fsdp_transformer_layer_cls_to_wrap: LlamaDecoderLayer
  fsdp_state_dict_type: FULL_STATE_DICT
  fsdp_sharding_strategy: FULL_SHARD
special_tokens:
  pad_token: <|end_of_text|>
```

### Possible solution

No response

### Which Operating Systems are you using?

### Python Version

3.10

### axolotl branch-commit

main/4d6490b


venkatasg commented 3 months ago

Could this be related to #1624? Although if this comes up with just the preprocess command, I presume it might be something else.

thepowerfuldeez commented 3 months ago

@venkatasg I don't think so. Training with other dataset formats works for me.

CyberNativeAI commented 3 months ago

Can you try it like this?

```yaml
datasets:
  - path: teknium/OpenHermes-2.5
    conversation: chatml
    type: sharegpt
```

timpal0l commented 3 months ago

I have the same issue during training while trying to fine-tune on OpenHermes-2.5. I've tried:

```yaml
datasets:
  - path: teknium/OpenHermes-2.5
    conversation: chatml
    type: sharegpt
```
```
[2024-05-23 14:57:44,691] [INFO] [axolotl.load_tokenizer:294] [PID:3848826] [RANK:1] No Chat template selected. Consider adding a chat template for easier inference.
Tokenizing Prompts (num_proc=64):   0%| | 0/1001202 [00:22<?, ? examples/s]

[rank0]: The above exception was the direct cause of the following exception:

[rank0]: Traceback (most recent call last):
[rank0]:   File "/home/nlu/anaconda3/envs/axolotl/lib/python3.11/site-packages/multiprocess/pool.py", line 125, in worker
[rank0]:     result = (True, func(*args, **kwds))
[rank0]:                     ^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/home/nlu/anaconda3/envs/axolotl/lib/python3.11/site-packages/datasets/utils/py_utils.py", line 678, in _write_generator_to_queue
[rank0]:     for i, result in enumerate(func(**kwargs)):
[rank0]:   File "/home/nlu/anaconda3/envs/axolotl/lib/python3.11/site-packages/datasets/arrow_dataset.py", line 3517, in _map_single
[rank0]:     example = apply_function_on_filtered_inputs(example, i, offset=offset)
[rank0]:               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/home/nlu/anaconda3/envs/axolotl/lib/python3.11/site-packages/datasets/arrow_dataset.py", line 3416, in apply_function_on_filtered_inputs
[rank0]:     processed_inputs = function(*fn_args, *additional_args, **fn_kwargs)
[rank0]:                        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/home/nlu/axolotl/src/axolotl/prompt_tokenizers.py", line 464, in tokenize_prompt
[rank0]:     raise InvalidDataException(str(err)) from err
[rank0]: axolotl.prompt_tokenizers.InvalidDataException: 'conversations'
[rank0]: """
JordiBayarri commented 3 months ago

If you are training Llama 3, just add this to your config file:

```yaml
datasets:
  - path: teknium/OpenHermes-2.5
    type: sharegpt
    conversation: llama3

chat_template: llama3
```

I haven't tested chatml, but llama3 works fine for me.

CyberNativeAI commented 3 months ago

Yup, same for chatml: `chat_template: chatml`.
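
Putting the two pieces together, this is roughly the shape of the config that reportedly works, just a sketch of JordiBayarri's llama3 example with chatml substituted in, not re-tested in exactly this form:

```yaml
# Sketch only: the llama3 layout from the comment above with chatml swapped in,
# matching the "same for chatml" report; treat it as untested as written.
chat_template: chatml
datasets:
  - path: teknium/OpenHermes-2.5
    type: sharegpt
    conversation: chatml
```
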

wayne-wang-1119 commented 3 months ago

+1, not working even with the hotfix of adding `chat_template`.

ryj0902 commented 3 months ago

It seems to be the same issue as this one: #1614

qeternity commented 1 month ago

This project has become so ridiculously unstable that I think it's time to look for alternatives.