Open thepowerfuldeez opened 3 months ago
Could this be related to #1624? Although I presume that if it comes up with just the preprocess command, it might be something else.
@venkatasg I don't think so. Training with other dataset formats works for me.
Can you try like this?
```yaml
datasets:
  - path: teknium/OpenHermes-2.5
    conversation: chatml
    type: sharegpt
```
I have the same issue during training, trying to finetune on OpenHermes-2.5. I've tried:
```yaml
datasets:
  - path: teknium/OpenHermes-2.5
    conversation: chatml
    type: sharegpt
```
```
[2024-05-23 14:57:44,691] [INFO] [axolotl.load_tokenizer:294] [PID:3848826] [RANK:1] No Chat template selected. Consider adding a chat template for easier inference.
Tokenizing Prompts (num_proc=64):   0%| | 0/1001202 [00:22<?, ? examples/s]
[rank0]: The above exception was the direct cause of the following exception:
[rank0]: Traceback (most recent call last):
[rank0]:   File "/home/nlu/anaconda3/envs/axolotl/lib/python3.11/site-packages/multiprocess/pool.py", line 125, in worker
[rank0]:     result = (True, func(*args, **kwds))
[rank0]:              ^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/home/nlu/anaconda3/envs/axolotl/lib/python3.11/site-packages/datasets/utils/py_utils.py", line 678, in _write_generator_to_queue
[rank0]:     for i, result in enumerate(func(**kwargs)):
[rank0]:   File "/home/nlu/anaconda3/envs/axolotl/lib/python3.11/site-packages/datasets/arrow_dataset.py", line 3517, in _map_single
[rank0]:     example = apply_function_on_filtered_inputs(example, i, offset=offset)
[rank0]:               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/home/nlu/anaconda3/envs/axolotl/lib/python3.11/site-packages/datasets/arrow_dataset.py", line 3416, in apply_function_on_filtered_inputs
[rank0]:     processed_inputs = function(*fn_args, *additional_args, **fn_kwargs)
[rank0]:                        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/home/nlu/axolotl/src/axolotl/prompt_tokenizers.py", line 464, in tokenize_prompt
[rank0]:     raise InvalidDataException(str(err)) from err
[rank0]: axolotl.prompt_tokenizers.InvalidDataException: 'conversations'
[rank0]: """
```
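The `InvalidDataException: 'conversations'` wraps a `KeyError`: the sharegpt tokenization path expects each dataset row to carry a `conversations` list of turns. A minimal sketch of that expectation, assuming the usual ShareGPT field names (`check_sharegpt_row` is a hypothetical helper for illustration, not axolotl code):

```python
# Hypothetical validator sketch -- not axolotl code. It illustrates the row
# shape the sharegpt strategy expects: a "conversations" list of turns,
# each turn a dict with "from" (speaker) and "value" (text) keys.
def check_sharegpt_row(row: dict) -> list[dict]:
    try:
        turns = row["conversations"]  # missing key -> KeyError('conversations')
    except KeyError as err:
        # axolotl re-raises the error as InvalidDataException(str(err)),
        # which is why the log shows only: 'conversations'
        raise ValueError(f"row is missing key {err}") from err
    for turn in turns:
        assert "from" in turn and "value" in turn, "malformed turn"
    return turns

good = {"conversations": [{"from": "human", "value": "hi"},
                          {"from": "gpt", "value": "hello"}]}
bad = {"text": "hi"}  # a renamed or remapped column triggers the error above

print(len(check_sharegpt_row(good)))  # 2
try:
    check_sharegpt_row(bad)
except ValueError as e:
    print(e)
```

So the error usually means the rows reaching the tokenizer no longer have a `conversations` column, rather than the column's contents being malformed.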
If you are training Llama 3, just add this to your config file:
```yaml
datasets:
  - path: teknium/OpenHermes-2.5
    type: sharegpt
    conversation: llama3
chat_template: llama3
```
I haven't tested chatml, but llama3 works fine for me.
Yup, same for chatml:

```yaml
chat_template: chatml
```
+1, not working even with the hotfix of adding chat_template.
It seems to be the same issue as this one: #1614
This project has become so ridiculously unstable, I think it's time to look for alternatives.
Please check that this issue hasn't been reported before.
Expected Behavior
The sharegpt dataset type should work.
Current behaviour
I am trying to use the OpenHermes sharegpt dataset in the config:
```
  File "/home/george/axolotl/src/axolotl/cli/__init__.py", line 403, in load_datasets
    train_dataset, eval_dataset, total_num_steps, prompters = prepare_dataset(
                                                              ^^^^^^^^^^^^^^^^
  File "/home/george/axolotl/src/axolotl/utils/data/sft.py", line 66, in prepare_dataset
    train_dataset, eval_dataset, prompters = load_prepare_datasets(
                                             ^^^^^^^^^^^^^^^^^^^^^^
  File "/home/george/axolotl/src/axolotl/utils/data/sft.py", line 460, in load_prepare_datasets
    dataset, prompters = load_tokenized_prepared_datasets(
                         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/george/axolotl/src/axolotl/utils/data/sft.py", line 399, in load_tokenized_prepared_datasets
    dataset_wrapper, dataset_prompter = get_dataset_wrapper(
                                        ^^^^^^^^^^^^^^^^^^^^
  File "/home/george/axolotl/src/axolotl/utils/data/sft.py", line 677, in get_dataset_wrapper
    raise ValueError(
ValueError: unhandled prompt tokenization strategy: sharegpt
```
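The final `ValueError` comes from a strategy-dispatch step: the dataset `type` string is looked up against registered handlers, and on this commit `sharegpt` evidently falls through to the unhandled branch. A simplified sketch of that dispatch pattern (the handler names are invented for illustration; axolotl's real registry differs):

```python
# Simplified sketch of a type -> handler dispatch, mirroring the error path.
# Handler names here are illustrative only, not axolotl's actual registry.
def get_dataset_wrapper(ds_type: str) -> str:
    handlers = {
        "alpaca": lambda: "alpaca_handler",
        "completion": lambda: "completion_handler",
        # if "sharegpt" is absent (e.g. dropped in a refactor), lookup fails:
    }
    if ds_type in handlers:
        return handlers[ds_type]()
    raise ValueError(f"unhandled prompt tokenization strategy: {ds_type}")
```

If the registry lookup itself is failing, no amount of tweaking `conversation` or `chat_template` in the dataset entry would help, which would match the "+1 not working even with the hotfix" reports above.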
```
accelerate launch -m axolotl.cli.preprocess CONFIG
```
Possible solution
No response
Which Operating Systems are you using?
Python Version
3.10
axolotl branch-commit
main/4d6490b
Acknowledgements