jsonl broken, will only read as json

carelesswhisp commented 5 months ago

any time I try to use the JSONL I get this error:

03:29:19-716909 INFO Loading JSONL datasets...
Traceback (most recent call last):
  File "/media/cher/brains/text-generation-webui/installer_files/env/lib/python3.11/site-packages/gradio/queueing.py", line 407, in call_prediction
    output = await route_utils.call_process_api(
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/media/cher/brains/text-generation-webui/installer_files/env/lib/python3.11/site-packages/gradio/route_utils.py", line 226, in call_process_api
    output = await app.get_blocks().process_api(
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/media/cher/brains/text-generation-webui/installer_files/env/lib/python3.11/site-packages/gradio/blocks.py", line 1550, in process_api
    result = await self.call_function(
             ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/media/cher/brains/text-generation-webui/installer_files/env/lib/python3.11/site-packages/gradio/blocks.py", line 1199, in call_function
    prediction = await utils.async_iteration(iterator)
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/media/cher/brains/text-generation-webui/installer_files/env/lib/python3.11/site-packages/gradio/utils.py", line 519, in async_iteration
    return await iterator.__anext__()
           ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/media/cher/brains/text-generation-webui/installer_files/env/lib/python3.11/site-packages/gradio/utils.py", line 512, in __anext__
    return await anyio.to_thread.run_sync(
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/media/cher/brains/text-generation-webui/installer_files/env/lib/python3.11/site-packages/anyio/to_thread.py", line 56, in run_sync
    return await get_async_backend().run_sync_in_worker_thread(
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/media/cher/brains/text-generation-webui/installer_files/env/lib/python3.11/site-packages/anyio/_backends/_asyncio.py", line 2134, in run_sync_in_worker_thread
    return await future
           ^^^^^^^^^^^^
  File "/media/cher/brains/text-generation-webui/installer_files/env/lib/python3.11/site-packages/anyio/_backends/_asyncio.py", line 851, in run
    result = context.run(func, *args)
             ^^^^^^^^^^^^^^^^^^^^^^^^
  File "/media/cher/brains/text-generation-webui/installer_files/env/lib/python3.11/site-packages/gradio/utils.py", line 495, in run_sync_iterator_async
    return next(iterator)
           ^^^^^^^^^^^^^^
  File "/media/cher/brains/text-generation-webui/installer_files/env/lib/python3.11/site-packages/gradio/utils.py", line 649, in gen_wrapper
    yield from f(*args, **kwargs)
  File "/media/cher/brains/text-generation-webui/extensions/Training_PRO_wip/script.py", line 466, in check_dataset
    loaded_JSONLdata = json.load(dataFile)
                       ^^^^^^^^^^^^^^^^^^^
  File "/media/cher/brains/text-generation-webui/installer_files/env/lib/python3.11/json/__init__.py", line 293, in load
    return loads(fp.read(),
           ^^^^^^^^^^^^^^^^
  File "/media/cher/brains/text-generation-webui/installer_files/env/lib/python3.11/json/__init__.py", line 346, in loads
    return _default_decoder.decode(s)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/media/cher/brains/text-generation-webui/installer_files/env/lib/python3.11/json/decoder.py", line 340, in decode
    raise JSONDecodeError("Extra data", s, end)
json.decoder.JSONDecodeError: Extra data: line 2 column 1 (char 4268)

It looks like it's loading every JSONL file as plain JSON, so the second line will always cause this error. This seems to happen with every model; so far I've tried GPT2, mistral and lmsys_vicuna.
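For anyone hitting the same thing: "Extra data" is what `json.load` reports when a file holds more than one top-level JSON value, which is what a JSONL file is by definition. Here is a minimal sketch of the difference using only the standard library. The sample data and the `read_jsonl` helper are made up for illustration and are not the extension's code.

```python
import io
import json

# Two records, one per line: valid JSONL, but not a single JSON document.
sample = (
    '{"messages": [{"role": "user", "content": "hi"}]}\n'
    '{"messages": [{"role": "user", "content": "bye"}]}\n'
)


def read_jsonl(fp):
    """Parse a JSONL stream: one JSON value per non-empty line (hypothetical helper)."""
    return [json.loads(line) for line in fp if line.strip()]


# json.load expects exactly one top-level value, so it parses line 1 and
# then raises JSONDecodeError("Extra data") when it finds line 2.
try:
    json.load(io.StringIO(sample))
    whole_file_error = None
except json.JSONDecodeError as exc:
    whole_file_error = exc.msg

records = read_jsonl(io.StringIO(sample))
print(whole_file_error)
print(len(records))
```

Reading line by line is the usual way to handle JSONL; whether the extension should switch to it is up to the maintainer.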

carelesswhisp commented 5 months ago

Something that can take normal JSONL, like GPT does, would be great, so I can essentially transcribe a show and have the AI take on the personality of a character while keeping the full context of an episode. For example:

{"messages": [{"role": "user", "content": "text text text"}, {"role": "assistant", "content": "text text"}, {"role": "user", "content": "text text"},

Sohex commented 4 months ago

JSONL 'works', but the extension needs it to be formatted incorrectly. Wrap the whole file in an array (i.e. put [ at the start and ] at the end), add a comma at the end of every line but the last, and it'll work.

To clarify, the correct format for jsonl looks like this:

{"messages": [{"role": "system", "content": "prompt"}, {"role": "user", "content": "input"}, {"role": "assistant", "content": "output"}]}
{"messages": [{"role": "system", "content": "prompt"}, {"role": "user", "content": "input"}, {"role": "assistant", "content": "output"}]}
{"messages": [{"role": "system", "content": "prompt"}, {"role": "user", "content": "input"}, {"role": "assistant", "content": "output"}]}
...

Whereas right now Training_PRO expects:

[
{"messages": [{"role": "system", "content": "prompt"}, {"role": "user", "content": "input"}, {"role": "assistant", "content": "output"}]},
{"messages": [{"role": "system", "content": "prompt"}, {"role": "user", "content": "input"}, {"role": "assistant", "content": "output"}]},
{"messages": [{"role": "system", "content": "prompt"}, {"role": "user", "content": "input"}, {"role": "assistant", "content": "output"}]},
...]
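If you'd rather not edit the file by hand, the conversion can be scripted. This is only a sketch, assuming the extension keeps calling `json.load` on the whole file as the traceback above shows; `jsonl_to_array_text` is a hypothetical helper name and the two sample records are placeholders.

```python
import json


def jsonl_to_array_text(jsonl_text):
    """Turn real JSONL (one object per line) into a single JSON array string."""
    records = [json.loads(line) for line in jsonl_text.splitlines() if line.strip()]
    # json.dumps adds the surrounding brackets and the commas between records.
    return json.dumps(records, ensure_ascii=False, indent=1)


jsonl_text = (
    '{"messages": [{"role": "user", "content": "input"}, {"role": "assistant", "content": "output"}]}\n'
    '{"messages": [{"role": "user", "content": "input 2"}, {"role": "assistant", "content": "output 2"}]}\n'
)

converted = jsonl_to_array_text(jsonl_text)
print(converted)
```

To use it on a real dataset, read your file into `jsonl_text` and write `converted` back out under a new name, so the original JSONL stays valid for other tools.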

carelesswhisp commented 4 months ago

Ah, gotcha. Though I do notice it now gives the error "raise TemplateError(message) jinja2.exceptions.TemplateError: Conversation roles must alternate user/assistant/user/assistant/..."

That means it can't take something like a script and format it. Can we just modify this template?

Edit: just to explain, this isn't an issue with Training PRO (I think); it has to do with the chat template embedded in the tokenizer, so whichever model you're using determines the accepted format. vicuna - v1.1 actually seems to work out of the box.
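One possible workaround for script-style data, sketched under assumptions: if the template only objects to back-to-back messages with the same role, you can merge those before training. `roles_alternate` and `merge_consecutive` are hypothetical helpers, not part of Training_PRO or any tokenizer, and whether a leading system message is accepted depends on the model's template.

```python
def roles_alternate(messages):
    """True if, ignoring an optional leading system message, roles go user/assistant/user/..."""
    turns = messages[1:] if messages and messages[0]["role"] == "system" else messages
    expected = ("user", "assistant")
    return all(m["role"] == expected[i % 2] for i, m in enumerate(turns))


def merge_consecutive(messages, sep="\n"):
    """Join adjacent messages that share a role into one message."""
    merged = []
    for m in messages:
        if merged and merged[-1]["role"] == m["role"]:
            merged[-1]["content"] += sep + m["content"]
        else:
            merged.append(dict(m))  # copy so the input list is left untouched
    return merged


# A script-like conversation where the same speaker has several lines in a row.
script = [
    {"role": "user", "content": "line one"},
    {"role": "user", "content": "line two"},
    {"role": "assistant", "content": "reply one"},
    {"role": "assistant", "content": "reply two"},
    {"role": "user", "content": "line three"},
]

merged = merge_consecutive(script)
print(roles_alternate(script), roles_alternate(merged))
```

Merging only removes repeated roles; it does not fix a conversation that starts with the assistant, so that case would still need handling separately.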

FartyPants / Training_PRO

jsonl broken, will only read as json #17