TheGermanEngie opened this issue 1 year ago
It's complaining about a missing comma in the JSON file you are loading. Where did you get this file from? Have you tried downloading it again?
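If you want to see exactly where it breaks, a quick standalone check along these lines should do it (the path is just a placeholder for wherever your copy of the file lives):

```python
import json

# Placeholder path -- point this at the JSON file you are feeding to the prepare script.
path = "data/alpaca/alpaca_gpt4_data.json"

with open(path, encoding="utf-8") as f:
    text = f.read()

try:
    records = json.loads(text)
    print(f"OK: parsed {len(records)} records")
except json.JSONDecodeError as e:
    # lineno/colno point at the first place the parser gives up,
    # e.g. a missing comma or a truncated closing bracket.
    print(f"Bad JSON at line {e.lineno}, column {e.colno}: {e.msg}")
```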
It's from https://github.com/Instruction-Tuning-with-GPT-4/GPT-4-LLM. I just cut the 52k .json file down to match the size of lit-llama's default alpaca set; it's less than a megabyte smaller. The last `]` is at line 137003.
Here's the last object:
`{ "instruction": "Analyze the given legal document and explain the key points.", "input": "The following is an excerpt from a contract between two parties, labeled \"Company A\" and \"Company B\": \n\n\"Company A agrees to provide reasonable assistance to Company B in ensuring the accuracy of the financial statements it provides. This includes allowing Company A reasonable access to personnel and other documents which may be necessary for Company B\u2019s review. Company B agrees to maintain the document provided by Company A in confidence, and will not disclose the information to any third parties without Company A\u2019s explicit permission.\"", "output": "The contract between Company A and Company B outlines an understanding between the two parties, whereby Company A agrees to provide assistance in ensuring the accuracy of financial statements provided by Company B. This assistance may include providing access to personnel and important documents, as deemed necessary by Company B. In return, Company B agrees to keep the information disclosed by Company A confidential and not share it with any third party without obtaining permission from Company A. This contract aims to promote accuracy and transparency while preserving confidentiality of sensitive information." }
]`
Sorry for spam... it's not formatting correctly in the github comments.
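For the record, cutting the file by size or line count is exactly what tends to produce a missing-comma or missing-`]` error, because the cut lands in the middle of an object or drops the closing bracket. A safer way to shrink it is to parse the whole array, slice it, and write it back out, roughly like this (file names here are assumptions):

```python
import json

# Assumed file names -- adjust to wherever the GPT-4-LLM json actually lives.
src = "alpaca_gpt4_data.json"
dst = "alpaca_gpt4_data_half.json"

with open(src, encoding="utf-8") as f:
    records = json.load(f)

# Slicing the parsed list keeps every remaining object (and the closing ]) intact.
half = records[: len(records) // 2]

with open(dst, "w", encoding="utf-8") as f:
    json.dump(half, f, ensure_ascii=False, indent=4)

print(f"Wrote {len(half)} of {len(records)} records to {dst}")
```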
I changed OSes to Pop!_OS and the issue still comes up. Perhaps it's a deeper issue.
```
Traceback (most recent call last):
  File "/home/mookie/AI/lit-llama/finetune/lora.py", line 218, in <module>
    CLI(main)
  File "/home/mookie/miniconda3/lib/python3.10/site-packages/jsonargparse/cli.py", line 85, in CLI
    return _run_component(component, cfg_init)
  File "/home/mookie/miniconda3/lib/python3.10/site-packages/jsonargparse/cli.py", line 147, in _run_component
    return component(**cfg)
  File "/home/mookie/AI/lit-llama/finetune/lora.py", line 62, in main
    train_data, val_data = load_datasets(data_dir=data_dir)
  File "/home/mookie/AI/lit-llama/finetune/lora.py", line 206, in load_datasets
    train_data = torch.load(os.path.join(data_dir, "train.pt"))
  File "/home/mookie/miniconda3/lib/python3.10/site-packages/torch/serialization.py", line 791, in load
    with _open_file_like(f, 'rb') as opened_file:
  File "/home/mookie/miniconda3/lib/python3.10/site-packages/torch/serialization.py", line 271, in _open_file_like
    return _open_file(name_or_buffer, mode)
  File "/home/mookie/miniconda3/lib/python3.10/site-packages/torch/serialization.py", line 252, in __init__
    super().__init__(open(name, mode))
FileNotFoundError: [Errno 2] No such file or directory: 'data/alpaca/train.pt'
```
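Note that this traceback is a different problem from the JSONDecodeError: `finetune/lora.py` only loads the already-preprocessed tensors, so `data/alpaca/train.pt` has to be produced by the data-preparation step before finetuning can start. A quick sanity check (assuming the default `data/alpaca` layout):

```python
import os

# Assumed default layout -- finetune/lora.py reads these files, it does not create them.
data_dir = "data/alpaca"
for name in ("train.pt", "test.pt"):
    path = os.path.join(data_dir, name)
    status = "found" if os.path.exists(path) else "missing -- run the prepare script first"
    print(f"{path}: {status}")
```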
I found there's a very slight difference in `{` indentation between the custom set and the default set. Also, I didn't know you needed to pass arguments and full paths for the lora or llama finetuning scripts; I just followed "type python lora.py" straight from howto/. I'm 90% confident those were the problems.
I wanted to try replacing the default alpaca fine-tuning dataset with Microsoft Research's GPT-4 variant. I cut the GPT-4 set down to match the file size, then ran into this JSONDecodeError at the very end of the dataset. It kept popping up because I replaced the file incorrectly due to lack of knowledge, but it fails even if I replace it with part of the original dataset.
It's somewhat similar to #239 and #283.
```
JSONDecodeError: Expecting ',' delimiter: line 137001 column 1 (char 22579669)
```
All I did for halfgpt4.py was replace the https:// URL and the file name with ones pointing at a clone of my repo.
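If halfgpt4.py is just a copy of the stock prepare script, that should indeed be the only change needed; conceptually it comes down to swapping two values near the top of the script (the constant names below are made up for illustration, not the actual lit-llama code):

```python
# Hypothetical constants -- use whatever names the prepare script actually defines.
DATA_FILE_URL = "https://raw.githubusercontent.com/<your-user>/<your-fork>/main/halfgpt4.json"
DATA_FILE_NAME = "halfgpt4.json"
```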
Sorry this format is really funky. Hope you can read it ok.