Lightning-AI / lit-llama

Implementation of the LLaMA language model based on nanoGPT. Supports flash attention, Int8 and GPTQ 4bit quantization, LoRA and LLaMA-Adapter fine-tuning, pre-training. Apache 2.0-licensed.

Error while running modified prepare_alpaca.py on Linux Mint 21.1 #401

Open TheGermanEngie opened 1 year ago

TheGermanEngie commented 1 year ago

I wanted to replace the default Alpaca fine-tuning dataset with Microsoft Research's GPT-4 variant. I cut the GPT-4 set down to match the default file's size, then ran into this JSONDecodeError at the very end of the dataset. At first it kept popping up because I edited the end of the file incorrectly (I didn't know better), but it still fails even if I replace that part with a section of the original dataset.

It's somewhat similar to #239 and #283.

Traceback (most recent call last) 
│ /home/mookie/AI/lit-llama/scripts/prepare_halfgpt4.py:131 in <module>                            
│                                                                                                  
│   128 if __name__ == "__main__":                                                                 
│   129 │   from jsonargparse import CLI                                                           
│   130 │                                                                                          
│ ❱ 131 │   CLI(prepare)                                                                           
│   132                                                                                           
│                                                                                                  
│ /home/mookie/miniconda3/lib/python3.10/site-packages/jsonargparse/cli.py:85 in CLI               
│                                                                                                  
│    82 │   │   │   return parser                                                                  
│    83 │   │   cfg = parser.parse_args(args)                                                      │
│    84 │   │   cfg_init = parser.instantiate_classes(cfg)                                         │
│ ❱  85 │   │   return _run_component(component, cfg_init)                                         │
│    86 │                                                                                          │
│    87 │   subcommands = parser.add_subcommands(required=True)                                    │
│    88 │   comp_dict = {c.__name__: c for c in components}                                        │
│                                                                                                  │
│ /home/mookie/miniconda3/lib/python3.10/site-packages/jsonargparse/cli.py:147 in _run_component   │
│                                                                                                  │
│   144 def _run_component(component, cfg):                                                        │
│   145 │   cfg.pop("config", None)                                                                │
│   146 │   if not inspect.isclass(component):                                                     │
│ ❱ 147 │   │   return component(**cfg)                                                            │
│   148 │   subcommand = cfg.pop("subcommand")                                                     │
│   149 │   if not subcommand:                                                                     │
│   150 │   │   return component(**cfg)                                                            │
│                                                                                                  │
│ /home/mookie/AI/lit-llama/scripts/prepare_halfgpt4.py:45 in prepare                              │
│                                                                                                  │
│    42 │   tokenizer = Tokenizer(tokenizer_path)                                                  │
│    43 │                                                                                          │
│    44 │   with open(file_path, "r") as file:                                                     │
│ ❱  45 │   │   data = json.load(file)                                                             │
│    46 │                                                                                          │
│    47 │   # Partition the dataset into train and test                                            │
│    48 │   train_split_size = len(data) - test_split_size                                         │
│                                                                                                  │
│ /home/mookie/miniconda3/lib/python3.10/json/__init__.py:293 in load                              │
│                                                                                                  │
│   290 │   To use a custom ``JSONDecoder`` subclass, specify it with the ``cls``                  │
│   291 │   kwarg; otherwise ``JSONDecoder`` is used.                                              │
│   292 │   """                                                                                    │
│ ❱ 293 │   return loads(fp.read(),                                                                │
│   294 │   │   cls=cls, object_hook=object_hook,                                                  │
│   295 │   │   parse_float=parse_float, parse_int=parse_int,                                      │
│   296 │   │   parse_constant=parse_constant, object_pairs_hook=object_pairs_hook, **kw)          │
│                                                                                                  │
│ /home/mookie/miniconda3/lib/python3.10/json/__init__.py:346 in loads                             │
│                                                                                                  │
│   343 │   if (cls is None and object_hook is None and                                            │
│   344 │   │   │   parse_int is None and parse_float is None and                                  │
│   345 │   │   │   parse_constant is None and object_pairs_hook is None and not kw):              │
│ ❱ 346 │   │   return _default_decoder.decode(s)                                                  │
│   347 │   if cls is None:                                                                        │
│   348 │   │   cls = JSONDecoder                                                                  │
│   349 │   if object_hook is not None:                                                            │
│                                                                                                  │
│ /home/mookie/miniconda3/lib/python3.10/json/decoder.py:337 in decode                             │
│                                                                                                  │
│   334 │   │   containing a JSON document).                                                       │
│   335 │   │                                                                                      │
│   336 │   │   """                                                                                │
│ ❱ 337 │   │   obj, end = self.raw_decode(s, idx=_w(s, 0).end())                                  │
│   338 │   │   end = _w(s, end).end()                                                             │
│   339 │   │   if end != len(s):                                                                  │
│   340 │   │   │   raise JSONDecodeError("Extra data", s, end)                                    │
│                                                                                                  │
│ /home/mookie/miniconda3/lib/python3.10/json/decoder.py:353 in raw_decode                         │
│                                                                                                  │
│   350 │   │                                                                                      │
│   351 │   │   """                                                                                │
│   352 │   │   try:                                                                               │
│ ❱ 353 │   │   │   obj, end = self.scan_once(s, idx)                                              │
│   354 │   │   except StopIteration as err:                                                       │
│   355 │   │   │   raise JSONDecodeError("Expecting value", s, err.value) from None               │
│   356 │   │   return obj, end                                                                    │

JSONDecodeError: Expecting ',' delimiter: line 137001 column 1 (char 22579669)

All I did for prepare_halfgpt4.py was replace the https:// URL and file name with ones pointing to a clone of my repo.

Sorry this format is really funky. Hope you can read it ok.

carmocca commented 1 year ago

It's complaining about a missing comma in the JSON file you are loading. Where did you get this file from? Have you tried downloading it again?
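One way to find the spot the decoder is complaining about is to catch the error and print the text around the reported position. This is only a minimal sketch, assuming the file fits in memory; the helper name `show_json_error` and its `context` parameter are made up for illustration and are not part of lit-llama.

```python
import json
from typing import Optional


def show_json_error(text: str, context: int = 80) -> Optional[str]:
    """Describe the first JSON syntax error in `text`, or return None if it parses."""
    try:
        json.loads(text)
    except json.JSONDecodeError as err:
        # err.pos is the character offset shown as "(char N)" in the error message.
        start = max(0, err.pos - context)
        end = min(len(text), err.pos + context)
        snippet = text[start:end]
        return f"{err.msg} at line {err.lineno}, column {err.colno}:\n{snippet!r}"
    return None


if __name__ == "__main__":
    # Two objects with no comma between them, similar to a bad hand edit.
    broken = '[\n  {"instruction": "a"}\n  {"instruction": "b"}\n]'
    print(show_json_error(broken))
```

For a real dataset, read the file into a string first (for example with `Path(file_path).read_text(encoding="utf-8")`) and pass that string to the helper.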

TheGermanEngie commented 1 year ago

It's from https://github.com/Instruction-Tuning-with-GPT-4/GPT-4-LLM. I just cut the 52k .json file down to match the size of lit-llama's default alpaca set; it's less than a megabyte smaller. The last ] is at line 137003.
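Trimming a large JSON file by hand makes it easy to leave a stray or missing comma or bracket. A possible alternative, sketched here under the assumption that the dataset is a single top-level JSON array (the file ends with `]`), is to parse it, slice the list, and write it back out. The function name, paths, and record count are placeholders, not anything from lit-llama or the GPT-4-LLM repo.

```python
import json
from pathlib import Path


def truncate_json_dataset(src: Path, dst: Path, max_records: int) -> int:
    """Keep the first `max_records` entries of a JSON array and write valid JSON to `dst`."""
    with open(src, "r", encoding="utf-8") as f:
        data = json.load(f)
    if not isinstance(data, list):
        raise ValueError("expected a top-level JSON array")
    subset = data[:max_records]
    with open(dst, "w", encoding="utf-8") as f:
        # json.dump always emits well-formed JSON, so no delimiter can go missing.
        json.dump(subset, f, indent=4, ensure_ascii=False)
    return len(subset)
```

Because the output is produced by `json.dump`, the result loads cleanly with `json.load` regardless of where the cut falls; pick `max_records` by trial until the output size is close to the target.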

Here's the last object:

TheGermanEngie commented 1 year ago

`{ "instruction": "Analyze the given legal document and explain the key points.", "input": "The following is an excerpt from a contract between two parties, labeled \"Company A\" and \"Company B\": \n\n\"Company A agrees to provide reasonable assistance to Company B in ensuring the accuracy of the financial statements it provides. This includes allowing Company A reasonable access to personnel and other documents which may be necessary for Company B\u2019s review. Company B agrees to maintain the document provided by Company A in confidence, and will not disclose the information to any third parties without Company A\u2019s explicit permission.\"", "output": "The contract between Company A and Company B outlines an understanding between the two parties, whereby Company A agrees to provide assistance in ensuring the accuracy of financial statements provided by Company B. This assistance may include providing access to personnel and important documents, as deemed necessary by Company B. In return, Company B agrees to keep the information disclosed by Company A confidential and not share it with any third party without obtaining permission from Company A. This contract aims to promote accuracy and transparency while preserving confidentiality of sensitive information." }

]`

TheGermanEngie commented 1 year ago

Sorry for the spam... it's not formatting correctly in the GitHub comments.

TheGermanEngie commented 1 year ago

I changed OSes to Pop!_OS and the issue still comes up. Perhaps it's a deeper issue.

Traceback (most recent call last):
  File "/home/mookie/AI/lit-llama/finetune/lora.py", line 218, in <module>
    CLI(main)
  File "/home/mookie/miniconda3/lib/python3.10/site-packages/jsonargparse/cli.py", line 85, in CLI
    return _run_component(component, cfg_init)
  File "/home/mookie/miniconda3/lib/python3.10/site-packages/jsonargparse/cli.py", line 147, in _run_component
    return component(**cfg)
  File "/home/mookie/AI/lit-llama/finetune/lora.py", line 62, in main
    train_data, val_data = load_datasets(data_dir=data_dir)
  File "/home/mookie/AI/lit-llama/finetune/lora.py", line 206, in load_datasets
    train_data = torch.load(os.path.join(data_dir, "train.pt"))
  File "/home/mookie/miniconda3/lib/python3.10/site-packages/torch/serialization.py", line 791, in load
    with _open_file_like(f, 'rb') as opened_file:
  File "/home/mookie/miniconda3/lib/python3.10/site-packages/torch/serialization.py", line 271, in _open_file_like
    return _open_file(name_or_buffer, mode)
  File "/home/mookie/miniconda3/lib/python3.10/site-packages/torch/serialization.py", line 252, in __init__
    super().__init__(open(name, mode))
FileNotFoundError: [Errno 2] No such file or directory: 'data/alpaca/train.pt'

TheGermanEngie commented 1 year ago

I found there's a very slight difference in how the { is indented between the custom set and the default set. Also, I didn't know you needed to pass arguments and full paths for lora or llama optimizer; I just followed "type python lora.py" straight from howto/. I'm 90% confident those were the problems.
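The second traceback shows `torch.load` failing on the relative path `data/alpaca/train.pt`, which is looked up from whatever directory the script is launched in. A small pre-flight check along these lines can make that kind of mistake obvious before fine-tuning starts. It is a sketch only: the function name is hypothetical, and `train.pt` is the only file name the traceback confirms, so any other required files would have to be added to `required` by hand.

```python
from pathlib import Path
from typing import Iterable, List


def missing_dataset_files(data_dir: str, required: Iterable[str] = ("train.pt",)) -> List[Path]:
    """Return absolute paths of required files that do not exist under `data_dir`."""
    # A relative data_dir is resolved against the current working directory.
    base = Path(data_dir).resolve()
    return [base / name for name in required if not (base / name).is_file()]


if __name__ == "__main__":
    missing = missing_dataset_files("data/alpaca")
    if missing:
        print("Missing files:")
        for path in missing:
            print(f"  {path}")
    else:
        print("All required dataset files are present.")
```

Printing the resolved absolute path shows where the script is really looking, which helps when the prepare step wrote its output somewhere other than the directory the fine-tuning script reads from.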