Lightning-AI / litgpt

20+ high-performance LLMs with recipes to pretrain, finetune and deploy at scale.
https://lightning.ai
Apache License 2.0
10.74k stars 1.07k forks

TypeError: TextInputSequence must be str #1759

Open hemanth opened 1 month ago

hemanth commented 1 month ago

Bug description

⚡ ~ litgpt finetune_lora meta-llama/Llama-3.2-1B   --data JSON   --data.json_path sanksrit-dataset.json   --data.val_split_fraction 0.1   --train.epochs 1   --out_dir out/llama-3.2-finetuned   --precision bf16-true > res
Seed set to 1337
Traceback (most recent call last):
  File "/home/zeus/miniconda3/envs/cloudspace/bin/litgpt", line 8, in <module>
    sys.exit(main())
  File "/home/zeus/miniconda3/envs/cloudspace/lib/python3.10/site-packages/litgpt/__main__.py", line 71, in main
    CLI(parser_data)
  File "/home/zeus/miniconda3/envs/cloudspace/lib/python3.10/site-packages/jsonargparse/_cli.py", line 119, in CLI
    return _run_component(component, init.get(subcommand))
  File "/home/zeus/miniconda3/envs/cloudspace/lib/python3.10/site-packages/jsonargparse/_cli.py", line 204, in _run_component
    return component(**cfg)
  File "/home/zeus/miniconda3/envs/cloudspace/lib/python3.10/site-packages/litgpt/finetune/lora.py", line 169, in setup
    fabric.launch(main, devices, seed, config, data, checkpoint_dir, out_dir, train, eval, optimizer)
  File "/home/zeus/miniconda3/envs/cloudspace/lib/python3.10/site-packages/lightning/fabric/fabric.py", line 843, in launch
    return self._wrap_and_launch(function, self, *args, **kwargs)
  File "/home/zeus/miniconda3/envs/cloudspace/lib/python3.10/site-packages/lightning/fabric/fabric.py", line 929, in _wrap_and_launch
    return to_run(*args, **kwargs)
  File "/home/zeus/miniconda3/envs/cloudspace/lib/python3.10/site-packages/lightning/fabric/fabric.py", line 934, in _wrap_with_setup
    return to_run(*args, **kwargs)
  File "/home/zeus/miniconda3/envs/cloudspace/lib/python3.10/site-packages/litgpt/finetune/lora.py", line 218, in main
    fit(
  File "/home/zeus/miniconda3/envs/cloudspace/lib/python3.10/site-packages/litgpt/finetune/lora.py", line 270, in fit
    longest_seq_length, longest_seq_ix = get_longest_seq_length(ConcatDataset([train_dataloader.dataset, val_dataloader.dataset]))
  File "/home/zeus/miniconda3/envs/cloudspace/lib/python3.10/site-packages/litgpt/finetune/lora.py", line 438, in get_longest_seq_length
    lengths = [len(d["input_ids"]) for d in data]
  File "/home/zeus/miniconda3/envs/cloudspace/lib/python3.10/site-packages/litgpt/finetune/lora.py", line 438, in <listcomp>
    lengths = [len(d["input_ids"]) for d in data]
  File "/home/zeus/miniconda3/envs/cloudspace/lib/python3.10/site-packages/torch/utils/data/dataset.py", line 335, in __getitem__
    return self.datasets[dataset_idx][sample_idx]
  File "/home/zeus/miniconda3/envs/cloudspace/lib/python3.10/site-packages/litgpt/data/base.py", line 83, in __getitem__
    encoded_response = self.tokenizer.encode(example["output"], bos=False, eos=True, max_length=self.max_seq_length)
  File "/home/zeus/miniconda3/envs/cloudspace/lib/python3.10/site-packages/litgpt/tokenizer.py", line 114, in encode
    tokens = self.processor.encode(string).ids
TypeError: TextInputSequence must be str

Dataset looks like:

[
  {
    "input": "ये त्रि॑ष॒प्ताः प॑रि॒यन्ति॒ विश्वा॑ रू॒पाणि॒ बिभ्र॑तः । वा॒चस्पति॒र्बला॒ तेषां॑ त॒न्वो॑ अ॒द्य द॑धातु मे  ॥ (१)",
    "output": "The three qualities of Rajogun, Tamogun and Satogun and earth, water, tej, air, sky, tanmatra and ego, the seven substances travel everywhere in divine form, brahma, the swami of speech, give me the divine power of those elements and substances. (1)",
    "instruction": "Convert Sanskrit Text to English"
  },
  ......
]

What operating system are you using?

Linux

LitGPT Version

0.4.13

rasbt commented 1 month ago

Hi there, could you try this with a very small text example that only consists of a few entries, e.g., repeated versions of the entry you showed:

[
  {
    "input": "ये त्रि॑ष॒प्ताः प॑रि॒यन्ति॒ विश्वा॑ रू॒पाणि॒ बिभ्र॑तः । वा॒चस्पति॒र्बला॒ तेषां॑ त॒न्वो॑ अ॒द्य द॑धातु मे  ॥ (१)",
    "output": "The three qualities of Rajogun, Tamogun and Satogun and earth, water, tej, air, sky, tanmatra and ego, the seven substances travel everywhere in divine form, brahma, the swami of speech, give me the divine power of those elements and substances. (1)",
    "instruction": "Convert Sanskrit Text to English"
  },
  {
    "input": "ये त्रि॑ष॒प्ताः प॑रि॒यन्ति॒ विश्वा॑ रू॒पाणि॒ बिभ्र॑तः । वा॒चस्पति॒र्बला॒ तेषां॑ त॒न्वो॑ अ॒द्य द॑धातु मे  ॥ (१)",
    "output": "The three qualities of Rajogun, Tamogun and Satogun and earth, water, tej, air, sky, tanmatra and ego, the seven substances travel everywhere in divine form, brahma, the swami of speech, give me the divine power of those elements and substances. (1)",
    "instruction": "Convert Sanskrit Text to English"
  },
  {
    "input": "ये त्रि॑ष॒प्ताः प॑रि॒यन्ति॒ विश्वा॑ रू॒पाणि॒ बिभ्र॑तः । वा॒चस्पति॒र्बला॒ तेषां॑ त॒न्वो॑ अ॒द्य द॑धातु मे  ॥ (१)",
    "output": "The three qualities of Rajogun, Tamogun and Satogun and earth, water, tej, air, sky, tanmatra and ego, the seven substances travel everywhere in divine form, brahma, the swami of speech, give me the divine power of those elements and substances. (1)",
    "instruction": "Convert Sanskrit Text to English"
  },
  {
    "input": "ये त्रि॑ष॒प्ताः प॑रि॒यन्ति॒ विश्वा॑ रू॒पाणि॒ बिभ्र॑तः । वा॒चस्पति॒र्बला॒ तेषां॑ त॒न्वो॑ अ॒द्य द॑धातु मे  ॥ (१)",
    "output": "The three qualities of Rajogun, Tamogun and Satogun and earth, water, tej, air, sky, tanmatra and ego, the seven substances travel everywhere in divine form, brahma, the swami of speech, give me the divine power of those elements and substances. (1)",
    "instruction": "Convert Sanskrit Text to English"
  }
]

This should help narrow down whether the issue is caused by the non-Latin characters in the input field or by formatting problems in some of the other entries.
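
The traceback ends in `self.tokenizer.encode(example["output"], ...)` failing with `TypeError: TextInputSequence must be str`, so one possible cause is an entry whose field holds something other than a string (for example `null` or a number) rather than the Sanskrit text itself. The following is a minimal sketch for checking that, assuming the dataset is a JSON list of objects with the `instruction`, `input`, and `output` keys shown above. `load_records` and `find_non_string_fields` are hypothetical helper names for illustration and are not part of LitGPT.

```python
import json
from typing import Any

# Field names taken from the dataset excerpt in this issue.
FIELDS = ("instruction", "input", "output")


def load_records(path: str) -> list[dict[str, Any]]:
    """Load the dataset file, assuming it is a JSON list of objects."""
    with open(path, encoding="utf-8") as f:
        return json.load(f)


def find_non_string_fields(
    records: list[dict[str, Any]], fields: tuple[str, ...] = FIELDS
) -> list[tuple[int, str, str]]:
    """Return (entry index, field name, type name) for each present field that is not a str."""
    problems = []
    for index, record in enumerate(records):
        for field in fields:
            if field in record and not isinstance(record[field], str):
                problems.append((index, field, type(record[field]).__name__))
    return problems


if __name__ == "__main__":
    # Small in-memory sample; the second entry has a null "output".
    sample = [
        {"instruction": "Convert Sanskrit Text to English", "input": "ये", "output": "The three"},
        {"instruction": "Convert Sanskrit Text to English", "input": "ये", "output": None},
    ]
    for index, field, type_name in find_non_string_fields(sample):
        print(f"entry {index}: {field!r} is {type_name}, expected str")
```

To check the real file, pass the result of `load_records("sanksrit-dataset.json")` to `find_non_string_fields`. An empty result would point back toward the tokenizer or the characters in the text, while any reported entry is a candidate for the failing `encode` call.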