ContextualAI / gritlm

Generative Representational Instruction Tuning
https://arxiv.org/abs/2402.09906
MIT License

TypeError: TextEncodeInput must be Union[TextInputSequence, Tuple[InputSequence, InputSequence]] #48

Open · bo-jpg opened this issue 2 months ago

bo-jpg commented 2 months ago

Thank you for your contribution. I encountered the following error when training with toy data:

```
TypeError: TextEncodeInput must be Union[TextInputSequence, Tuple[InputSequence, InputSequence]]
```

I read online that the following may be the causes:

  1. The tokenizer's maximum length is not set;
  2. There are blank lines in the jsonl file;
  3. A newer version of the transformers library is incompatible;
  4. There are NaN values in the data.

I tried the fixes corresponding to all four of these causes, and the error is still raised. I would like to know why. Thank you very much! (A quick sanity check for causes 2 and 4 is sketched below.)
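One way to rule out causes 2 and 4 in a single pass is to scan the jsonl file before training. A minimal sketch, assuming one JSON object per line; since field names vary between datasets, every value in each record is checked recursively:

```python
import json
import sys

def scan_jsonl(path: str) -> None:
    """Flag blank lines and None/NaN values in a jsonl training file."""
    def has_bad_value(obj) -> bool:
        if obj is None:
            return True
        if isinstance(obj, float) and obj != obj:  # NaN is the only value != itself
            return True
        if isinstance(obj, dict):
            return any(has_bad_value(v) for v in obj.values())
        if isinstance(obj, list):
            return any(has_bad_value(v) for v in obj)
        return False

    with open(path, encoding="utf-8") as f:
        for i, line in enumerate(f, 1):
            if not line.strip():
                print(f"line {i}: blank line")
            elif has_bad_value(json.loads(line)):
                print(f"line {i}: contains None/NaN")

if __name__ == "__main__":
    scan_jsonl(sys.argv[1])
```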
Muennighoff commented 2 months ago

I just checked, and the command under GRIT here https://github.com/ContextualAI/gritlm?tab=readme-ov-file#run works fine for me.

bo-jpg commented 2 months ago

> I just checked, and the command under GRIT here https://github.com/ContextualAI/gritlm?tab=readme-ov-file#run works fine for me.

Thanks for the quick response!

Here is my config:

```
--per_device_train_batch_size 2 \
--gradient_accumulation_steps 1 \
--per_device_generative_bs 1 \
```

I printed my toy data right before it is passed to the tokenizer:

```
[default6]:['He He Me It I You You You You You', 'Me I He You Me He It Me It She']
[default6]:['Me He She He She It He She She Me', 'It He It She I I It He You She', 'Me You It Me Me She You I It He', 'It She She He Me It I You It You']
[default6]:['大人你大人大人大人他享受大人享受你', None]
[default7]:['Me He Me She I She You I It She', 'She It She She Me Me Me Me She Me']
[default7]:['You He I I She He I I He It', 'Me He It Me It He He She I You', 'I She He You He It You She It He', 'He It Me You He She I It Me He']
[default7]:['我我是享受是他你我他你', None]
```

We can see that there is an extra None in the batch of generative data, which is presumably the cause of the error. Why does this happen? Is it related to the following warning?

```
[default4]:/home/code/.python_libs/conda_env/myenv/lib/python3.9/site-packages/accelerate/accelerator.py:447: FutureWarning: Passing the following arguments to Accelerator is deprecated and will be removed in version 1.0 of Accelerate: dict_keys(['dispatch_batches', 'split_batches']). Please pass an accelerate.DataLoaderConfiguration instead:
[default4]:dataloader_config = DataLoaderConfiguration(dispatch_batches=None, split_batches=False)
```
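For context, a None anywhere after the first position of a text batch is enough to trigger exactly this TypeError from any Hugging Face fast tokenizer. A minimal repro; the model name is only an example:

```python
from transformers import AutoTokenizer

# Any fast (Rust-backed) tokenizer reproduces this; "gpt2" is just an example.
tok = AutoTokenizer.from_pretrained("gpt2")

tok(["hello world"])        # works
tok(["hello world", None])  # raises the TypeError above: transformers only
                            # type-checks the first element of the batch, so a
                            # None later in the list reaches the Rust tokenizer,
                            # which rejects it with
                            # TypeError: TextEncodeInput must be Union[...]
```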

Muennighoff commented 2 months ago

It seems like you're using your own custom data? Maybe you have None values in your data.

bo-jpg commented 2 months ago

> dataloader_config = DataLoaderConfiguration(dispatch_batches=None, split_batches=False)

I checked my toy data and it does not contain any None values. In addition, I used the toy data you provided, and it reported the same error, with a None in the batch:

```
[default1]:['What is the difference between a raspberry pi and an esp32? What is better suited for interfacing with a SD card? The Raspberry Pi is a single-board computer that runs a full-fledged operating system, while the ESP32 is a microcontroller that is typically used for IoT applications. The Raspberry Pi is better suited for interfacing with an SD card as it has a full-fledged operating system and a large number of libraries available for interfacing with various peripherals, including SD cards. The ESP32, on the other hand, has limited memory and processing power, and may require more effort to interface with an SD card.', None]
```

bo-jpg commented 2 months ago

@Muennighoff If I set --mode embedding, training runs fine. But if I set --mode unified, the generative data batch contains None and the error is raised. I would like to know why there are extra None values in the generative data batch.

bo-jpg commented 2 months ago

Hi, I took a rough look at the code in gritlm/gritlm/training/data.py, and I think the None values in the generative batch are introduced by lines 90, 131, 140, and 141. Looking forward to your reply!
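Until that is confirmed, a stopgap is to drop (or assert on) None entries right before the generative texts reach the tokenizer. A minimal sketch; sanitize_generative_batch is a hypothetical helper, not part of gritlm, and it only removes the symptom (the TypeError), not whatever logic in data.py inserts the None in the first place:

```python
from typing import List, Optional

def sanitize_generative_batch(texts: List[Optional[str]]) -> List[str]:
    # Hypothetical helper, not part of gritlm: drop None entries so the
    # fast tokenizer never sees them.
    cleaned = [t for t in texts if t is not None]
    if len(cleaned) != len(texts):
        # Log rather than fail, so training continues while the root
        # cause in data.py is being tracked down.
        print(f"dropped {len(texts) - len(cleaned)} None entries from generative batch")
    return cleaned
```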