When training with Llama-2 as the backbone, the run fails during dataset preprocessing:
Loading checkpoint shards: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:06<00:00, 3.31s/it]
Loading checkpoint shards: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:06<00:00, 3.22s/it]
Map: 0%| | 0/2628260 [00:02<?, ? examples/s]
Traceback (most recent call last):
File "xxx/OpenP5/command/../src/train.py", line 276, in
main()
File "xxx/OpenP5/command/../src/train.py", line 170, in main
TrainSet = train_data['train'].shuffle().map(process_func, batched=True)
File "xxx/lib/python3.9/site-packages/datasets/arrow_dataset.py", line 591, in wrapper
out: Union["Dataset", "DatasetDict"] = func(self, *args, *kwargs)
File "xxx/lib/python3.9/site-packages/datasets/arrow_dataset.py", line 556, in wrapper
out: Union["Dataset", "DatasetDict"] = func(self, args, kwargs)
File "xxx/lib/python3.9/site-packages/datasets/arrow_dataset.py", line 3089, in map
for rank, done, content in Dataset._map_single(**dataset_kwargs):
File "xxx/lib/python3.9/site-packages/datasets/arrow_dataset.py", line 3485, in _map_single
writer.write_batch(batch)
File "xxx/lib/python3.9/site-packages/datasets/arrow_writer.py", line 559, in write_batch
pa_table = pa.Table.from_arrays(arrays, schema=schema)
File "pyarrow/table.pxi", line 3986, in pyarrow.lib.Table.from_arrays
File "pyarrow/table.pxi", line 3266, in pyarrow.lib.Table.validate
File "pyarrow/error.pxi", line 91, in pyarrow.lib.check_status
pyarrow.lib.ArrowInvalid: Column 5 named input_ids expected length 1000 but got length 1024
I've tried setting "cutoff=1000", but it still fails with the same error. It looks like the tokenizer is missing a padding step, so the rows in the `input_ids` column end up with inconsistent lengths. Could you provide an example of applying LLaMA-2 as the backbone? Thank you.
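For reference, below is a minimal sketch of the kind of padded tokenization I was expecting. Everything in it is an assumption on my side: the model path, the "input" column name, and the cutoff value are all hypothetical, and the sketch just illustrates that with padding="max_length" every row of `input_ids` has the same length, which is what the Arrow writer seems to require.

```python
# Minimal sketch of a padded tokenization map function. The model path,
# the "input" column name, and the cutoff value are my assumptions, not
# taken from the OpenP5 code.
from transformers import AutoTokenizer

cutoff = 1024  # hypothetical fixed sequence length

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")
# LLaMA-2 ships without a pad token, so one has to be set explicitly;
# otherwise each example keeps its own length and the Arrow writer
# ends up with ragged columns.
tokenizer.pad_token = tokenizer.eos_token

def process_func(examples):
    # With batched=True, examples["input"] is a list of strings and the
    # tokenizer returns one row per example, each padded/truncated to
    # exactly `cutoff` tokens.
    return tokenizer(
        examples["input"],
        padding="max_length",
        truncation=True,
        max_length=cutoff,
    )

# TrainSet = train_data["train"].shuffle().map(process_func, batched=True)
```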