agiresearch / OpenP5

OpenP5: An Open-Source Platform for Developing, Training, and Evaluating LLM-based Recommender Systems
Apache License 2.0

How to apply LLaMA-2 as backbone #14

Closed Tingji2419 closed 11 months ago

Tingji2419 commented 12 months ago

When training with LLaMA-2 as the backbone, it fails during dataset preprocessing:

Loading checkpoint shards: 100%|██████████| 2/2 [00:06<00:00, 3.31s/it]
Loading checkpoint shards: 100%|██████████| 2/2 [00:06<00:00, 3.22s/it]
Map:   0%|          | 0/2628260 [00:02<?, ? examples/s]
Traceback (most recent call last):
  File "xxx/OpenP5/command/../src/train.py", line 276, in <module>
    main()
  File "xxx/OpenP5/command/../src/train.py", line 170, in main
    TrainSet = train_data['train'].shuffle().map(process_func, batched=True)
  File "xxx/lib/python3.9/site-packages/datasets/arrow_dataset.py", line 591, in wrapper
    out: Union["Dataset", "DatasetDict"] = func(self, *args, **kwargs)
  File "xxx/lib/python3.9/site-packages/datasets/arrow_dataset.py", line 556, in wrapper
    out: Union["Dataset", "DatasetDict"] = func(self, *args, **kwargs)
  File "xxx/lib/python3.9/site-packages/datasets/arrow_dataset.py", line 3089, in map
    for rank, done, content in Dataset._map_single(**dataset_kwargs):
  File "xxx/lib/python3.9/site-packages/datasets/arrow_dataset.py", line 3485, in _map_single
    writer.write_batch(batch)
  File "xxx/lib/python3.9/site-packages/datasets/arrow_writer.py", line 559, in write_batch
    pa_table = pa.Table.from_arrays(arrays, schema=schema)
  File "pyarrow/table.pxi", line 3986, in pyarrow.lib.Table.from_arrays
  File "pyarrow/table.pxi", line 3266, in pyarrow.lib.Table.validate
  File "pyarrow/error.pxi", line 91, in pyarrow.lib.check_status
pyarrow.lib.ArrowInvalid: Column 5 named input_ids expected length 1000 but got length 1024

I've tried setting "cutoff=1000", but it still fails. It seems the tokenizer lacks a padding step. Could you provide an example of applying LLaMA-2 as the backbone? Thank you.
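[Editor's note] For context, below is a minimal sketch of what a padded tokenization step for LLaMA-2 typically looks like with the datasets `map` API; it is not the repository's actual `process_func`, and the model name (`meta-llama/Llama-2-7b-hf`), the column name `"input"`, and the fixed `cutoff` value are assumptions for illustration. LLaMA-2's tokenizer ships without a pad token, so one must be assigned before `padding="max_length"` will work; dropping the original columns via `remove_columns` also keeps every column in the mapped batch the same length, which is what the ArrowInvalid error above complains about.

```python
# Hypothetical sketch, assuming a single text column named "input".
from transformers import AutoTokenizer

cutoff = 1000  # assumed fixed sequence length, matching the cutoff mentioned above

# Model name is an assumption for illustration.
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")
# LLaMA-2 has no pad token by default; reuse the EOS token so padding works.
tokenizer.pad_token = tokenizer.eos_token

def process_func(examples):
    # Pad every sequence to exactly `cutoff` tokens and truncate anything longer,
    # so all rows returned from a batched map have the same length.
    return tokenizer(
        examples["input"],
        max_length=cutoff,
        padding="max_length",
        truncation=True,
    )

# Removing the original columns keeps the mapped batch's columns consistent:
# TrainSet = train_data["train"].shuffle().map(
#     process_func, batched=True, remove_columns=train_data["train"].column_names
# )
```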

shuyuan-x commented 11 months ago

Thanks for your interest! We have updated the code to fix this issue.