abhinand5 / tamil-llama

A New Tamil Large Language Model (LLM) Based on Llama 2
GNU General Public License v3.0

Unable to load a dataset #2

Closed winstondcosta closed 7 months ago

winstondcosta commented 7 months ago

Unable to load a dataset from Huggingface

Steps to reproduce the bug

from datasets import load_dataset

dataset_name = "abhinand/tamil-alpaca-orca"
dataset = load_dataset(dataset_name, split="train")

Expected results

Loading the dataset

Actual results

Failed to read file '/home/cirrusrays/.cache/huggingface/datasets/downloads/3405a3ad3f4baf48e5db5d2e3d7e305cdc7aaed173814c1e4880bc2028ccbf46' with error <class 'ValueError'>: Couldn't cast
instruction: string
input: string
output: string
text: string
system_prompt: string
type: string
-- schema metadata --
huggingface: '{"info": {"features": {"instruction": {"dtype": "string", "' + 266
to
{'instruction': Value(dtype='string', id=None), 'input': Value(dtype='string', id=None), 'output': Value(dtype='string', id=None), 'text': Value(dtype='string', id=None), 'system_prompt': Value(dtype='string', id=None)}
because column names don't match

Generating train split: 0%| | 0/51876 [00:00<?, ? examples/s]

Traceback (most recent call last):
  File "/home/cirrusrays/anaconda3/envs/envtamillama/lib/python3.10/site-packages/datasets/builder.py", line 1925, in _prepare_split_single
    for _, table in generator:
  File "/home/cirrusrays/anaconda3/envs/envtamillama/lib/python3.10/site-packages/datasets/packaged_modules/parquet/parquet.py", line 86, in _generate_tables
    yield f"{file_idx}_{batch_idx}", self._cast_table(pa_table)
  File "/home/cirrusrays/anaconda3/envs/envtamillama/lib/python3.10/site-packages/datasets/packaged_modules/parquet/parquet.py", line 66, in _cast_table
    pa_table = table_cast(pa_table, self.info.features.arrow_schema)
  File "/home/cirrusrays/anaconda3/envs/envtamillama/lib/python3.10/site-packages/datasets/table.py", line 2328, in table_cast
    return cast_table_to_schema(table, schema)
  File "/home/cirrusrays/anaconda3/envs/envtamillama/lib/python3.10/site-packages/datasets/table.py", line 2286, in cast_table_to_schema
    raise ValueError(f"Couldn't cast\n{table.schema}\nto\n{features}\nbecause column names don't match")
ValueError: Couldn't cast
instruction: string
input: string
output: string
text: string
system_prompt: string
type: string
-- schema metadata --
huggingface: '{"info": {"features": {"instruction": {"dtype": "string", "' + 266
to
{'instruction': Value(dtype='string', id=None), 'input': Value(dtype='string', id=None), 'output': Value(dtype='string', id=None), 'text': Value(dtype='string', id=None), 'system_prompt': Value(dtype='string', id=None)}
because column names don't match

abhinand5 commented 7 months ago

Hi @winstondcosta, the issue was with the HF dataset's metadata. I've fixed it, and it should now work for you.


Please delete the existing cache before retrying.

$ pip install "huggingface_hub[cli]"
$ huggingface-cli delete-cache