SpeechColab / GigaSpeech2

An evolving, large-scale and multi-domain ASR corpus for low-resource languages with automated crawling, transcription and refinement
Apache License 2.0

load_dataset bug #3

Closed ruby11dog closed 4 months ago

ruby11dog commented 5 months ago

When I run

    dataset = load_dataset("speechcolab/gigaspeech2", split='data.th')

the program is interrupted by:

    Traceback (most recent call last):
      File "/root/miniforge3/envs/audio_process/lib/python3.8/site-packages/datasets/builder.py", line 1894, in _prepare_split_single
        writer.write_table(table)
      File "/root/miniforge3/envs/audio_process/lib/python3.8/site-packages/datasets/arrow_writer.py", line 570, in write_table
        pa_table = table_cast(pa_table, self._schema)
      File "/root/miniforge3/envs/audio_process/lib/python3.8/site-packages/datasets/table.py", line 2324, in table_cast
        return cast_table_to_schema(table, schema)
      File "/root/miniforge3/envs/audio_process/lib/python3.8/site-packages/datasets/table.py", line 2282, in cast_table_to_schema
        raise ValueError(f"Couldn't cast\n{table.schema}\nto\n{features}\nbecause column names don't match")
    ValueError: Couldn't cast

yfyeung commented 4 months ago

Hi, @ruby11dog

The addition of the script gigaspeech2.py to Hugging Face is currently disabled. We do not support loading through load_dataset, as our dataset was not uploaded through the datasets library.

To download the Thai subset, you can use the following commands:

GIT_LFS_SKIP_SMUDGE=1 git clone https://huggingface.co/datasets/speechcolab/gigaspeech2
cd gigaspeech2
git lfs pull --include "data/th"
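
For anyone who would rather stay in Python, the same partial download can probably be done with huggingface_hub instead of git. This is only a rough sketch, not something the maintainers have confirmed: it assumes huggingface_hub is installed and that each language subset sits under data/<lang>/ (inferred from the --include "data/th" pattern above). The helper names subset_patterns and download_subset are made up for illustration.

```python
from typing import List

REPO_ID = "speechcolab/gigaspeech2"  # dataset repo named in this thread


def subset_patterns(lang: str) -> List[str]:
    """Glob patterns for one language subset (assumes a data/<lang>/ layout)."""
    return [f"data/{lang}/*"]


def download_subset(lang: str, local_dir: str = "gigaspeech2") -> str:
    """Fetch only data/<lang>/ from the Hub and return the local folder path."""
    # Imported lazily so subset_patterns works without huggingface_hub installed.
    from huggingface_hub import snapshot_download

    return snapshot_download(
        repo_id=REPO_ID,
        repo_type="dataset",  # the repo is a dataset, not a model
        allow_patterns=subset_patterns(lang),  # skip every other subset
        local_dir=local_dir,
    )
```

Calling download_subset("th") should then fetch the Thai files into ./gigaspeech2 without cloning the whole repository. It only downloads the raw files; it does not make load_dataset work.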