albertvillanova closed this 2 years ago
@albertvillanova done: https://huggingface.co/datasets/bigscience-catalogue-data/vietnamese_MT_EV_VLSP2020
I flattened the original data directory and encoded that information in the file names. (Standard parallel dataset with one sentence per line.)
This dataset is not loading:
from datasets import load_dataset
ds = load_dataset("bigscience-catalogue-data/vietnamese_MT_EV_VLSP2020", split="train", streaming=True, use_auth_token=True)
item = next(iter(ds))
gives:
FileNotFoundError: Couldn't find a dataset script at bigscience-catalogue-data/vietnamese_MT_EV_VLSP2020/vietnamese_MT_EV_VLSP2020.py or any data file in the same directory. Couldn't find 'bigscience-catalogue-data/vietnamese_MT_EV_VLSP2020' on the Hugging Face Hub either: FileNotFoundError: Unable to resolve any data file that matches ['**train*'] in dataset repository bigscience-catalogue-data/vietnamese_MT_EV_VLSP2020 with any supported extension ['csv', 'tsv', 'json', 'jsonl', 'parquet', 'txt', 'zip']
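One possible reading of this error is that none of the files in the repository end in an extension from the list above. The sketch below only mirrors that extension check as named in the error message; it is not the library's actual file-resolution logic, and the file names are hypothetical examples, not the real contents of the repository.

```python
from pathlib import PurePosixPath

# Extensions copied from the error message above.
SUPPORTED_EXTENSIONS = {"csv", "tsv", "json", "jsonl", "parquet", "txt", "zip"}


def has_supported_extension(filename):
    """Return True if the file name ends in one of the listed extensions."""
    suffix = PurePosixPath(filename).suffix.lstrip(".").lower()
    return suffix in SUPPORTED_EXTENSIONS


# Hypothetical file names, for illustration only.
candidates = ["basic_train.en", "basic_train.vi", "basic_train.txt"]
matching = [name for name in candidates if has_supported_extension(name)]
```

Under this reading, files named with language codes as extensions (such as `.en` or `.vi`) would not be picked up without a loading script.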
Oh, sorry. Was I meant to write a loading script?
I have created another repo with a loading script specific to language modeling: https://huggingface.co/datasets/bigscience-catalogue-data/vinbigdata_mt_vlsp_2020_lm
Sample:
{'text': 'Ngân hàng HSBC sẽ sa thải 30.000 việc làm mặc dù lợi nhuận trước thuế tăng'}
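For readers wondering how a one-sentence-per-line file maps to records of this shape, here is a minimal sketch. It is not the loading script from the repository; the function name `iter_lm_examples` and the temporary file are assumptions made for illustration.

```python
import os
import tempfile


def iter_lm_examples(path):
    """Yield one {'text': ...} record per non-empty line of a UTF-8 text file."""
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            sentence = line.strip()
            if sentence:  # skip blank lines
                yield {"text": sentence}


# Demo on a temporary file standing in for one side of the parallel corpus.
with tempfile.NamedTemporaryFile(
    "w", encoding="utf-8", suffix=".vi", delete=False
) as tmp:
    tmp.write("Tôi phải đi ngủ.\n\nXin chào.\n")
    demo_path = tmp.name

examples = list(iter_lm_examples(demo_path))
os.remove(demo_path)
```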
DONE: https://huggingface.co/datasets/bigscience-catalogue-lm-data/lm_vi_vinbigdata_mt_vlsp_2020
Sample:
{'text': 'Tôi phải đi ngủ.',
'meta': "{'file': 'MT-EV-VLSP2020/basic/data.vi'}"}
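Note that `meta` in this sample is a string holding a Python-style dict literal, not a nested dict. A minimal sketch of recovering the source file name from it, assuming every record follows the format of the sample above (the helper name `parse_meta` is hypothetical):

```python
import ast

# Record copied from the sample above.
sample = {
    "text": "Tôi phải đi ngủ.",
    "meta": "{'file': 'MT-EV-VLSP2020/basic/data.vi'}",
}


def parse_meta(record):
    """Parse the string-encoded 'meta' field into a real dict."""
    # literal_eval only evaluates literals, so it is safer than eval here.
    return ast.literal_eval(record["meta"])


meta = parse_meta(sample)
source_file = meta["file"]
```

`json.loads` would not work on this string as written, because it uses single quotes.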