bigscience-workshop / data_tooling

Tools for managing datasets for governance and training.
Apache License 2.0
77 stars, 48 forks

Create dataset vietnamese_MT_EV_VLSP2020 #254

Status: Closed (albertvillanova closed 2 years ago)

albertvillanova commented 2 years ago

cakiki commented 2 years ago

self-assign

cakiki commented 2 years ago

@albertvillanova done: https://huggingface.co/datasets/bigscience-catalogue-data/vietnamese_MT_EV_VLSP2020

I flattened the original data directory and encoded the directory structure in the file names. It is a standard parallel dataset with one sentence per line.
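As a rough illustration of the layout described above, the following sketch flattens a nested path into a single file name and pairs up sentence-aligned lines from two files. The underscore separator, the helper names `flatten_name` and `pair_lines`, and the example sentences are assumptions made for illustration; the naming scheme actually used in the repository is not stated in this thread.

```python
from itertools import zip_longest
from pathlib import PurePosixPath

_MISSING = object()  # sentinel used to detect files of unequal length


def flatten_name(relative_path, sep="_"):
    """Encode a nested path in one file name (separator is an assumption)."""
    return sep.join(PurePosixPath(relative_path).parts)


def pair_lines(src_lines, tgt_lines):
    """Yield (source, target) sentence pairs from two line-aligned iterables.

    Open file objects work as inputs because they iterate line by line.
    """
    for src, tgt in zip_longest(src_lines, tgt_lines, fillvalue=_MISSING):
        if src is _MISSING or tgt is _MISSING:
            raise ValueError("source and target have different numbers of lines")
        yield src.rstrip("\n"), tgt.rstrip("\n")


# Hypothetical example data: one sentence per line, aligned by line number.
english = ["Hello.\n", "Goodbye.\n"]
vietnamese = ["Xin chào.\n", "Tạm biệt.\n"]

for en, vi in pair_lines(english, vietnamese):
    print(en, "|", vi)
```

With real files, the two lists would be replaced by open file handles, for example `open(path, encoding="utf-8")` for each side of the pair.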

albertvillanova commented 2 years ago

This dataset is not loading:

from datasets import load_dataset
ds = load_dataset("bigscience-catalogue-data/vietnamese_MT_EV_VLSP2020", split="train", streaming=True, use_auth_token=True)
item = next(iter(ds))

gives:

FileNotFoundError: Couldn't find a dataset script at bigscience-catalogue-data/vietnamese_MT_EV_VLSP2020/vietnamese_MT_EV_VLSP2020.py or any data file in the same directory. Couldn't find 'bigscience-catalogue-data/vietnamese_MT_EV_VLSP2020' on the Hugging Face Hub either: FileNotFoundError: Unable to resolve any data file that matches ['**train*'] in dataset repository bigscience-catalogue-data/vietnamese_MT_EV_VLSP2020 with any supported extension ['csv', 'tsv', 'json', 'jsonl', 'parquet', 'txt', 'zip']
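The error message names the pattern and the extensions that were tried. The sketch below is a simplified illustration of that check, not the actual resolution logic of the `datasets` library: it uses the standard-library `fnmatch`, whose wildcard semantics differ from the library's globbing, and the file names in `hypothetical_files` are invented for the example, since the repository's real file names are not listed in this thread.

```python
from fnmatch import fnmatch
from pathlib import PurePosixPath

# Values copied from the error message above.
PATTERNS = ["**train*"]
SUPPORTED_EXTENSIONS = {"csv", "tsv", "json", "jsonl", "parquet", "txt", "zip"}


def resolvable(file_name):
    """Return True if the name matches a pattern and has a supported extension."""
    extension = PurePosixPath(file_name).suffix.lstrip(".")
    matches_pattern = any(fnmatch(file_name, pattern) for pattern in PATTERNS)
    return matches_pattern and extension in SUPPORTED_EXTENSIONS


# Hypothetical file names, for illustration only.
hypothetical_files = ["basic_data.vi", "basic_data.en", "train.txt"]
matches = [name for name in hypothetical_files if resolvable(name)]
print(matches)
```

Under these assumptions, a repository whose files have neither "train" in the name nor one of the listed extensions would yield no match, which is consistent with the error shown.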

cakiki commented 2 years ago

Oh; sorry. Was I meant to write a loading script?

albertvillanova commented 2 years ago

I have created another repo with a loading script specific to language modeling: https://huggingface.co/datasets/bigscience-catalogue-data/vinbigdata_mt_vlsp_2020_lm

Sample:

{'text': 'Ngân hàng HSBC sẽ sa thải 30.000 việc làm mặc dù lợi nhuận trước thuế tăng'}

albertvillanova commented 2 years ago

DONE: https://huggingface.co/datasets/bigscience-catalogue-lm-data/lm_vi_vinbigdata_mt_vlsp_2020

Sample:

{'text': 'Tôi phải đi ngủ.',
 'meta': "{'file': 'MT-EV-VLSP2020/basic/data.vi'}"}
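Note that the `meta` field in this sample is a string containing a Python dict literal (single quotes), not a nested dict and not valid JSON. A minimal sketch of how a consumer might decode it, assuming every record follows the same shape as the sample above; `parse_meta` is a hypothetical helper name:

```python
import ast

# The sample record shown above.
sample = {
    "text": "Tôi phải đi ngủ.",
    "meta": "{'file': 'MT-EV-VLSP2020/basic/data.vi'}",
}


def parse_meta(meta):
    """Decode a meta string holding a Python dict literal.

    ast.literal_eval only evaluates literals, so it is safer than eval()
    for untrusted strings.
    """
    parsed = ast.literal_eval(meta)
    if not isinstance(parsed, dict):
        raise ValueError("meta does not contain a dict literal")
    return parsed


meta = parse_meta(sample["meta"])
print(meta["file"])  # MT-EV-VLSP2020/basic/data.vi
```

`json.loads` would reject this string because JSON requires double-quoted keys and values, which is why `ast.literal_eval` is used here.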