albertvillanova commented 2 years ago

uid: vietnamese_MT_EV_VLSP2020
type: processed
description:
- name: Vietnamese EN-VI Machine Translation VLSP 2020
- description: Bilingual EN-VI dataset consisting of 20k samples. Domains include News, openSub (3.5m), TED-like, EVBCorpus (45k), Wiki-alt (20k), and a basic dataset (8.8k).
- homepage: https://vinbigdata.org/events/vinbigdata-chia-se-100-gio-du-lieu-tieng-noi-cho-cong-dong/
- validated: True
languages:
- language_names:
- Vietnamese
- language_comments: General, Formal, Semi-formal
- language_locations:
- South-eastern Asia
- Vietnam
- validated: False
custodian:
- name: VinBigData
- in_catalogue:
- type: A university or research institution
- location: Vietnam
- contact_name: VinBigData
- contact_email: info@vinbigdata.org
- contact_submitter: False
- additional: https://product.vinbigdata.org/contact/
- validated: False
availability:
- procurement:
- for_download: Yes - it has a direct download link or links
- download_url: https://drive.google.com/file/d/1Gkii6E2Xqcd6AiMvW1tB82TVoXkbBWTP/view?usp=sharing
- download_email:
- licensing:
- has_licenses: Unclear
- license_text: It is open-source
- license_properties:
- license_list:
- pii:
- has_pii: Unclear
- generic_pii_likely:
- generic_pii_list:
- numeric_pii_likely:
- numeric_pii_list:
- sensitive_pii_likely:
- sensitive_pii_list:
- no_pii_justification_class: general knowledge not written by or referring to private persons
- no_pii_justification_text:
- validated: False
processed_from_primary:
- from_primary: Original data
- primary_availability:
- primary_license:
- primary_types:
- validated: False
media:
- category:
- text
- text_format:
- .TXT
- audiovisual_format:
- image_format:
- database_format:
- .RAR
- text_is_transcribed: No
- instance_type: article
- instance_count: 100K<n<1M
- instance_size: 10<n<100
- validated: False
fname: vietnamese_MT_EV_VLSP2020.json

cakiki commented 2 years ago

self-assign

cakiki commented 2 years ago

@albertvillanova done: https://huggingface.co/datasets/bigscience-catalogue-data/vietnamese_MT_EV_VLSP2020

I flattened the original data directory and encoded that information in the file names. (Standard parallel dataset with one sentence per line.)
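For reference, the flattening described above could look roughly like the sketch below. The helper name `flatten_directory`, the `__` separator, and the example paths are assumptions for illustration only; the thread does not say which naming scheme was actually used.

```python
from pathlib import Path
import shutil


def flatten_directory(src_root: Path, dst_dir: Path, sep: str = "__") -> dict:
    """Copy every file under src_root into dst_dir without subdirectories.

    The relative path is preserved in the new file name by joining its
    parts with `sep` (hypothetical scheme), e.g. basic/data.vi -> basic__data.vi.
    Returns a mapping from flat file name to original relative path.
    """
    dst_dir.mkdir(parents=True, exist_ok=True)
    mapping = {}
    for path in sorted(src_root.rglob("*")):
        if not path.is_file():
            continue
        rel = path.relative_to(src_root)
        flat_name = sep.join(rel.parts)
        if flat_name in mapping:
            # Two different paths would collapse to the same flat name.
            raise ValueError(f"name collision for {flat_name!r}")
        shutil.copyfile(path, dst_dir / flat_name)
        mapping[flat_name] = rel.as_posix()
    return mapping
```

Keeping the mapping around makes it possible to recover the original directory (and therefore the domain) of each file later on.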

albertvillanova commented 2 years ago

This dataset is not loading:

from datasets import load_dataset
ds = load_dataset("bigscience-catalogue-data/vietnamese_MT_EV_VLSP2020", split="train", streaming=True, use_auth_token=True)
item = next(iter(ds))

gives:


FileNotFoundError: Couldn't find a dataset script at bigscience-catalogue-data/vietnamese_MT_EV_VLSP2020/vietnamese_MT_EV_VLSP2020.py or any data file in the same directory. Couldn't find 'bigscience-catalogue-data/vietnamese_MT_EV_VLSP2020' on the Hugging Face Hub either: FileNotFoundError: Unable to resolve any data file that matches ['**train*'] in dataset repository bigscience-catalogue-data/vietnamese_MT_EV_VLSP2020 with any supported extension ['csv', 'tsv', 'json', 'jsonl', 'parquet', 'txt', 'zip']
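Reading the error: without a loading script, the library looked for files matching `**train*` with one of the listed extensions and found none. The sketch below approximates that check with the standard library's `fnmatch`; it is not the library's actual resolution code, and the file names are hypothetical since the repo contents are not shown in this thread.

```python
from fnmatch import fnmatchcase

# Pattern and extensions copied from the error message above.
TRAIN_PATTERN = "**train*"
SUPPORTED_EXTENSIONS = {"csv", "tsv", "json", "jsonl", "parquet", "txt", "zip"}


def looks_like_train_file(file_name: str) -> bool:
    """Rough approximation: name matches the pattern and has a supported extension."""
    extension = file_name.rsplit(".", 1)[-1] if "." in file_name else ""
    return fnmatchcase(file_name, TRAIN_PATTERN) and extension in SUPPORTED_EXTENSIONS


# Hypothetical file names, for illustration only.
for name in ["basic_data.vi", "basic_data.txt", "train.txt", "data/train.en.txt"]:
    print(name, looks_like_train_file(name))
```

Under this approximation, files such as `basic_data.vi` fail on both counts (no `train` in the name, unsupported extension), which would be consistent with the error.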

cakiki commented 2 years ago

Oh, sorry. Was I meant to write a loading script?

albertvillanova commented 2 years ago

I have created another repo with a loading script specific to language modeling: https://huggingface.co/datasets/bigscience-catalogue-data/vinbigdata_mt_vlsp_2020_lm

Sample:

{'text': 'Ngân hàng HSBC sẽ sa thải 30.000 việc làm mặc dù lợi nhuận trước thuế tăng'}

albertvillanova commented 2 years ago

DONE: https://huggingface.co/datasets/bigscience-catalogue-lm-data/lm_vi_vinbigdata_mt_vlsp_2020

Sample:


{'text': 'Tôi phải đi ngủ.',
 'meta': "{'file': 'MT-EV-VLSP2020/basic/data.vi'}"}
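A minimal sketch of how records of this shape could be produced from a one-sentence-per-line file. This is not the actual loading script: the function names are hypothetical, and skipping empty lines is an assumption. Note that `meta` in the sample is a string holding a Python dict literal, so it needs parsing before use.

```python
import ast
from pathlib import Path
from typing import Iterator


def iter_lm_records(path: Path, file_label: str) -> Iterator[dict]:
    """Yield one {'text', 'meta'} record per non-empty line of `path`."""
    with path.open(encoding="utf-8") as handle:
        for line in handle:
            text = line.strip()
            if not text:
                continue  # assumption: blank lines carry no sentence
            # meta is stored as the string form of a dict, as in the sample.
            yield {"text": text, "meta": str({"file": file_label})}


def parse_meta(record: dict) -> dict:
    """Turn the stringified meta field back into a dict."""
    return ast.literal_eval(record["meta"])
```

`ast.literal_eval` is used rather than `json.loads` because the sample's `meta` string uses single quotes, which is not valid JSON.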

bigscience-workshop / data_tooling

Create dataset vietnamese_MT_EV_VLSP2020 #254
