SEACrowd / seacrowd-datahub

A collaborative project to collect datasets in SEA languages, SEA regions, or SEA cultures.
Apache License 2.0

Create dataset loader for VLSP2020 MT #629

Closed: SamuelCahyawijaya closed this issue 1 month ago

SamuelCahyawijaya commented 3 months ago

Dataloader name: vlsp2020_mt_envi/vlsp2020_mt_envi.py
DataCatalogue: http://seacrowd.github.io/seacrowd-catalogue/card.html?vlsp2020_mt_envi

| Dataset | vlsp2020_mt_envi |
| --- | --- |
| Description | Parallel and monolingual data for training machine translation systems that translate English texts into Vietnamese, with a focus on the news domain. The data was crawled from high-quality bilingual or multilingual news websites and from one-speaker educational talks on various topics, mostly technology, entertainment, and design (hereafter referred to as TED-like talks). The dataset also includes noisy movie subtitles from the OpenSubtitle dataset. |
| Subsets | - |
| Languages | vie |
| Tasks | Machine Translation |
| License | Unknown (unknown) |
| Homepage | https://github.com/thanhleha-kit/EnViCorpora |
| HF URL | - |
| Paper URL | - |
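
For whoever picks this up, the core of a machine translation loader for parallel data like this could be sketched as below. This is only a rough sketch under assumptions: it assumes the English and Vietnamese sides come as line-aligned text, which this issue does not confirm, and the function name `pair_parallel_lines` and the output field names (`id`, `text_1`, `text_2`, `text_1_name`, `text_2_name`) are illustrative placeholders, not a confirmed schema.

```python
from typing import Dict, Iterator, List


def pair_parallel_lines(
    en_lines: List[str], vi_lines: List[str]
) -> Iterator[Dict[str, str]]:
    """Yield one translation example per aligned English/Vietnamese line pair.

    Hypothetical helper: assumes line i of the English side is the
    translation source for line i of the Vietnamese side.
    """
    # Misaligned inputs would silently produce wrong pairs, so fail early.
    if len(en_lines) != len(vi_lines):
        raise ValueError(
            f"Line count mismatch: {len(en_lines)} English vs {len(vi_lines)} Vietnamese"
        )
    for idx, (en, vi) in enumerate(zip(en_lines, vi_lines)):
        yield {
            "id": str(idx),
            "text_1": en.strip(),   # source sentence (English)
            "text_2": vi.strip(),   # target sentence (Vietnamese)
            "text_1_name": "eng",   # illustrative language labels
            "text_2_name": "vie",
        }


examples = list(
    pair_parallel_lines(
        ["Hello.\n", "Thank you.\n"],
        ["Xin chào.\n", "Cảm ơn.\n"],
    )
)
```

Raising on a line count mismatch is a deliberate choice here: with noisy sources such as subtitles, a silent off-by-one would misalign every pair that follows.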

patrickamadeus commented 3 months ago

self-assign