SEACrowd / seacrowd-datahub

A collaborative project to collect datasets in SEA languages, SEA regions, or SEA cultures.
Apache License 2.0

Create dataset loader for scb-mt-en-th-2020 #219

Closed SamuelCahyawijaya closed 8 months ago

SamuelCahyawijaya commented 9 months ago

Dataloader name: scb-mt-en-th/scb-mt-en-th.py
DataCatalogue: http://seacrowd.github.io/seacrowd-catalogue/card.html?scb-mt-en-th

Dataset: scb-mt-en-th
Description: A Large English-Thai Parallel Corpus. The primary objective of our work is to build a large-scale English-Thai dataset for machine translation. We construct an English-Thai machine translation dataset with over 1 million segment pairs, curated from various sources, namely news, Wikipedia articles, SMS messages, task-based dialogs, web-crawled data and government documents. Methodology for gathering data, building parallel texts and removing noisy sentence pairs are presented in a reproducible manner. We train machine translation models based on this dataset. Our models' performance are comparable to that of Google Translation API (as of May 2020) for Thai-English and outperform Google when the Open Parallel Corpus (OPUS) is included in the training data for both Thai-English and English-Thai translation. The dataset, pre-trained models, and source code to reproduce our work are available for public use.
Subsets: -
Languages: tha, eng
Tasks: Machine Translation
License: Creative Commons Attribution Share Alike 4.0 (cc-by-sa-4.0)
Homepage: https://github.com/vistec-AI/thai2nmt
HF URL: https://huggingface.co/datasets/scb_mt_enth_2020
Paper URL: https://link.springer.com/article/10.1007/s10579-021-09536-6

jensan-1 commented 9 months ago

self-assign

jensan-1 commented 9 months ago

Hello @SamuelCahyawijaya,

I want to report that the DataCatalogue link above does not work. Instead, I found this link works for this dataset: https://seacrowd.github.io/seacrowd-catalogue/card.html?scb-mt-en-th-2020.

Therefore, one clarification: Should the dataset name be scb-mt-en-th or scb-mt-en-th-2020? I think the dataset name reported in the title, card, and dataloader name should be unified.

Thanks for taking a look at this problem!

sabilmakbar commented 8 months ago

Thanks for reporting this, @jen-santoso! We'll look into it. Also, I think the dataset name used for the dataloader should follow the snake_case naming convention, i.e. scb_mt_en_th_2020 or scb_mt_en_th.
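
For illustration, the renaming discussed here amounts to replacing hyphens with underscores. The sketch below is only an assumption about how one might do that: `to_snake_case` is a hypothetical helper, not a function from seacrowd-datahub.

```python
import re


def to_snake_case(name: str) -> str:
    """Convert a hyphenated dataset name to snake_case (hypothetical helper)."""
    # Collapse each run of hyphens or whitespace into one underscore, then lowercase.
    return re.sub(r"[-\s]+", "_", name.strip()).lower()


print(to_snake_case("scb-mt-en-th-2020"))  # scb_mt_en_th_2020
print(to_snake_case("scb-mt-en-th"))       # scb_mt_en_th
```

Whether the final name keeps the 2020 suffix is a separate decision; the helper only changes the separator.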

jensan-1 commented 8 months ago

Thanks for the snake_case catch, @sabilmakbar! I'll rename the files as soon as the dataloader name is decided (with or without 2020 at the end).

UPDATE: Pushed the snake_case dataloader name fix.

holylovenia commented 8 months ago

Hi @jen-santoso, thanks for notifying us regarding the error. Could you please look up the info using this monitor sheet for the time being?