SEACrowd / seacrowd-datahub

A collaborative project to collect datasets in SEA languages, SEA regions, or SEA cultures.
Apache License 2.0

Create dataset loader for PhoMT #115

Status: Closed (SamuelCahyawijaya closed 7 months ago)

SamuelCahyawijaya commented 12 months ago

Dataloader name: phomt/phomt.py
DataCatalogue: http://seacrowd.github.io/seacrowd-catalogue/card.html?phomt

Dataset phomt
Description We present a high-quality and large-scale Vietnamese-English parallel dataset, named PhoMT, that consists of 3.02M sentence pairs. Here, from PhoMT, we also prepare 38K sentence pairs with manually qualitative inspection, that are used for validation and test. We believe that our dataset construction process will help develop more efficient data creation strategies for other low-resource languages
Subsets -
Languages vie, eng
Tasks Machine Translation
License Unknown (unknown)
Homepage https://github.com/VinAIResearch/PhoMT
HF URL -
Paper URL https://aclanthology.org/2021.emnlp-main.369.pdf

yana-xuyan commented 11 months ago

self-assign

github-actions[bot] commented 11 months ago

Hi, may I know if you are still working on this issue? Please let @holylovenia @SamuelCahyawijaya @sabilmakbar know if you need any help.

yana-xuyan commented 10 months ago

Hi, just FYI, I'm still requesting the dataset from the authors. If you have the data, could you kindly share it with me? Many thanks!

sabilmakbar commented 10 months ago

I don't have it either at the moment. Let's ask on the Discord channel and see if anyone has it. Thanks for the update, @yana-xuyan!

sabilmakbar commented 10 months ago

Hi @yana-xuyan, have you gotten the dataset files yet? If not, someone in the Discord channel has the data, and we can probably ask him to share it (ideally the whole dataset, but if that's not possible, we can ask for part of it). This isn't ideal, because the original dataset authors still need to consent to such usage, but it's the best we can do for now.

yana-xuyan commented 10 months ago

Hi, thank you for asking! I haven’t got the dataset yet. Could you please ask him for the data? Sadly, I don’t have access to the Discord channel.

holylovenia commented 10 months ago

@yana-xuyan I've sent the dataset link in an email to you. Please let me know if you need anything else.

yana-xuyan commented 10 months ago

Hi, Holy! Yes, I've received the email :)

holylovenia commented 8 months ago

Hi @yana-xuyan, may I know if there are any further issues with this dataloader?

yana-xuyan commented 8 months ago

Hi Holy, I've been occupied with work recently. I'll finish this dataloader by the end of this week :)