IndoNLP / nusa-crowd

A collaborative project to collect datasets in Indonesian languages.
Apache License 2.0

Create dataset loader for NusaTranslation MT #356

SamuelCahyawijaya closed this issue 11 months ago

SamuelCahyawijaya commented 1 year ago

NusaCatalogue: https://indonlp.github.io/nusa-catalogue/card.html?nusa_translation_mt

Dataset: nusa_translation_mt
Description: NusaTranslation is a sentence-level dataset that covers 11 local languages of Indonesia. It was human-translated from a subset of the IndoLEM Sentiment and EmoT datasets: native-speaker annotators were asked to translate each Indonesian sentence into the target language. The data comprise ~72k translation sentence pairs.
License: CC-BY-NC-SA 4.0

catlaughing commented 1 year ago

self-assign

SamuelCahyawijaya commented 11 months ago

Closed in https://github.com/IndoNLP/nusa-crowd/pull/364

fhudi commented 10 months ago

@SamuelCahyawijaya @catlaughing Bugs found in this dataset:

  1. Batak (btk) cannot be loaded
  2. Unable to load complete language pairs

SamuelCahyawijaya commented 10 months ago

@fhudi: The issue has been fixed. Please try updating to nusacrowd==0.1.2.
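
For anyone checking whether their environment already has the fixed release, here is a minimal sketch that uses only the Python standard library. The helper names (`parse_version`, `version_at_least`, `installed_version`) are made up for illustration and are not part of nusacrowd. It assumes plain dotted numeric versions such as 0.1.2, so a pre-release suffix would need a proper version parser.

```python
from importlib import metadata


def parse_version(text):
    """Turn a plain dotted version such as '0.1.2' into a tuple of ints.

    Assumption: the version has only numeric components (no 'rc1', '.dev0', ...).
    """
    return tuple(int(part) for part in text.split("."))


def version_at_least(installed, required):
    """Return True if `installed` is the same as or newer than `required`."""
    return parse_version(installed) >= parse_version(required)


def installed_version(package):
    """Return the installed version string of `package`, or None if it is absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


if __name__ == "__main__":
    found = installed_version("nusacrowd")
    if found is None:
        print("nusacrowd is not installed")
    elif version_at_least(found, "0.1.2"):
        print(f"nusacrowd {found} is at or above 0.1.2")
    else:
        print(f"nusacrowd {found} is older than 0.1.2; try: pip install -U 'nusacrowd>=0.1.2'")
```

If the script reports an older version, upgrading and then retrying the Batak (btk) load is the natural next step.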