IndoNLP / nusa-crowd

A collaborative project to collect datasets in Indonesian languages.
Apache License 2.0

Create dataset loader for id-en-code-mixed #303

Open SamuelCahyawijaya opened 1 year ago

SamuelCahyawijaya commented 1 year ago

NusaCatalogue: https://indonlp.github.io/nusa-catalogue/card.html?id_en_code_mixed

Dataset: id_en_code_mixed
Description: This dataset contains 825 Indonesian-English tweets covering four NLP tasks: tokenization, language identification, lexical normalization, and word translation. The data for the lexical normalization task is curated in MultiLexNorm (already in Nusa Catalogue), but the data for the other tasks is not. Tokenizing social media text is not as trivial as splitting on whitespace. In this dataset, language identification is performed at the token level.
License: CC-BY-NC-SA 4.0

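To illustrate the point about tokenization, here is a minimal sketch comparing a plain whitespace split with a simple regex tokenizer. The tweet, the `TOKEN_PATTERN` regex, and the `regex_tokenize` helper are all made up for illustration; they are not taken from the dataset and are not the tokenization scheme its authors used.

```python
import re

# Hypothetical code-mixed tweet, invented for illustration (not from the dataset).
tweet = "Besok meeting jam 9,jangan late ya!!! @budi #kerja"

# Naive approach: split on whitespace only.
whitespace_tokens = tweet.split()

# Illustrative pattern (an assumption, not the dataset's scheme):
# keep URLs, mentions, and hashtags whole, then runs of word characters,
# then any single non-space symbol as its own token.
TOKEN_PATTERN = re.compile(r"https?://\S+|[@#]\w+|\w+|[^\w\s]")


def regex_tokenize(text):
    """Return the tokens matched by TOKEN_PATTERN, in order of appearance."""
    return TOKEN_PATTERN.findall(text)


regex_tokens = regex_tokenize(tweet)

print(whitespace_tokens)
print(regex_tokens)
```

The whitespace split leaves "9,jangan" and "ya!!!" glued together, while the regex version separates the number, the punctuation, and the words. Real social media text has many more cases (emoticons, elongated words, abbreviations), so a loader should expose the dataset's own token boundaries rather than re-tokenize.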
VanillaMacchiato commented 1 year ago

self-assign

haryoa commented 1 year ago

self-assign