SamuelCahyawijaya commented 1 year ago

NusaCatalogue: https://indonlp.github.io/nusa-catalogue/card.html?id_en_code_mixed

Dataset	id_en_code_mixed
Description	This dataset contains 825 Indonesian-English tweets annotated for four NLP tasks: tokenization, language identification, lexical normalization, and word translation. The data for the lexical normalization task is curated in MultiLexNorm (already in Nusa Catalogue), but the data for the other tasks is not. Tokenizing social media text is not as trivial as splitting on whitespace. In this dataset, language identification is performed at the token level.
License	CC-BY-NC-SA 4.0
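
To illustrate the tokenization point in the description, here is a minimal sketch comparing a whitespace split with a simple regex tokenizer. The tweet and the regex are made up for illustration; they are not taken from the dataset and are not its annotation scheme.

```python
import re

# Hypothetical tweet, not from the dataset.
tweet = "otw kampus,btw udah makan? :) #lapar @budi"

# Naive whitespace split: punctuation stays glued to neighboring words.
whitespace_tokens = tweet.split()

# Illustrative pattern (an assumption, not the dataset's rules):
# keep simple emoticons, hashtags, and mentions whole, and split
# any other punctuation mark into its own token.
TOKEN_RE = re.compile(r"[:;]-?[()DPp]|[#@]\w+|\w+|[^\w\s]")
regex_tokens = TOKEN_RE.findall(tweet)

print(whitespace_tokens)
print(regex_tokens)
```

The whitespace split yields `kampus,btw` and `makan?` as single tokens, while the regex separates the words from the punctuation. Real tweets need more care than this (URLs, elongated words, emoji), which is presumably why the dataset provides gold tokenization.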

VanillaMacchiato commented 1 year ago

self-assign

haryoa commented 1 year ago

IndoNLP / nusa-crowd

Create dataset loader for id-en-code-mixed #303

self-assign