IndoNLP / nusa-crowd

A collaborative project to collect datasets in Indonesian languages.
Apache License 2.0

Create dataset loader for IJELID (Indonesian-Javanese-English Code-Mixed Language Identification) #345

Open SamuelCahyawijaya opened 1 year ago

SamuelCahyawijaya commented 1 year ago

NusaCatalogue: https://indonlp.github.io/nusa-catalogue/card.html?ijelid

Dataset: ijelid
Description: This is a cleaned version of code-mixed Indonesian-Javanese-English data for token-level language identification. We name this dataset IJELID (Indonesian-Javanese-English Language Identification). The dataset contains tokenized tweets, with each token paired with its language label. There are seven language labels: ID (Indonesian), JV (Javanese), EN (English), MIX_ID_EN (mixed Indonesian-English), MIX_ID_JV (mixed Indonesian-Javanese), MIX_JV_EN (mixed Javanese-English), and OTH (other).
License: CC-BY 4.0
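
For whoever picks up the loader, here is a minimal sketch of how the seven labels from the description could be mapped to integer ids for a token-level sequence labeling record. The names `LABELS`, `LABEL2ID`, and `encode_example`, the `tokens`/`labels` record layout, and the sample sentence with its labels are all assumptions made for illustration. None of them come from the dataset or from the nusa-crowd codebase, and the actual file format of IJELID still needs to be checked.

```python
from typing import Dict, List, Tuple

# The seven language labels listed in the dataset description.
LABELS = ["ID", "JV", "EN", "MIX_ID_EN", "MIX_ID_JV", "MIX_JV_EN", "OTH"]
LABEL2ID: Dict[str, int] = {label: i for i, label in enumerate(LABELS)}


def encode_example(pairs: List[Tuple[str, str]]) -> Dict[str, list]:
    """Convert (token, label) pairs into parallel token and label-id lists.

    Hypothetical helper: raises ValueError on a label outside LABELS so that
    malformed rows are caught while the loader is being written.
    """
    tokens: List[str] = []
    label_ids: List[int] = []
    for token, label in pairs:
        if label not in LABEL2ID:
            raise ValueError(f"unknown label: {label!r}")
        tokens.append(token)
        label_ids.append(LABEL2ID[label])
    return {"tokens": tokens, "labels": label_ids}


# Made-up sentence and labels, for illustration only (not taken from IJELID).
example = encode_example(
    [("saya", "ID"), ("ora", "JV"), ("ngerti", "JV"), ("deadline", "EN")]
)
print(example)
```

Keeping the label list in the same order as the description makes the id mapping easy to audit, and failing on unknown labels surfaces annotation typos early instead of silently mapping them to OTH.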