SEACrowd / seacrowd-datahub

A collaborative project to collect datasets in SEA languages, SEA regions, or SEA cultures.
Apache License 2.0

Create dataset loader for LTI LangID Corpus #364

Closed. SamuelCahyawijaya closed this issue 5 months ago

SamuelCahyawijaya commented 8 months ago

Dataloader name: lti_langid_corpus/lti_langid_corpus.py
DataCatalogue: http://seacrowd.github.io/seacrowd-catalogue/card.html?lti_langid_corpus

Dataset: lti_langid_corpus
Description: The LTI LangID corpus is a dataset for language identification. The most recent version, v5, contains training data for 1266 languages and at least some (possibly very small) amount of text for a total of 1706 languages. The corpus defines a train/test split and, for languages with sufficient data, a dev split as well.
Subsets: -
Languages: ifa, ace, btm, mqj, mbs, nbq, gor, zyp, tbl, kjp, kmk, kqe, ptu, blw, ceb, prf, yva, zlm, bps, tdt, mya, dgc, lus, wrs, abx, rgu, aaz, agn, ccp, cmn, jav, obo, due, msk, xsb, syb, ind, lbk, min, smk, att, nod, tdj, atb, atd, pag, hvn, ksc, lao, kkl, lti, dao, cgc, tbk, gdg, amk, mbd, clu, msb, sbl, cek, khm, yue, sgb, beu, eip, ifu, mnw, suc, bgs, pam, ebk, eng, nfa, cbk, ify, csy, heg, shn, mta, mbt, lex, tha, mmn, sun, vie, llg, xnn, txq, bcl, kje, kne, san, hlt, kyu, bkd, duo, tet, ury, yka, bjn, tiy, ivv, agt, ban, blz, mbi, ilo, mkn, isd, cth, bpr, por, mbb, tgl, msm, ivb, tam, plw, alp, row
Tasks: Language Identification, Language Modeling
License: Other (other)
Homepage: https://www.cs.cmu.edu/~ralf/langid.html
HF URL: -
Paper URL: https://aclanthology.org/D14-1069.pdf
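
Since the description says a dev split exists only for languages with enough data, a loader probably has to treat that split as optional. The following is a minimal sketch of that idea, not the actual dataloader. The directory layout (`<root>/<split>/<lang>.txt`, one sample per line), the helper names `find_split_files` and `generate_examples`, and the `id`/`text`/`label` fields are all assumptions made for illustration; the real corpus archive may be organised differently.

```python
from pathlib import Path
from typing import Dict, Iterator, List, Tuple

# Assumed (hypothetical) layout: <root>/<split>/<lang>.txt, one sample per line.
SPLITS = ("train", "dev", "test")


def find_split_files(root: Path, languages: List[str]) -> Dict[str, Dict[str, Path]]:
    """Map each split to {language code: file}, skipping files that do not exist."""
    found: Dict[str, Dict[str, Path]] = {split: {} for split in SPLITS}
    for split in SPLITS:
        for lang in languages:
            path = root / split / f"{lang}.txt"
            if path.is_file():  # dev (or any split) may be missing for a language
                found[split][lang] = path
    return found


def generate_examples(files: Dict[str, Path]) -> Iterator[Tuple[int, dict]]:
    """Yield (key, example) pairs holding a text line and its language label."""
    key = 0
    for lang, path in sorted(files.items()):
        with path.open(encoding="utf-8") as handle:
            for line in handle:
                text = line.strip()
                if text:  # ignore blank lines
                    yield key, {"id": str(key), "text": text, "label": lang}
                    key += 1
```

A loader built this way would expose only the splits that ended up non-empty, for example `[s for s in SPLITS if found[s]]`, so languages without dev data would not produce an empty dev split.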

ssun32 commented 7 months ago

self-assign