SEACrowd / seacrowd-datahub

A collaborative project to collect datasets in SEA languages, SEA regions, or SEA cultures.
Apache License 2.0

Create dataset loader for Taxi1500 v2 #520

Open SamuelCahyawijaya opened 5 months ago

SamuelCahyawijaya commented 5 months ago

Dataloader name: taxi1500_v2/taxi1500_v2.py
DataCatalogue: http://seacrowd.github.io/seacrowd-catalogue/card.html?taxi1500_v2

Dataset: taxi1500_v2
Description: Taxi1500 is a text classification dataset for evaluating the cross-lingual generalization ability of multilingual pre-trained language models. It introduces a sentence classification task with 6 topics and covers 1502 typologically diverse languages spanning 112 language families.
Subsets: -
Languages: aaz, zyp, zsm, yva, xnn, wrs, tiy, tha, tgl, tbl, tbk, smk, ptu, prf, obo, nlc, nbq, mya, msm, msk, msb, mqj, mkn, mbt, mbs, lex, lbk, ksw, kmk, kkl, kje, heg, fil, ebk, dgc, clu, ceb, cbk, blz, blw, bkd, bgs, att, atd, atb, amk, alp, agt, abx, agn
Tasks: Text Classification
License: Unknown (unknown)
Homepage: text at https://github.com/cisnlp/Taxi1500/tree/main/Taxi1500-c_v2.0; labels at https://github.com/cisnlp/Taxi1500/tree/main/corpus_obtain/train_dev_test
HF URL: -
Paper URL: https://arxiv.org/abs/2305.08487
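
Since no subsets are listed but 50 language codes are, one possible layout is a pair of configs per language. The sketch below only enumerates config names. The `<dataset>_<lang>_source` / `<dataset>_<lang>_seacrowd_text` pattern, the `seacrowd_text` schema name, and the helper `config_names` are assumptions for illustration; this issue does not specify them.

```python
from typing import List, Tuple

# Language codes copied from the "Languages" field of this issue.
LANGUAGES: List[str] = [
    "aaz", "zyp", "zsm", "yva", "xnn", "wrs", "tiy", "tha", "tgl", "tbl",
    "tbk", "smk", "ptu", "prf", "obo", "nlc", "nbq", "mya", "msm", "msk",
    "msb", "mqj", "mkn", "mbt", "mbs", "lex", "lbk", "ksw", "kmk", "kkl",
    "kje", "heg", "fil", "ebk", "dgc", "clu", "ceb", "cbk", "blz", "blw",
    "bkd", "bgs", "att", "atd", "atb", "amk", "alp", "agt", "abx", "agn",
]

DATASET_NAME = "taxi1500_v2"
# Assumption: the text-classification schema is exposed as "seacrowd_text".
SEACROWD_SCHEMA = "seacrowd_text"


def config_names(lang: str) -> Tuple[str, str]:
    """Return hypothetical (source, seacrowd) config names for one language."""
    return (
        f"{DATASET_NAME}_{lang}_source",
        f"{DATASET_NAME}_{lang}_{SEACROWD_SCHEMA}",
    )


# Flat list of every config name the loader would expose under this layout.
ALL_CONFIGS: List[str] = [
    name for lang in LANGUAGES for name in config_names(lang)
]
```

Under this layout, `config_names("tha")` gives the two names for Thai, and `ALL_CONFIGS` holds two entries per listed language.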

tellarin commented 5 months ago

self-assign