SEACrowd / seacrowd-datahub

A collaborative project to collect datasets in SEA languages, SEA regions, or SEA cultures.
Apache License 2.0

Create dataset loader for UniSent #626

Closed SamuelCahyawijaya closed 1 month ago

SamuelCahyawijaya commented 3 months ago

Dataloader name: unisent/unisent.py
DataCatalogue: http://seacrowd.github.io/seacrowd-catalogue/card.html?unisent

Dataset: unisent
Description: UniSent is a set of universal sentiment lexica for 1000+ languages. To build UniSent, the authors use a massively parallel Bible corpus to project sentiment information from English to other languages for sentiment analysis on Twitter data. Of the 1,404 languages covered, 173 are spoken in Southeast Asia.
Subsets: -
Languages: aaz, abx, ace, agn, agt, ahk, akb, alj, alp, amk, aoz, atb, atd, att, ban, bbc, bcl, bgr, bgs, bgz, bhp, bkd, bku, blw, blz, bpr, bps, bru, btd, bth, bto, bts, btx, bug, bvz, bzi, cbk, ceb, cfm, cgc, clu, cmo, cnh, cnw, csy, ctd, czt, dgc, dtp, due, duo, ebk, fil, gbi, gor, heg, hil, hnj, hnn, hvn, iba, ifa, ifb, ifk, ifu, ify, ilo, ind, iry, isd, itv, ium, ivb, ivv, jav, jra, kac, khm, kix, kje, kmk, kne, kqe, krj, ksc, ksw, kxm, lao, lbk, lew, lex, lhi, lhu, ljp, lus, mad, mak, mbb, mbd, mbf, mbi, mbs, mbt, mej, mkn, mnb, mog, mqj, mqy, mrw, msb, msk, msm, mta, mtg, mtj, mvp, mwq, mwv, mya, nbe, nfa, nia, nij, nlc, npy, obo, pag, pam, plw, pmf, pne, ppk, prf, prk, ptu, pww, sas, sbl, sda, sgb, smk, sml, sun, sxn, szb, tbl, tby, tcz, tdt, tgl, tha, tih, tlb, twu, urk, vie, war, whk, wrs, xbr, yli, yva, zom, zyp, pse, mnx, mmn, lsi, hlt, gdg, bnj, acn
Tasks: Sentiment Analysis
License: Creative Commons Attribution Non Commercial No Derivatives 4.0 (cc-by-nc-nd-4.0)
Homepage: https://github.com/ehsanasgari/UniSent
HF URL: -
Paper URL: https://aclanthology.org/2020.lrec-1.506/
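
For whoever picks this up, here is a rough sketch of the parsing step a loader might need. It assumes each lexicon file holds one entry per line as `word<TAB>polarity`, with polarity `1` or `-1`. That layout is an assumption and is not stated in this issue, so check it against the files on the homepage first. The function name `parse_lexicon` and the sample words are made up for illustration and are not taken from the dataset.

```python
from typing import Dict, Iterable


def parse_lexicon(lines: Iterable[str]) -> Dict[str, int]:
    """Parse 'word<TAB>polarity' lines into a {word: polarity} mapping.

    The tab-separated layout is an assumption, not confirmed by the issue.
    """
    lexicon: Dict[str, int] = {}
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # skip blank lines
        # Split on the last tab so a word containing tabs stays intact.
        word, polarity = line.rsplit("\t", 1)
        lexicon[word] = int(polarity)
    return lexicon


# Made-up sample entries, for illustration only.
sample = ["baik\t1\n", "buruk\t-1\n", "\n"]
lexicon = parse_lexicon(sample)
print(lexicon)  # {'baik': 1, 'buruk': -1}
```

In an actual loader the same function could be fed an open file handle per language code, yielding one example per word.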
Gyyz commented 2 months ago

self-assign