dwhitena closed this 4 years ago
Hey @dwhitena, thanks for this PR. Looking forward to getting more languages into COVID-QA!
You are right about non-English languages being routed to Elasticsearch for plain-text matching. As suggested, we should have the languages available in our database (for now German, Swedish, Polish and Italian) and route languages that are not present to other sources. I will work on this in the coming week.
Do you think the rate limit on the API could be a problem? When requesting the key I saw: Query rate limits: 3 / second, 200 / day
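The routing described above could be sketched roughly as follows. This is only an illustration: the ISO 639-1 codes and the backend labels `"database"` and `"elasticsearch"` are assumptions, not identifiers taken from the COVID-QA codebase.

```python
# Languages named in this thread as available in the database.
# Using ISO 639-1 codes here is an assumption.
DB_LANGUAGES = {"de", "sv", "pl", "it"}


def route_question(lang_code: str) -> str:
    """Return a label for the backend that should answer a question.

    The returned labels are hypothetical placeholders. How English
    questions are handled is outside the scope of this sketch.
    """
    if lang_code in DB_LANGUAGES:
        return "database"
    # Languages not present in the database go to other sources,
    # e.g. plain-text matching in Elasticsearch.
    return "elasticsearch"
```

For example, `route_question("pl")` selects the database path, while a code that is not in `DB_LANGUAGES` falls through to the Elasticsearch path.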
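A caller that wants to stay within the quoted limits could track them on the client side. The sketch below assumes a sliding-window reading of "3 / second" and "200 / day"; how the API itself counts requests is not stated in this thread, and `QuotaLimiter` is a hypothetical helper, not part of any existing library.

```python
import time
from collections import deque


class QuotaLimiter:
    """Track calls against a per-second and a per-day quota."""

    def __init__(self, per_second=3, per_day=200, clock=time.monotonic):
        self.per_second = per_second
        self.per_day = per_day
        self.clock = clock  # injectable so the limiter can be tested
        self.calls = deque()  # timestamps of accepted calls in the last day

    def allow(self) -> bool:
        """Record a call and return True if both quotas permit it."""
        now = self.clock()
        # Forget calls that are at least one day old.
        while self.calls and now - self.calls[0] >= 86400:
            self.calls.popleft()
        if len(self.calls) >= self.per_day:
            return False
        # Count the accepted calls that fall within the last second.
        recent = sum(1 for t in self.calls if now - t < 1.0)
        if recent >= self.per_second:
            return False
        self.calls.append(now)
        return True
```

A request would only be sent when `allow()` returns `True`; otherwise the caller could skip the remote lookup or retry later.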
Regarding your questions @Timoeller:
On the Elasticsearch method, do we need to supply a language code to Elasticsearch? Or does it do its search auto-magically in some way?
Regarding the rate limit, I have control over that and can update it. Based on how you have seen COVID-QA being used, what might you expect in terms of traffic?
Hey @dwhitena, sorry for the delay.
Concerning your questions.
Regarding integrating "wash your hands" translations into our UI: it will be fast and easy to insert your poster elements CSV into our backend, display the "wash your hands" translation in the specific language, and link to your website. Is there already a CSV with more translations publicly available?
I tested the SIL language detection and realized it is slowing down the requests and might be unstable for some languages in our database:
So I changed the code to query SIL language detection only when the question does not match any of the languages present in our DB. See #106.
Hope that is fine with you?
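The fallback order described in this comment might look roughly like the sketch below. It is not the code from #106: `local_detect` and `remote_detect` are hypothetical callables standing in for the existing detector and the SIL API client, the language codes are assumed, and returning the local guess when the remote call yields nothing is an added assumption.

```python
from typing import Callable, Optional, Set

# Languages named in this thread as present in the database (codes assumed).
DB_LANGUAGES: Set[str] = {"de", "sv", "pl", "it"}


def detect_language(
    text: str,
    local_detect: Callable[[str], Optional[str]],
    remote_detect: Callable[[str], Optional[str]],
    db_languages: Set[str] = DB_LANGUAGES,
) -> Optional[str]:
    """Try the fast local detector first; use the remote API only as a fallback."""
    lang = local_detect(text)
    if lang in db_languages:
        return lang
    # No match with a database language, so pay for the slower remote lookup.
    remote = remote_detect(text)
    # Assumption: keep the local guess if the remote lookup returns nothing.
    return remote if remote is not None else lang
```

With this ordering, questions in the database languages never trigger a remote request, which keeps both the latency and the API quota usage down for the common case.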
Getting the changes into the actual UI will require an update. I will talk to our data engineer at the start of next week about doing it.
Closing this now as the changes are taken over by #106.
@Timoeller Thanks very much for the update here. I'm going to circle back around to the API this week with this feedback and make some updates. It's very helpful to get some usage to help us upgrade the performance.
This PR integrates more inclusive language identification than cld2/3. To this end, the SIL Language Identification API is used as the default language identification model. The API currently supports 1035 languages, including many lower-resourced languages, and by using it COVID-QA can hopefully start leveraging a variety of data gathered in lower-resourced languages (e.g., via Elasticsearch). Some sources of this information are the SIL COVID resources, the Endangered Languages Project, and this repo.
Important points
config.py. These can be obtained for free at developers.sil.org
How to test
Other suggestions/notes:
Supported languages in the API
Supported via Text Classification: biv, cub, cmo, hvn, kus, yal, pwg, myv, guo, des, leu, eip, cso, zia, kri, mca, kno, zza, maz, bps, qub, rmy, lvs, tab, nld, moa, ssg, maw, pww, sab, udm, zsm, zao, dzo, gnw, bru, kog, cwe, bim, tgo, mlh, blz, ckt, lok, smo, kpq, eng, nnq, kmr, pir, cab, tuo, bvc, xte, txu, pny, klv, jic, khm, mhy, yli, kha, dop, ojb, gvl, meq, cof, qvo, kqe, btd, bwu, nii, arb, xtn, top, lex, lob, cjp, por, ote, tmc, sun, grt, mcd, sja, naw, plw, zas, soq, khq, cek, ozm, kud, ted, bmq, pan, rjs, ktb, nhw, krj, ycn, ita, tir, prg, kpf, qup, msy, emp, ncu, qxo, ell, mzk, tim, yaz, dtb, upv, cou, noa, nhy, adh, cly, saj, fuq, rmo, gla, sim, apd, kpr, ota, kqp, gso, afr, kxc, mbt, wiu, pbb, cor, qul, gwr, twu, qve, arl, bku, alz, mto, bak, guc, lat, kgr, agm, cwt, iws, mip, ctp, khz, kyq, vie, dad, dug, yas, irk, kez, mza, nou, yue, law, kur, atg, mco, acr, lhu, myb, tik, djk, hae, tpp, yuj, mwq, rav, kzj, tuk, pbi, ffm, kmo, ybb, bgz, slk, cbv, gof, bjz, jiv, lln, xrb, cjo, qxn, prk, cot, xed, dgi, nsn, mpm, bzj, kne, cnl, bhw, gyr, akh, ntp, pls, aoz, som, tlf, xsb, eus, mfe, hak, aby, mej, myw, dsb, kru, snw, tpt, cle, nyy, tgp, agd, btt, mf1, quz, swg, sck, dyo, qvc, due, mmo, nca, oss, urt, hrv, btx, ban, pib, iri, sba, kub, lif, npl, icr, mbh, amu, sag, zpt, pss, gle, azz, hag, lzh, acu, ara, hns, zpq, mio, zty, cuc, usa, dan, miq, akb, nyo, cbi, caa, gdn, pms, mpt, wer, teo, ghs, mxt, fin, mjv, kwd, cax, zpl, ntr, ake, nog, tlj, aah, ach, mit, fij, apz, ceb, gde, gdr, mcp, cui, twb, mta, ncj, ino, men, mhi, mir, pez, quy, yre, asm, bdd, zpm, hot, zpi, kao, kyu, mvc, zpz, nzi, stp, srp, dik, guk, hat, zca, opm, aso, way, uig, krs, dig, sbl, glg, ava, avk, mkd, con, jac, mbb, heb, ces, mwv, wob, ddn, fuf, jbu, chr, kms, kwi, soy, qvn, rap, sxn, sgw, rel, ukr, gnd, bgt, thk, nob, dga, mie, orv, kyz, guh, pag, pse, tfr, cul, bhl, xsr, vag, qvw, nst, azg, muv, pad, cco, ese, gcf, pol, akp, sey, bex, vut, pam, lus, gvc, vol, stn, kdc, gym, med, wuv, 
gng, pui, kle, arz, myx, aak, hif, ian, sig, ign, mvp, xuo, kup, bbr, amf, zai, cya, nia, raw, nyf, ayp, czt, saq, zae, sah, kzf, swe, jam, poi, dob, hnn, mhr, okv, aze, gor, nij, aai, mkl, ron, isl, cpb, mup, nod, sus, knf, laj, nnb, tqo, bfd, cok, alj, pcm, kpw, myk, bbo, uvl, jbo, kia, kat, mux, agn, bjv, tly, mak, ixi, spp, xtd, ifu, urd, bom, bel, ruf, mhl, kek, bts, nhe, duo, mfz, otq, trs, old, bus, dbq, tcc, bba, cat, tee, cfm, bef, nwb, tca, dgz, cnk, crn, dah, chv, kwf, aom, bcl, nfr, fal, tpw, gos, crh, tnr, deu, yuw, oku, hoc, luc, rim, zar, ndy, pbc, udu, daa, miy, mog, obo, aia, knk, sgb, kbh, aoj, gaw, jvn, hsb, ljp, rnl, acc, avt, kbm, sbd, nhi, itv, yle, kbp, mzm, ame, amk, srn, ido, mqj, acm, box, xla, gag, tem, ses, boa, lmk, ker, bov, lew, bul, gbo, bmv, agu, aau, kkj, smt, ziw, ind, ter, hla, xsu, lef, qwh, zpu, xal, adj, gux, rus, ztq, kij, lgg, alp, frd, agr, miz, nin, mfq, gmv, urb, bpr, hye, boj, bua, wnu, naf, tgl, acd, sgz, lsm, yat, ton, for, fuv, wwa, tue, atq, iry, kyc, rai, pab, grn, hus, tav, lao, sda, tat, ilo, ury, nyn, lis, nkf, mtp, mxb, waj, kpz, aeu, krc, rwo, tbo, mai, avn, npy, vid, wba, mox, sne, yaa, hun, ben, mhx, viv, bav, vun, tuf, gur, cmr, cgc, sld, aon, ttc, ura, wap, dyi, gwi, ann, kue, quw, cbc, mfy, mtj, mya, mti, mgo, ppk, tac, est, ngp, pkb, zpo, cap, zab, fuh, tbc, dos, mag, mcu, xmm, cmn, mil, mww, apr, big, cdf, gvf, mda, lad, cnt, ipi, bon, kki, mqb, gum, kab, ctd, cme, ong, taj, usp, tpz, moz, ina, kvn, quf, thv, mlp, hin, sps, eka, bmr, sdm, mop, ubu, bnj, lem, gog, kbr, ahk, enb, gej, mif, uzb, ixl, dtp, yid, mnb, mpg, bss, ccp, muy, kto, avu, tos, dww, car, qvm, neb, csk, yrb, amn, jun, imo, nmz, gbi, maa, snc, lip, jpn, nak, bkv, awb, iba, mqf, tvw, xsm, cym, cuk, guu, mxv, nan, coe, mgh, msm, fra, sue, amr, rom, gai, kcg, mur, ctg, nlc, nch, yss, gfk, bos, myy, mxq, kpv, dnw, lac, shn, taq, tna, thl, kdj, spa, jav, anv, atb, cbs, ken, yam, asg, spl, zaw, gah, tnn, alt, enq, sqi, mib, yad, zyp, lit, sur, 
ife, ktj, ifk, nsu, abt, hne, rro, zpc, mfi, gun, hil, run, qxh, lia, dts, lee, ltz, mzw, pis, epo, ptp, tzj, chz, nim, pes, tzt, ngu, mor, pao, wmw, dsh, bwq, sny, zaa, ber, yut, keo, faa, kxm, ndz, arq, not, cko, ceg, dgk, gqr, tlb, bxr, kaq, mnf, jmc, mar, muh, inb, knj, tha, prf, mon, nnw, nhu, mfh, bcw, bre, kmd, acn, quc, hig, pah, sri, bfo, ade, bgr, rmc, cjv, auy, amh, war, guq, bud, tby, tlh, hrx, pmf, kwj, awa, mee, vmy, cpa, heh, dwr, cni, gui, kje, sas, srm, wuu, buk, lgl, xav, kyf, lwo, mal, fai, far, lww, oci, blt, pau, hto, cbr, abi, mbc, mim, rej, sml, min, yby, nno, lfn, roo, kor, yva, toc, tnk, knv, bvz, nds, gna, nhx, nuj, kjh, urk, gub, amm, nho, huu, qvz, mva, ile, grc, bao, mfk, sil, cbk, sll, snn, mcq, mek, slv, ksr, qvs, kaz, kqy, bkl, bib, tur, yml, suk, kaa, huv, krl, bmh, kze, csb, ape, ppo, ttr, ndj, hub, tte, ess, zos, nvm
Supported via rule-based methods (based on unicode blocks and writing system scripts): xsr, ind, cmo, lif, ron, ojb, nod, mww, men, jun, rus, btd, jpn, hil, arb, pol, run, mai, alt, kyu, amh, taj, war, nld, pam, ljp, bud, lus, grt, mak, sun, akb, dzo, bku, urd, bru, tgl, pag, som, kbp, arz, pan, bts, vie, sas, gag, kxm, mjv, taq, fuf, ita, chr, bul, mya, jav, atb, blt, ceb, ccp, bcl, lao, ilo, mar, oss, hnn, btx, rej, ban, lis
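To illustrate what a rule based on writing systems can look like, the sketch below finds the dominant script of a text from Unicode character names. It is not the SIL API's implementation. The script name is approximated by the first word of each character's Unicode name, and the mapping from a script to a language code is left out because it is not described here. A script identifies a language outright only when a single supported language uses it; otherwise further rules would be needed.

```python
import unicodedata
from collections import Counter
from typing import Optional


def dominant_script(text: str) -> Optional[str]:
    """Return the most common script among the letters in `text`.

    The script is approximated by the first word of each character's
    Unicode name, e.g. "CYRILLIC", "ETHIOPIC" or "LATIN".
    """
    counts = Counter()
    for ch in text:
        if not ch.isalpha():
            continue  # ignore digits, punctuation and whitespace
        name = unicodedata.name(ch, "")
        if name:
            counts[name.split()[0]] += 1
    if not counts:
        return None
    return counts.most_common(1)[0][0]
```

For instance, `dominant_script("Привет")` yields `"CYRILLIC"`, and a string with no letters yields `None`.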