deepset-ai / COVID-QA

API & Webapp to answer questions about COVID-19. Using NLP (Question Answering) and trusted data sources.
Apache License 2.0
344 stars, 121 forks

Integrate SIL language identification API #104

Closed dwhitena closed 4 years ago

dwhitena commented 4 years ago

This PR integrates language identification that is more inclusive than cld2/3. To this end, the SIL Language Identification API is used as the default language identification model. The API currently supports 1035 languages, including many lower-resourced languages. Hopefully, by using this language ID, COVID-QA can start leveraging a variety of data gathered in lower-resourced languages (e.g., via elasticsearch). Some sources of this data are the SIL COVID resources, the Endangered Languages Project, and this repo.

Important points

How to test

$ cd covid_nlp/language/
$ python detect_language.py

Other suggestions/notes:

Supported languages in the API

Supported via Text Classification: biv, cub, cmo, hvn, kus, yal, pwg, myv, guo, des, leu, eip, cso, zia, kri, mca, kno, zza, maz, bps, qub, rmy, lvs, tab, nld, moa, ssg, maw, pww, sab, udm, zsm, zao, dzo, gnw, bru, kog, cwe, bim, tgo, mlh, blz, ckt, lok, smo, kpq, eng, nnq, kmr, pir, cab, tuo, bvc, xte, txu, pny, klv, jic, khm, mhy, yli, kha, dop, ojb, gvl, meq, cof, qvo, kqe, btd, bwu, nii, arb, xtn, top, lex, lob, cjp, por, ote, tmc, sun, grt, mcd, sja, naw, plw, zas, soq, khq, cek, ozm, kud, ted, bmq, pan, rjs, ktb, nhw, krj, ycn, ita, tir, prg, kpf, qup, msy, emp, ncu, qxo, ell, mzk, tim, yaz, dtb, upv, cou, noa, nhy, adh, cly, saj, fuq, rmo, gla, sim, apd, kpr, ota, kqp, gso, afr, kxc, mbt, wiu, pbb, cor, qul, gwr, twu, qve, arl, bku, alz, mto, bak, guc, lat, kgr, agm, cwt, iws, mip, ctp, khz, kyq, vie, dad, dug, yas, irk, kez, mza, nou, yue, law, kur, atg, mco, acr, lhu, myb, tik, djk, hae, tpp, yuj, mwq, rav, kzj, tuk, pbi, ffm, kmo, ybb, bgz, slk, cbv, gof, bjz, jiv, lln, xrb, cjo, qxn, prk, cot, xed, dgi, nsn, mpm, bzj, kne, cnl, bhw, gyr, akh, ntp, pls, aoz, som, tlf, xsb, eus, mfe, hak, aby, mej, myw, dsb, kru, snw, tpt, cle, nyy, tgp, agd, btt, mf1, quz, swg, sck, dyo, qvc, due, mmo, nca, oss, urt, hrv, btx, ban, pib, iri, sba, kub, lif, npl, icr, mbh, amu, sag, zpt, pss, gle, azz, hag, lzh, acu, ara, hns, zpq, mio, zty, cuc, usa, dan, miq, akb, nyo, cbi, caa, gdn, pms, mpt, wer, teo, ghs, mxt, fin, mjv, kwd, cax, zpl, ntr, ake, nog, tlj, aah, ach, mit, fij, apz, ceb, gde, gdr, mcp, cui, twb, mta, ncj, ino, men, mhi, mir, pez, quy, yre, asm, bdd, zpm, hot, zpi, kao, kyu, mvc, zpz, nzi, stp, srp, dik, guk, hat, zca, opm, aso, way, uig, krs, dig, sbl, glg, ava, avk, mkd, con, jac, mbb, heb, ces, mwv, wob, ddn, fuf, jbu, chr, kms, kwi, soy, qvn, rap, sxn, sgw, rel, ukr, gnd, bgt, thk, nob, dga, mie, orv, kyz, guh, pag, pse, tfr, cul, bhl, xsr, vag, qvw, nst, azg, muv, pad, cco, ese, gcf, pol, akp, sey, bex, vut, pam, lus, gvc, vol, stn, kdc, gym, med, wuv, 
gng, pui, kle, arz, myx, aak, hif, ian, sig, ign, mvp, xuo, kup, bbr, amf, zai, cya, nia, raw, nyf, ayp, czt, saq, zae, sah, kzf, swe, jam, poi, dob, hnn, mhr, okv, aze, gor, nij, aai, mkl, ron, isl, cpb, mup, nod, sus, knf, laj, nnb, tqo, bfd, cok, alj, pcm, kpw, myk, bbo, uvl, jbo, kia, kat, mux, agn, bjv, tly, mak, ixi, spp, xtd, ifu, urd, bom, bel, ruf, mhl, kek, bts, nhe, duo, mfz, otq, trs, old, bus, dbq, tcc, bba, cat, tee, cfm, bef, nwb, tca, dgz, cnk, crn, dah, chv, kwf, aom, bcl, nfr, fal, tpw, gos, crh, tnr, deu, yuw, oku, hoc, luc, rim, zar, ndy, pbc, udu, daa, miy, mog, obo, aia, knk, sgb, kbh, aoj, gaw, jvn, hsb, ljp, rnl, acc, avt, kbm, sbd, nhi, itv, yle, kbp, mzm, ame, amk, srn, ido, mqj, acm, box, xla, gag, tem, ses, boa, lmk, ker, bov, lew, bul, gbo, bmv, agu, aau, kkj, smt, ziw, ind, ter, hla, xsu, lef, qwh, zpu, xal, adj, gux, rus, ztq, kij, lgg, alp, frd, agr, miz, nin, mfq, gmv, urb, bpr, hye, boj, bua, wnu, naf, tgl, acd, sgz, lsm, yat, ton, for, fuv, wwa, tue, atq, iry, kyc, rai, pab, grn, hus, tav, lao, sda, tat, ilo, ury, nyn, lis, nkf, mtp, mxb, waj, kpz, aeu, krc, rwo, tbo, mai, avn, npy, vid, wba, mox, sne, yaa, hun, ben, mhx, viv, bav, vun, tuf, gur, cmr, cgc, sld, aon, ttc, ura, wap, dyi, gwi, ann, kue, quw, cbc, mfy, mtj, mya, mti, mgo, ppk, tac, est, ngp, pkb, zpo, cap, zab, fuh, tbc, dos, mag, mcu, xmm, cmn, mil, mww, apr, big, cdf, gvf, mda, lad, cnt, ipi, bon, kki, mqb, gum, kab, ctd, cme, ong, taj, usp, tpz, moz, ina, kvn, quf, thv, mlp, hin, sps, eka, bmr, sdm, mop, ubu, bnj, lem, gog, kbr, ahk, enb, gej, mif, uzb, ixl, dtp, yid, mnb, mpg, bss, ccp, muy, kto, avu, tos, dww, car, qvm, neb, csk, yrb, amn, jun, imo, nmz, gbi, maa, snc, lip, jpn, nak, bkv, awb, iba, mqf, tvw, xsm, cym, cuk, guu, mxv, nan, coe, mgh, msm, fra, sue, amr, rom, gai, kcg, mur, ctg, nlc, nch, yss, gfk, bos, myy, mxq, kpv, dnw, lac, shn, taq, tna, thl, kdj, spa, jav, anv, atb, cbs, ken, yam, asg, spl, zaw, gah, tnn, alt, enq, sqi, mib, yad, zyp, lit, sur, 
ife, ktj, ifk, nsu, abt, hne, rro, zpc, mfi, gun, hil, run, qxh, lia, dts, lee, ltz, mzw, pis, epo, ptp, tzj, chz, nim, pes, tzt, ngu, mor, pao, wmw, dsh, bwq, sny, zaa, ber, yut, keo, faa, kxm, ndz, arq, not, cko, ceg, dgk, gqr, tlb, bxr, kaq, mnf, jmc, mar, muh, inb, knj, tha, prf, mon, nnw, nhu, mfh, bcw, bre, kmd, acn, quc, hig, pah, sri, bfo, ade, bgr, rmc, cjv, auy, amh, war, guq, bud, tby, tlh, hrx, pmf, kwj, awa, mee, vmy, cpa, heh, dwr, cni, gui, kje, sas, srm, wuu, buk, lgl, xav, kyf, lwo, mal, fai, far, lww, oci, blt, pau, hto, cbr, abi, mbc, mim, rej, sml, min, yby, nno, lfn, roo, kor, yva, toc, tnk, knv, bvz, nds, gna, nhx, nuj, kjh, urk, gub, amm, nho, huu, qvz, mva, ile, grc, bao, mfk, sil, cbk, sll, snn, mcq, mek, slv, ksr, qvs, kaz, kqy, bkl, bib, tur, yml, suk, kaa, huv, krl, bmh, kze, csb, ape, ppo, ttr, ndj, hub, tte, ess, zos, nvm

Supported via rule-based methods (based on unicode blocks and writing system scripts): xsr, ind, cmo, lif, ron, ojb, nod, mww, men, jun, rus, btd, jpn, hil, arb, pol, run, mai, alt, kyu, amh, taj, war, nld, pam, ljp, bud, lus, grt, mak, sun, akb, dzo, bku, urd, bru, tgl, pag, som, kbp, arz, pan, bts, vie, sas, gag, kxm, mjv, taq, fuf, ita, chr, bul, mya, jav, atb, blt, ceb, ccp, bcl, lao, ilo, mar, oss, hnn, btx, rej, ban, lis
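
For readers unfamiliar with script-based identification, here is a minimal sketch of the general idea, not the SIL API's actual implementation. It guesses the dominant writing system from Unicode character names and maps it to a language code. The names `dominant_script`, `rule_based_language`, and `SCRIPT_TO_LANG` are hypothetical, and the two mapping entries are illustrative assumptions only.

```python
import unicodedata
from collections import Counter
from typing import Optional

# Hypothetical, illustrative mapping from a script name to one language code.
# A real system would need a far larger and more careful table.
SCRIPT_TO_LANG = {"CHEROKEE": "chr", "LAO": "lao"}


def dominant_script(text: str) -> Optional[str]:
    """Return the most frequent script among the letters in `text`.

    The script is approximated by the first word of each character's
    Unicode name (e.g. "CHEROKEE LETTER TSA" gives "CHEROKEE").
    """
    counts: Counter = Counter()
    for ch in text:
        if ch.isalpha():
            name = unicodedata.name(ch, "")
            if name:
                counts[name.split(" ")[0]] += 1
    if not counts:
        return None
    return counts.most_common(1)[0][0]


def rule_based_language(text: str) -> Optional[str]:
    """Map the dominant script to a language code, or None if unknown."""
    return SCRIPT_TO_LANG.get(dominant_script(text))


if __name__ == "__main__":
    print(rule_based_language("ᏣᎳᎩ"))
    print(rule_based_language("hello"))
```

This only works when a script identifies a language unambiguously, which is presumably why most of the languages above are handled by text classification instead.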

Timoeller commented 4 years ago

Hey @dwhitena, thanks for this PR, looking forward to getting more languages into COVID-QA!

dwhitena commented 4 years ago

Regarding your questions @Timoeller:

Timoeller commented 4 years ago

Hey @dwhitena sorry for the delay.

Concerning your questions.

Regarding integrating "wash your hands" translations into our UI: it will be fast and easy to insert your poster elements CSV into our backend, display the "wash your hands" translation in the specific language, and link to your website. Is there already a CSV with more translations publicly available?

Timoeller commented 4 years ago

I tested the SIL language detection and realized that it slows down the requests and might be unstable for some languages in our database:

So I changed the code to query SIL language detection only when the question does not match any of the languages present in our DB. See #106.
Hope that is fine with you?
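
A minimal sketch of that fallback logic, for illustration only. It is not the code from #106: `detect_with_fallback`, the `fast_detect` / `remote_detect` callables, and the stub detectors in the demo are all hypothetical names, and the language codes are assumed to be ISO 639-3.

```python
from typing import Callable, Optional, Set

Detector = Callable[[str], Optional[str]]


def detect_with_fallback(
    question: str,
    fast_detect: Detector,
    remote_detect: Detector,
    db_languages: Set[str],
) -> Optional[str]:
    """Use the fast local detector first; call the slower remote
    detector only if the local guess is not a language in the DB."""
    lang = fast_detect(question)
    if lang in db_languages:
        return lang
    try:
        return remote_detect(question)
    except Exception:
        # Assumption: if the remote call fails, keep the local guess
        # (which may be None) rather than failing the whole request.
        return lang


if __name__ == "__main__":
    # Stub detectors standing in for cld2/3 and the SIL API.
    def local_stub(text: str) -> Optional[str]:
        return "eng" if text.isascii() else None

    def remote_stub(text: str) -> Optional[str]:
        return "chr"

    db = {"eng", "deu"}
    print(detect_with_fallback("How do I wash my hands?", local_stub, remote_stub, db))
    print(detect_with_fallback("ᏣᎳᎩ", local_stub, remote_stub, db))
```

With this shape, questions in languages already covered by the DB never pay for the remote call.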


Getting the changes into the actual UI will require an update. I will talk to our data engineer at the start of next week to get it done.

Timoeller commented 4 years ago

Closing this now, as the changes have been taken over by #106.

dwhitena commented 4 years ago

@Timoeller Thanks very much for the update here. I'm going to circle back around to the API this week with this feedback and make some updates. It's very helpful to get some usage to help us upgrade the performance.