Support the full range of languages in the languages picker

alexwlchan commented 11 months ago

So if you use the file caption picker on commons.wikimedia.org, it gives you a popout component with hundreds of languages in a scrolling list. That's a lot!

We use the wbsetlabel API for setting file captions; here's the full list of languages it supports:

aa, ab, abs, ace, acm, ady, ady-cyrl, aeb, aeb-arab, aeb-latn, af, agq, ak, aln, als, alt, am, ami, an, ang, ann, anp, ar, arc, arn, arq, ary, arz, as, ase, ast, atj, av, avk, awa, ay, az, azb, ba, bag, ban, ban-bali, bar, bas, bat-smg, bax, bbc, bbc-latn, bbj, bcc, bci, bcl, bdr, be, be-tarask, be-x-old, bew, bfd, bg, bgn, bh, bho, bi, bjn, bkc, bkh, bkm, blk, bm, bn, bo, bpy, bqi, bqz, br, brh, bs, btm, bto, bug, bxr, byv, ca, cak, cal, cbk-zam, cdo, ce, ceb, ch, cho, chr, chy, ckb, cnh, co, cps, cpx, cpx-hans, cpx-hant, cpx-latn, cr, crh, crh-cyrl, crh-latn, crh-ro, cs, csb, cu, cv, cy, da, dag, de, de-at, de-ch, de-formal, dga, din, diq, dsb, dtp, dty, dua, dv, dz, ee, egl, el, eml, en, en-ca, en-gb, en-us, eo, es, es-419, es-formal, et, eto, etu, eu, ewo, ext, fa, fat, ff, fi, fit, fiu-vro, fj, fkv, fmp, fo, fon, fr, frc, frp, frr, fur, fy, ga, gaa, gag, gan, gan-hans, gan-hant, gcr, gd, gl, gld, glk, gn, gom, gom-deva, gom-latn, gor, got, gpe, grc, gsw, gu, guc, gur, guw, gv, gya, ha, hak, haw, he, hi, hif, hif-latn, hil, hno, ho, hr, hrx, hsb, hsn, ht, hu, hu-formal, hy, hyw, hz, ia, id, ie, ig, igl, ii, ik, ike-cans, ike-latn, ilo, inh, io, is, isu, it, iu, ja, jam, jbo, jut, jv, ka, kaa, kab, kai, kbd, kbd-cyrl, kbp, kcg, kea, ker, kg, khw, ki, kiu, kj, kjh, kjp, kk, kk-arab, kk-cn, kk-cyrl, kk-kz, kk-latn, kk-tr, kl, km, kn, ko, ko-kp, koi, kr, krc, kri, krj, krl, ks, ks-arab, ks-deva, ksf, ksh, ksw, ku, ku-arab, ku-latn, kum, kus, kv, kw, ky, la, lad, lb, lbe, lem, lez, lfn, lg, li, lij, liv, lki, lld, lmo, ln, lns, lo, loz, lrc, lt, ltg, lus, luz, lv, lzh, lzz, mad, mag, mai, map-bms, mcn, mcp, mdf, mg, mh, mhr, mi, min, mk, ml, mn, mnc, mnc-latn, mnc-mong, mni, mnw, mo, mos, mr, mrh, mrj, ms, ms-arab, mt, mua, mus, mwl, my, myv, mzn, na, nah, nan, nan-hani, nap, nb, nds, nds-nl, ne, new, ng, nge, nia, niu, nl, nl-informal, nla, nmg, nmz, nn, nnh, nnz, no, nod, nog, nov, nqo, nrm, nso, nv, ny, nyn, nys, oc, ojb, olo, om, or, os, osa-latn, ota, pa, pag, 
pam, pap, pap-aw, pcd, pcm, pdc, pdt, pfl, pi, pih, pl, pms, pnb, pnt, prg, ps, pt, pt-br, pwn, qu, quc, qug, rgn, rif, rki, rm, rmc, rmf, rmy, rn, ro, roa-rup, roa-tara, rsk, ru, rue, rup, ruq, ruq-cyrl, ruq-latn, rw, rwr, ryu, sa, sah, sat, sc, scn, sco, sd, sdc, sdh, se, se-fi, se-no, se-se, sei, ses, sg, sgs, sh, sh-cyrl, sh-latn, shi, shi-latn, shi-tfng, shn, shy, shy-latn, si, simple, sjd, sje, sju, sk, skr, skr-arab, sl, sli, sm, sma, smj, smn, sms, sn, so, sq, sr, sr-ec, sr-el, srn, sro, srq, ss, st, stq, sty, su, sv, sw, syl, szl, szy, ta, tay, tcy, tdd, te, tet, tg, tg-cyrl, tg-latn, th, ti, tk, tl, tly, tly-cyrl, tn, to, tok, tpi, tpv, tr, tru, trv, ts, tt, tt-cyrl, tt-latn, tum, tvu, tw, ty, tyv, tzm, udm, ug, ug-arab, ug-latn, uk, ur, uz, uz-cyrl, uz-latn, ve, vec, vep, vi, vls, vmf, vmw, vo, vot, vro, vut, wa, wal, war, wes, wls, wo, wuu, wuu-hans, wuu-hant, wya, xal, xh, xmf, xsy, yas, yat, yav, ybb, yi, yo, yrl, yue, yue-hans, yue-hant, za, zea, zgh, zh, zh-classical, zh-cn, zh-hans, zh-hant, zh-hk, zh-min-nan, zh-mo, zh-my, zh-sg, zh-tw, zh-yue, zu
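
Rather than hard-coding that list, it could probably be read from the API itself. This is a rough sketch, assuming the MediaWiki `paraminfo` module describes `wbsetlabel` with a `language` parameter whose `type` is the list of allowed codes. The response shape and the names `PARAMINFO_PARAMS` and `supported_languages` are my assumptions for illustration, not something already in the codebase:

```python
# Query parameters for the paraminfo request (assumed request shape).
PARAMINFO_PARAMS = {
    "action": "paraminfo",
    "modules": "wbsetlabel",
    "format": "json",
    "formatversion": "2",
}


def supported_languages(paraminfo_response):
    """Pull the allowed `language` values out of a paraminfo response.

    Assumes the response looks like:
    {"paraminfo": {"modules": [{"name": ..., "parameters": [...]}]}}
    """
    for module in paraminfo_response["paraminfo"]["modules"]:
        if module["name"] != "wbsetlabel":
            continue
        for param in module["parameters"]:
            if param["name"] == "language":
                return list(param["type"])
    raise ValueError("no language parameter found for wbsetlabel")
```

You'd send `PARAMINFO_PARAMS` to the Commons API endpoint with whatever HTTP client is already in use, then pass the decoded JSON to `supported_languages`.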

There's a Wikimedia language code property here, which we could use to look these up: https://www.wikidata.org/wiki/Property:P424
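
As a sketch of what that lookup might look like: build a SPARQL query that matches items by their P424 value and asks for the English label, then flatten the JSON results into a `code -> name` mapping. The function names are hypothetical, and I'm assuming the query would be run against the Wikidata Query Service and come back in the standard SPARQL JSON results format:

```python
def build_language_name_query(codes):
    """Build a SPARQL query mapping Wikimedia language codes (P424) to labels."""
    values = " ".join(f'"{code}"' for code in codes)
    return (
        "SELECT ?code ?itemLabel WHERE {\n"
        f"  VALUES ?code {{ {values} }}\n"
        "  ?item wdt:P424 ?code .\n"
        '  SERVICE wikibase:label { bd:serviceParam wikibase:language "en". }\n'
        "}"
    )


def parse_language_names(sparql_json):
    """Turn SPARQL JSON results into {code: label}, keeping the first match."""
    names = {}
    for row in sparql_json["results"]["bindings"]:
        code = row["code"]["value"]
        label = row["itemLabel"]["value"]
        # A code may match more than one item; keep whichever comes first.
        names.setdefault(code, label)
    return names
```

Some codes may not have a matching item at all, so the picker would still need a fallback (e.g. showing the bare code).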

alexwlchan commented 11 months ago

There are 640 languages in the picker on Wikimedia Commons. 😱

There are 576 languages that can be used in the API (some languages appear more than once in the Wikimedia Commons picker list).

I'm running a script to analyse the captions on Commons: I've looked at ~10% of the files so far, and there are 372 different languages in use.

alexwlchan commented 11 months ago

So here's some back-of-the-napkin analysis.

I analysed the captions on the first ~30M files, which comes to ~4M captions.

I made a tally of the languages in use – there are captions in (at least) 439 languages, but the distribution is far from even. This graph shows the cumulative percentage of captions covered, against the number of languages you include:

Unsurprisingly, English is the biggest and has 64% of captions. Adding German gets you to 73%, French to 78%, Spanish to 81%, and so on. But it flattens out pretty quickly:

The top 10 languages cover 89.9% of captions
The top 20 languages cover 94.4% of captions
The top 30 languages cover 96.6% of captions
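
For reference, the coverage figures come from a cumulative tally along these lines. This is a minimal sketch rather than the actual analysis script: the function names are made up for illustration, and the counts in the usage example are toy data, not the real Commons numbers.

```python
from collections import Counter


def cumulative_coverage(language_counts):
    """Return [(code, cumulative % of captions)], most-used language first."""
    total = sum(language_counts.values())
    running = 0
    coverage = []
    for code, count in language_counts.most_common():
        running += count
        coverage.append((code, 100 * running / total))
    return coverage


def top_languages(language_counts, n):
    """Return the codes of the n most-used caption languages."""
    return [code for code, _ in language_counts.most_common(n)]


# Toy data for illustration only.
tally = Counter({"en": 5, "de": 3, "fr": 2})
print(cumulative_coverage(tally))
# [('en', 50.0), ('de', 80.0), ('fr', 100.0)]
```

`top_languages(tally, 30)` would then give the list of codes to offer in a shorter picker.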

And these numbers are broadly stable – I originally calculated them for the first ~1.5M captions, and they didn't change much in the next 2.5M.

Based on these numbers, I think a sensible V1 for languages would be a simple dropdown picker with the top 30 or so languages. That's fairly quick and easy to build from what we already have.

alexwlchan commented 11 months ago

l o l

Never trust a software developer who says something will be easy. This works, kinda, but it's a crappy UI because the

Flickr-Foundation / flickypedia

Support the full range of languages in the languages picker #235