Closed by alexwlchan 11 months ago
There are 640 languages in the picker on Wikimedia Commons. 😱
There are 576 languages that can be used in the API (some languages appear more than once in the WMC list).
I'm running a script to analyse the captions on Commons: I've looked at ~10% of the files so far, and there are 372 different languages in use.
So here's some back-of-the-napkin analysis.
I analysed the captions on the first ~30M files, which comes to ~4M captions.
I made a tally of the languages in use – there are captions in (at least) 439 languages, but the distribution is far from even. This graph shows the cumulative percentage of all captions you cover as you include more languages, most-used first:
Unsurprisingly, English is the biggest and has 64% of captions. Adding German gets you to 73%, French to 78%, Spanish to 81%, and so on. But it flattens out pretty quickly:
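For anyone who wants to reproduce the curve, here is a minimal sketch of the cumulative calculation. It assumes the tally is a `Counter` mapping language code to caption count. The function name and the sample counts are made up for illustration; the counts were chosen only so the first four values match the percentages quoted above.

```python
from collections import Counter
from itertools import accumulate


def cumulative_coverage(tally):
    """Return [(language, cumulative percent of captions)], most-used first."""
    total = sum(tally.values())
    ranked = tally.most_common()
    running = accumulate(count for _, count in ranked)
    return [(lang, 100 * seen / total) for (lang, _), seen in zip(ranked, running)]


# Illustrative counts only, not the real Commons data.
tally = Counter({"en": 64, "de": 9, "fr": 5, "es": 3})
# A long tail of placeholder languages with one caption each.
tally.update({f"lang{i}": 1 for i in range(19)})

coverage = cumulative_coverage(tally)
for lang, percent in coverage[:4]:
    print(f"{lang}: {percent:.0f}%")
```

Plotting the second element of each pair against its position in the list gives the shape described here: a steep start, then a long flat tail.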
And these numbers are broadly stable – I originally calculated them for the first ~1.5M captions, and they didn't change much in the next 2.5M.
Based on these numbers, I think a sensible V1 for languages would be a simple dropdown picker with the top 30 or so languages. That's fairly quick and easy to build from what we already have.
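As a rough sketch of how the dropdown's contents could be derived from the same tally, the snippet below picks the top N language codes and reports what share of captions they cover. `pick_dropdown_languages` is a hypothetical helper, and the counts in the example are invented.

```python
from collections import Counter


def pick_dropdown_languages(tally, n=30):
    """Return the n most-used language codes and the fraction of captions they cover."""
    top = tally.most_common(n)
    covered = sum(count for _, count in top)
    return [lang for lang, _ in top], covered / sum(tally.values())


# Invented counts, just to show the shape of the result.
example = Counter({"en": 6, "de": 3, "fr": 1})
languages, share = pick_dropdown_languages(example, n=2)
print(languages, share)
```

Keeping N as a parameter makes it cheap to check how much coverage changes between, say, 20 and 40 languages before committing to a number.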
l o l
Never trust a software developer who says something will be easy. This works, kinda, but it's a crappy UI because the
Here's a SPARQL query to find all the Wikidata entities with a language code:
SELECT ?item ?value
WHERE { ?item wdt:P424 ?value }
LIMIT 5000
And this page also looks useful: https://www.wikidata.org/wiki/Help:Wikimedia_language_codes/lists/all
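The SPARQL query above could be run from Python along these lines. This is a sketch, not tested against the live service: it assumes the public Wikidata Query Service endpoint and its standard SPARQL JSON results format, and the User-Agent string and function names are placeholders.

```python
import json
import urllib.parse
import urllib.request

WDQS_ENDPOINT = "https://query.wikidata.org/sparql"

QUERY = """
SELECT ?item ?value
WHERE { ?item wdt:P424 ?value }
LIMIT 5000
"""


def build_request(query):
    """Build a GET request asking the query service for JSON results."""
    params = urllib.parse.urlencode({"query": query, "format": "json"})
    return urllib.request.Request(
        WDQS_ENDPOINT + "?" + params,
        headers={"User-Agent": "language-codes-sketch/0.1 (placeholder)"},
    )


def parse_language_codes(payload):
    """Map Wikimedia language code -> Wikidata entity ID (e.g. 'en' -> 'Q1860').

    If several entities share a code, the last one seen wins.
    """
    codes = {}
    for row in payload["results"]["bindings"]:
        entity_id = row["item"]["value"].rsplit("/", 1)[-1]
        codes[row["value"]["value"]] = entity_id
    return codes


def fetch_language_codes():
    """Run the query over the network and return the parsed mapping."""
    with urllib.request.urlopen(build_request(QUERY)) as resp:
        return parse_language_codes(json.load(resp))
```

Calling `fetch_language_codes()` is the only part that touches the network, so the parsing can be checked separately against a saved response.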
So if you use the file caption picker on commons.wikimedia.org, it gives you a popout component with hundreds of languages in a scrolling list. That's a lot!
We use the wbsetlabel API for setting file captions; here's a full list of languages it supports:
There's a Wikimedia language code property here, which we could use to look these up: https://www.wikidata.org/wiki/Property:P424
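For context, here is a sketch of the parameters a wbsetlabel call for a caption might carry. It only builds the dictionary and sends nothing: a real call needs an authenticated POST with a valid CSRF token. The helper name, the example M-id and the token value are placeholders.

```python
COMMONS_API = "https://commons.wikimedia.org/w/api.php"


def caption_params(media_id, language, caption, csrf_token):
    """Build the POST parameters for setting one caption via wbsetlabel."""
    return {
        "action": "wbsetlabel",
        "format": "json",
        "id": media_id,        # MediaInfo ID of the file, e.g. "M12345"
        "language": language,  # must be a code the API accepts
        "value": caption,
        "token": csrf_token,   # obtained separately from an authenticated session
    }


params = caption_params("M12345", "de", "Ein Beispiel", "PLACEHOLDER_TOKEN")
print(params["action"], params["language"])
```

The `language` value is what a picker would ultimately supply, so whatever list the UI offers needs to be a subset of the codes the API accepts.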