DDMAL / VIM

The Virtual Instrument Museum website repository
MIT License

Import language data to VIM database #27

Open dchiller opened 10 months ago

dchiller commented 10 months ago

VIM's language model requires language data from Wikidata to define the domain of VIM-supported languages. We need to create an import process for this language data from Wikidata to VIM.

dchiller commented 10 months ago

After a few hours of investigation, I've found it difficult to pin down exactly which languages Wikidata supports.

The most straightforward list of supported languages is in this Wikidata Help page. The table shown is the result of a SPARQL query that looks for Wikidata items with the P424 property (P424 is the property "Wikimedia language code") and does a little bit of filtering. The same language may have multiple subtypes if it can be written in multiple scripts (for example, if the language can be written in both a Latin and a non-Latin script).
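A minimal Python sketch of what such a query could look like from our side. It only asks for items that have P424 and does not reproduce the extra filtering done by the Help page's query, so the counts will not match that table. The function names and the User-Agent string are placeholders, not anything that exists in VIM.

```python
import json
import urllib.parse
import urllib.request

# Public Wikidata Query Service endpoint.
WDQS_ENDPOINT = "https://query.wikidata.org/sparql"

# Every item with a "Wikimedia language code" (P424) statement.
# Assumption: no extra filtering, unlike the query behind the Help page table.
P424_QUERY = """
SELECT ?item ?code WHERE {
  ?item wdt:P424 ?code .
}
"""


def parse_bindings(payload):
    """Turn a SPARQL JSON result into a sorted, de-duplicated list of (QID, code) pairs."""
    rows = set()
    for binding in payload["results"]["bindings"]:
        # Item values are full entity URIs; keep only the trailing QID.
        qid = binding["item"]["value"].rsplit("/", 1)[-1]
        rows.add((qid, binding["code"]["value"]))
    return sorted(rows)


def fetch_language_codes(user_agent="VIM-language-import-sketch/0.1"):
    """Send the query to the endpoint and return the parsed (QID, code) pairs."""
    params = urllib.parse.urlencode({"query": P424_QUERY, "format": "json"})
    request = urllib.request.Request(
        WDQS_ENDPOINT + "?" + params,
        headers={"User-Agent": user_agent},  # placeholder identifier
    )
    with urllib.request.urlopen(request, timeout=60) as response:
        return parse_bindings(json.load(response))
```

Keeping the parsing separate from the network call means the parsing can be checked against a saved response without hitting the endpoint.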

This resource directs you to a number of places where various levels of language support on MediaWiki (the software on which Wikidata runs) can be found. According to this document, there are a number of different locations where language data is stored. One such location is the Names.php file. This seems to be a step further along in the "adding a language" process and therefore one of the more suitable candidates for determining what is supported. Another good candidate would be languages added to the Universal Language Selector (this is the selector at the top of Wikidata that allows you to set a site language). These can be found here.

I did a quick survey to determine whether using the first table as our source of languages would unduly limit the languages available in VIM. On the one hand, it is the source most directly related to Wikidata, and it results from a query that would be easy to re-run whenever we wanted to update VIM's languages. On the other hand, my worry was that the languages added in Wikidata (and therefore available in the results of this query) might lag behind the languages that are actually supported.

Languages available in Names.php: 520

Languages in table on Wikidata Help page: 700 including sublabels (e.g. scripts, as explained above); 573 not including sublabels

Languages in Universal Language Selector: 758 (these also include sublabels)

I'm working on analyzing which languages are missing from the latter two options.
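The comparison itself is just set arithmetic once each source is reduced to a list of codes. A rough sketch, assuming the codes from the three sources have already been extracted into plain lists; the source names and the codes in the usage example are made up for illustration and are not the real data.

```python
def normalize(codes):
    """Lower-case and strip codes so that 'sr-Latn' and 'sr-latn ' compare equal."""
    return {code.strip().lower() for code in codes if code.strip()}


def coverage_gaps(sources):
    """For each named source, list the codes found in some other source but not in it."""
    cleaned = {name: normalize(codes) for name, codes in sources.items()}
    everything = set().union(*cleaned.values())
    return {name: sorted(everything - codes) for name, codes in cleaned.items()}


# Illustrative input only, not the real lists.
example = {
    "names_php": ["en", "fr", "de"],
    "help_table": ["en", "fr", "sr-Latn"],
    "uls": ["EN", "fr", "de", "sr-latn", "yue"],
}
gaps = coverage_gaps(example)
```

Lower-casing is a simplification: it assumes the sources only differ in case and whitespace, which may not hold if they use different codes for the same language.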

fujinaga commented 10 months ago

How about we use items with P424 property that have IETF codes?

dchiller commented 10 months ago

> How about we use items with P424 property that have IETF codes?

Unfortunately, there are many more languages on Wikidata with IETF codes than Wikidata actually supports (my query returned 7500). I suspect that most VIM activity would be in the (say) 100 most common languages, but it would be a shame if someone came along who knew a language that Wikidata supports and VIM doesn't.
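For comparing candidate queries like these, a small helper that builds the SPARQL from a list of required properties might be handy. This is only a sketch: `build_language_query` is a hypothetical name, and P305 as the "IETF language tag" property is my assumption and should be checked on Wikidata before relying on it.

```python
def build_language_query(properties):
    """Build a SPARQL query selecting items that carry every property in `properties`."""
    if not properties:
        raise ValueError("at least one property ID is required")
    # One output variable per property, e.g. P424 -> ?p424.
    variables = " ".join(f"?{prop.lower()}" for prop in properties)
    # One triple pattern per property; an item must match all of them.
    patterns = "\n".join(
        f"  ?item wdt:{prop} ?{prop.lower()} ." for prop in properties
    )
    return f"SELECT ?item {variables} WHERE {{\n{patterns}\n}}"


# Items with a Wikimedia language code only.
p424_only = build_language_query(["P424"])

# Items with both a Wikimedia language code and (assumed) an IETF language tag.
p424_and_ietf = build_language_query(["P424", "P305"])
```

Running each variant against the query service and counting the rows would show how much each extra requirement narrows the list.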

I'm putting this aside for now. I'll start with just two languages (English and French) and try to load all the instruments, then we'll come back to this.