SSHOC / vocabularies


Vocabulary languages misses a lot of entries #5

Closed dpancic closed 1 year ago

dpancic commented 4 years ago

In GitLab by @KlausIllmayer on Jun 15, 2020, 16:04

Currently, the vocabulary iso-639-3 (languages) only has some items available (https://sshoc-marketplace-api.acdh-dev.oeaw.ac.at/api/vocabularies/iso-639-3). There should be a lot more of them; in particular, we currently need "mul" (multiple) and "spa" (Spanish), as they are used by the curated items. Can we extend the vocabulary with at least these two terms? In the long term, we should have a more exhaustive list, e.g. https://en.wikipedia.org/wiki/Wikipedia:WikiProject_Languages/List_of_ISO_639-3_language_codes_(2019)

dpancic commented 3 years ago

In GitLab by @KlausIllmayer on Feb 3, 2021, 10:17

Update on this issue: we now have iso-639-3-v2, which seems to be comprehensive. The only problem with the current version of iso-639-3-v2 is that the labels are in German and not in English. It also looks to me like we have an encoding problem here (not sure whether this is a problem with the ingest of the vocabulary or with the API endpoint). Have a look here: https://sshoc-marketplace-api.acdh-dev.oeaw.ac.at/api/vocabularies/iso-639-3-v2?perpage=100&page=1 You will see that it has German labels like "Sprachcodes" or "Arvanitisch", and that for entry 14 (code: "aaq") the label "Östliche Abenaki" does not show up correctly, due to encoding I guess.
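To illustrate what such an encoding problem can look like, here is a minimal Python sketch. It assumes one common cause, namely UTF-8 bytes being decoded as Windows-1252 somewhere between ingest and API response. That cause is only a guess and is not confirmed for the backend; the function names are made up for this example.

```python
def garble(text: str, wrong_encoding: str = "cp1252") -> str:
    """Simulate mojibake: encode as UTF-8, then decode with the wrong codec."""
    return text.encode("utf-8").decode(wrong_encoding)


def repair(garbled: str, wrong_encoding: str = "cp1252") -> str:
    """Reverse the simulated mistake (only works if the guess about the codec is right)."""
    return garbled.encode(wrong_encoding).decode("utf-8")


original = "Östliche Abenaki"   # label of entry "aaq" mentioned above
broken = garble(original)       # the leading "Ö" turns into two unrelated characters
restored = repair(broken)       # round-trips back to the original label
```

If the label served by the API matches `broken`, that would point to a decoding step using the wrong character set; if it looks different, the cause lies elsewhere.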

@vronk Is there a chance to get the iso-639-3-v2 in an English version? @tparkola Can you identify the encoding issue within the backend? Both: Should we delete the old iso-639-3 vocabulary and if so, should we then rename the current iso-639-3-v2 to iso-639-3?

Also bringing @vronk into this discussion.

dpancic commented 3 years ago

In GitLab by @vronk on Feb 3, 2021, 11:13

@KlausIllmayer The issue is on the backend. It does not support multilingual labels on the vocabularies or, if I am not mistaken, in any model. So when it loads the vocabulary, it takes the first label it finds.

The vocabulary iso-639-3-v2 contains labels in 3 languages, German, English and French.

A workaround would be to filter out non-English literals from the vocabulary.

https://gitlab.gwdg.de/sshoc/sshoc-marketplace-backend/-/raw/b185776ea919d3182871a919d88e09000832467e/src/main/resources/initial-data/vocabularies/iso-639-3-v2.ttl
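The filtering workaround mentioned above could be sketched roughly as follows. This is a simplified illustration in plain Python, not the backend's code: triples are modelled as tuples, the `Literal` type, the `lang:` identifiers and the SKOS-style predicate names are assumptions made for the example, and the sample contains only two of the three languages.

```python
from typing import List, NamedTuple, Optional, Tuple


class Literal(NamedTuple):
    """A literal value with an optional language tag (hypothetical helper type)."""
    value: str
    lang: Optional[str] = None


Triple = Tuple[str, str, object]


def keep_english_literals(triples: List[Triple]) -> List[Triple]:
    """Drop language-tagged literals whose primary language is not English.

    Untagged literals and non-literal objects are kept unchanged.
    """
    kept = []
    for subject, predicate, obj in triples:
        if isinstance(obj, Literal) and obj.lang is not None:
            primary = obj.lang.lower().split("-")[0]  # "en-GB" -> "en"
            if primary != "en":
                continue
        kept.append((subject, predicate, obj))
    return kept


# Illustrative sample data; predicate names and identifiers are assumed.
sample = [
    ("lang:aaq", "skos:prefLabel", Literal("Östliche Abenaki", "de")),
    ("lang:aaq", "skos:prefLabel", Literal("Eastern Abnaki", "en")),
    ("lang:aaq", "skos:notation", Literal("aaq")),
    ("lang:aaq", "skos:inScheme", "lang:scheme"),
]
english_only = keep_english_literals(sample)
```

With a vocabulary filtered like this, a loader that takes the first label it finds would end up with the English one, at the cost of losing the German and French labels.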

dpancic commented 3 years ago

In GitLab by @tparkola on Jun 24, 2021, 14:51

Now (after tasks https://gitlab.gwdg.de/sshoc/vocabularies/-/issues/21 and https://gitlab.gwdg.de/sshoc/vocabularies/-/issues/19) labels in all languages are loaded, and English labels are assigned to the fields of vocabularies and concepts. The vocabulary https://gitlab.gwdg.de/sshoc/vocabularies/-/blob/master/iso-639-3/iso-639-3.ttl can be loaded via the API, so I think this task can be closed.
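The difference between the earlier behaviour (taking the first label found) and the one described here (keeping all languages and using the English label) can be sketched as below. This is an illustration only, not the backend's implementation: the function names are hypothetical, and the fallback to the first available label when no English one exists is an assumption.

```python
from typing import Dict, Optional


def first_label(labels: Dict[str, str]) -> Optional[str]:
    """Earlier behaviour as described in this thread: take whichever label comes first."""
    return next(iter(labels.values()), None)


def preferred_label(labels: Dict[str, str], preferred: str = "en") -> Optional[str]:
    """Use the label in the preferred language; otherwise fall back (assumed) to the first one."""
    if preferred in labels:
        return labels[preferred]
    return first_label(labels)


# All languages stay loaded; only the displayed label is selected.
labels = {"de": "Östliche Abenaki", "en": "Eastern Abnaki"}
old_result = first_label(labels)
new_result = preferred_label(labels)
```

Here `old_result` is the German label simply because it comes first, while `new_result` is the English one regardless of order.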