clarin-eric / component-registry-front-end

Test prototype for new CMDI Component Registry frontend
https://trac.clarin.eu/wiki/ComponentRegistryAndEditor/ReactFrontEnd
GNU General Public License v3.0

Encoding issues for external vocab import #108

Closed by twagoo 7 years ago

twagoo commented 7 years ago

Thomas reports the following in the results of the test plan:

External vocabulary for ISO 639-3 can’t be set: “All items in a vocabulary should have a unique value. Please remove these duplicate values and try again: Arh?,Bar?,Pankarar?”

The cause of this is that a few language names end up clashing due to encoding issues somewhere in the pipeline of importing the vocabulary from CLAVAS. Needs to be investigated.
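As an illustration of how such a clash can arise (a minimal sketch, not taken from the actual import pipeline; the two names below are hypothetical stand-ins): pushing names through a charset that cannot represent the accented characters replaces them with '?', so names that differ only in those characters collapse to the same string and trip the uniqueness check.

```java
import java.nio.charset.StandardCharsets;

public class EncodingClashDemo {
    public static void main(String[] args) {
        // Hypothetical stand-ins for two distinct language names that differ
        // only in accented characters.
        String a = "Pankararé";
        String b = "Pankararú";

        // Encoding to a charset that cannot represent the accented characters
        // replaces them with '?' (the charset's default replacement byte).
        String aMangled = new String(a.getBytes(StandardCharsets.US_ASCII), StandardCharsets.US_ASCII);
        String bMangled = new String(b.getBytes(StandardCharsets.US_ASCII), StandardCharsets.US_ASCII);

        System.out.println(aMangled);                  // Pankarar?
        System.out.println(bMangled);                  // Pankarar?
        System.out.println(aMangled.equals(bMangled)); // true -> flagged as a duplicate value
    }
}
```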

menzowindhouwer commented 7 years ago

I'm not exactly sure if it's related, but an old version of the CLAVAS vocabulary didn't have unique prefLabels for all concepts (although each one had a unique code). A new version of the ISO 639-3 CLAVAS vocabulary, in which each code has a unique prefLabel, has been created and will be imported into the OpenSKOS 2 test server.

twagoo commented 7 years ago

Thanks @menzowindhouwer for that info, I will consider this in my investigation!

olhsha commented 7 years ago

I have just updated the CLAVAS on http://145.100.58.79/clavas/public/api/ with the corrected import file for languages.

twagoo commented 7 years ago

Thanks. The issue still persists: locally and on alpha it does not occur, but on the dev-sp host it does. All are configured to use the testing instance of CLAVAS. Next step: investigate whether the issue occurs in the communication between the back end and CLAVAS or in the communication between the front end and the back end.

twagoo commented 7 years ago

The issue seems to be in the communication between the back end and CLAVAS. The different hosts show different responses:

localhost/alpha (request):

[{"prefLabel@en":"Abkhazian","uri":"http://cdb.iso.org/lg/CDB-00138467-001"},{"prefLabel@en":"Adyghe","uri":"http://cdb.iso.org/lg/CDB-00133873-001"},{"prefLabel@en":"Saint Lucian Creole French","uri":"http://cdb.iso.org/lg/CDB-00133907-001"},{"prefLabel@en":"Adamorobe Sign Language","uri":"http://cdb.iso.org/lg/CDB-00133878-001"},{"prefLabel@en":"Argentine Sign Language","uri":"http://cdb.iso.org/lg/CDB-00133965-001"},{"prefLabel@en":"Algerian Saharan Arabic","uri":"http://cdb.iso.org/lg/CDB-00133758-001"},{"prefLabel@en":"Ta'izzi-Adeni Arabic","uri":"http://cdb.iso.org/lg/CDB-00133893-001"},{"prefLabel@en":"Mesopotamian Arabic","uri":"http://cdb.iso.org/lg/CDB-00133905-001"},{"prefLabel@en":"Arvanitika Albanian","uri":"http://cdb.iso.org/lg/CDB-00133781-001"},{"prefLabel@en":"Arbëreshë Albanian","uri":"http://cdb.iso.org/lg/CDB-00133767-001"},
...

dev-sp (request):

[{"prefLabel@en":"Abkhazian","uri":"http://cdb.iso.org/lg/CDB-00138467-001"},{"prefLabel@en":"Adyghe","uri":"http://cdb.iso.org/lg/CDB-00133873-001"},{"prefLabel@en":"Saint Lucian Creole French","uri":"http://cdb.iso.org/lg/CDB-00133907-001"},{"prefLabel@en":"Adamorobe Sign Language","uri":"http://cdb.iso.org/lg/CDB-00133878-001"},{"prefLabel@en":"Argentine Sign Language","uri":"http://cdb.iso.org/lg/CDB-00133965-001"},{"prefLabel@en":"Algerian Saharan Arabic","uri":"http://cdb.iso.org/lg/CDB-00133758-001"},{"prefLabel@en":"Ta'izzi-Adeni Arabic","uri":"http://cdb.iso.org/lg/CDB-00133893-001"},{"prefLabel@en":"Mesopotamian Arabic","uri":"http://cdb.iso.org/lg/CDB-00133905-001"},{"prefLabel@en":"Arvanitika Albanian","uri":"http://cdb.iso.org/lg/CDB-00133781-001"},{"prefLabel@en":"Arb?resh? Albanian","uri":"http://cdb.iso.org/lg/CDB-00133767-001"},
...

Compare the last entries shown: "Arbëreshë Albanian" (sic, display issue) vs "Arb?resh? Albanian".

twagoo commented 7 years ago

https://github.com/clarin-eric/component-registry-rest/commit/fb3be688175f9f887006e247dec4f3cf200fc312 adds better Content-Type headers to the responses of the vocabulary service. This might solve the problem, but we will have to check on dev-sp, as so far I cannot reproduce it anywhere else :(
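For reference, a hedged sketch of what such a header change can look like in a JAX-RS resource (the class and path names here are made up and this is not the content of the linked commit): declaring the charset as part of the Content-Type stops clients from falling back to their own, possibly non-UTF-8, default when decoding the body.

```java
import javax.ws.rs.GET;
import javax.ws.rs.Path;
import javax.ws.rs.Produces;
import javax.ws.rs.core.MediaType;

// Hypothetical resource; the actual service classes in component-registry-rest differ.
@Path("/vocabulary/items")
public class VocabularyItemsResource {

    // "application/json;charset=UTF-8" instead of bare "application/json"
    @GET
    @Produces(MediaType.APPLICATION_JSON + ";charset=UTF-8")
    public String getItems() {
        return "[{\"prefLabel@en\":\"Arbëreshë Albanian\"}]";
    }
}
```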

olhsha commented 7 years ago

I have just tried the request for that Albanian dialect (of the Italian Albanians):

http://145.100.58.79/clavas/public/api/find-concepts?q=prefLabel:%22Arb%C3%ABresh%C3%AB%20Albanian%22&format=json

The response (FF) contains yet another, escaped, representation of ë: prefLabel@en":"Arb\u00ebresh\u00eb Albanian". Do I need to decode the \u00xx sequences for JSON responses on our (CLAVAS, CCR) back end, or add a decoding option? Note that format=rdf (the default) looks good.

twagoo commented 7 years ago

@olhsha

Do I need to decode the \u00xx sequences for JSON responses on our (CLAVAS, CCR) back end, or add a decoding option? Note that format=rdf (the default) looks good.

Providing the actual character in UTF-8 would work, but it looks like my JSON parser is capable of decoding these sequences as well, because it shows the right characters in the right places (except on the beta at dev-sp.clarin.eu, which substitutes question marks for them).
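For comparison, a minimal check (assuming Jackson as a stand-in for whichever parser the back end actually uses) that \u00eb-style escapes are plain JSON string escapes, which any conforming parser turns back into the real character without an extra decoding step:

```java
import com.fasterxml.jackson.databind.JsonNode;
import com.fasterxml.jackson.databind.ObjectMapper;

public class UnicodeEscapeDemo {
    public static void main(String[] args) throws Exception {
        // Escaped form as returned by the find-concepts call above.
        String json = "{\"prefLabel@en\":\"Arb\\u00ebresh\\u00eb Albanian\"}";
        JsonNode node = new ObjectMapper().readTree(json);
        // Prints "Arbëreshë Albanian" (provided stdout itself handles UTF-8).
        System.out.println(node.get("prefLabel@en").asText());
    }
}
```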

twagoo commented 7 years ago

it shows the right characters in the right places

You can now see this better since the 'proxy service' (example) sends the correct Content-Type header. This will be deployed to dev-sp next Friday, so we will know then whether it helps with this issue.

twagoo commented 7 years ago

Recent changes did not solve it. Some things to try out here: http://stackoverflow.com/a/138950
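One of the points that kind of answer covers is never relying on the platform default charset when reading a remote response. A hedged sketch of an explicit-charset read of the CLAVAS response (plain java.net/java.io for illustration, not the client code actually used by the back end):

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.URL;
import java.nio.charset.StandardCharsets;

public class ExplicitCharsetFetch {
    public static void main(String[] args) throws Exception {
        // Same CLAVAS query as mentioned earlier in this thread.
        URL url = new URL("http://145.100.58.79/clavas/public/api/find-concepts"
                + "?q=prefLabel:%22Arb%C3%ABresh%C3%AB%20Albanian%22&format=json");
        // Pass the charset explicitly; new InputStreamReader(in) without one
        // silently uses the container's platform default encoding.
        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(url.openStream(), StandardCharsets.UTF_8))) {
            reader.lines().forEach(System.out::println);
        }
    }
}
```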

twagoo commented 7 years ago

Confirmed: running a local container based on the beta image has the same issue.
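Since the '?' substitution follows the container image rather than the host, one quick diagnostic (a suggestion for illustration, not a step taken in this thread) is to log the JVM's default charset and encoding properties inside each container and compare them.

```java
import java.nio.charset.Charset;

public class CharsetReport {
    public static void main(String[] args) {
        // Compare these values between the beta/dev-sp container and a host
        // where the problem does not occur.
        System.out.println("defaultCharset   = " + Charset.defaultCharset());
        System.out.println("file.encoding    = " + System.getProperty("file.encoding"));
        System.out.println("sun.jnu.encoding = " + System.getProperty("sun.jnu.encoding"));
    }
}
```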

twagoo commented 7 years ago

Fixed in back end in https://github.com/clarin-eric/component-registry-rest/commit/c598d99e7717b98171c575fab03ee018091a7b08