I have been using the ckanext-fluent extension to support multilingual input in my CKAN instance. While using it, I found that non-ASCII characters (such as "ä", "ö", "ü") are stored in the database as Unicode escape sequences. This happens because the json.dumps call in the fluent_text validator escapes all non-ASCII characters by default.
For instance, a text like:
Stromtarif Tarifanteil KEV Standardprodukt gemäss ElCom pro Kategorie
is being stored in the database as:
Stromtarif Tarifanteil KEV Standardprodukt gem\u00e4ss ElCom pro Kategorie
Currently, the relevant part of the code in the fluent_text validator looks like this:
data[key] = json.dumps(value)
and
data[key] = json.dumps(output)
This affects more than how the data is stored: it also breaks search in CKAN, because Solr fails to match the escaped sequences against the actual characters in search queries.
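To illustrate the mismatch, here is a minimal Python 3 sketch. It uses a plain substring check as a stand-in for the search index, which is a simplification (Solr's analysis chain is more involved), and the "de" language key and the variable names are assumptions made for the example only.

```python
import json

query = "gemäss"
title = "Stromtarif Tarifanteil KEV Standardprodukt gemäss ElCom pro Kategorie"

# What the validator currently stores: json.dumps escapes non-ASCII by default.
stored = json.dumps({"de": title})  # "de" is a hypothetical language key

# The raw stored string contains "gem\u00e4ss" as literal backslash text,
# so the query term does not occur in it.
found_in_stored = query in stored

# Only after decoding the JSON does the original character come back.
found_after_decoding = query in json.loads(stored)["de"]

print(found_in_stored, found_after_decoding)
```

Anything that indexes the raw stored string rather than the decoded value sees the escaped form.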
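To resolve this, I propose updating the above lines to:
data[key] = json.dumps(value, ensure_ascii=False)
and
data[key] = json.dumps(output, ensure_ascii=False)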
This modification will ensure that non-ASCII characters are stored as they are, without being converted to their Unicode escape sequences, thus preserving the original characters and facilitating accurate search results.
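As a quick check of the effect, the following Python 3 sketch compares the default json.dumps output with the output when ensure_ascii=False is passed, which is the standard-library switch that turns off escaping. The "de" key and the variable names are assumptions for illustration and are not taken from the ckanext-fluent source.

```python
import json

title = "Stromtarif Tarifanteil KEV Standardprodukt gemäss ElCom pro Kategorie"
value = {"de": title}  # hypothetical fluent-style language mapping

# Default behaviour: non-ASCII characters become \uXXXX escape sequences.
escaped = json.dumps(value)

# With ensure_ascii=False the characters are written out unchanged.
preserved = json.dumps(value, ensure_ascii=False)

print(escaped)
print(preserved)

# Both strings decode to the same data, so only the stored
# representation changes, not the value read back.
same_data = json.loads(escaped) == json.loads(preserved) == value
```

Because both forms decode to identical data, existing records stored in escaped form should remain readable after such a change; only newly written values would keep their original characters.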
Moreover, I noticed that other extensions use a validator called "unicode_safe" to handle non-ASCII characters gracefully. I tried using this validator but it seems that the fluent_text validator does not recognize it. Therefore, it would be greatly beneficial if the fluent_text validator could be updated to integrate or recognize the "unicode_safe" validator to allow for the proper handling of non-ASCII characters.
I look forward to hearing your thoughts on this and would greatly appreciate any guidance or support in this regard.
Thank you.