Support for non-ASCII Characters in fluent_text Validator

khashashin commented 1 year ago

Hello,

I have been using the ckanext-fluent extension for multilingual input in my CKAN instance. I ran into an issue where non-ASCII characters (such as "ä", "ö", "ü") are stored in the database as Unicode escape sequences. This happens because json.dumps, as called in the fluent_text validator, escapes every non-ASCII character by default (ensure_ascii=True).

For instance, a text like:

Stromtarif Tarifanteil KEV Standardprodukt gemäss ElCom pro Kategorie

is being stored in the database as:

Stromtarif Tarifanteil KEV Standardprodukt gem\u00e4ss ElCom pro Kategorie

Currently, the relevant part of the code in the fluent_text validator looks like this:

data[key] = json.dumps(value)

and

data[key] = json.dumps(output)
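
To make the behaviour concrete, here is a minimal sketch using only the standard library. The dict shape and the "de" language key are illustrative assumptions, not copied from the extension's code:

```python
import json

# Hypothetical fluent-style value: language code -> text (assumed shape).
value = {"de": "Stromtarif Tarifanteil KEV Standardprodukt gemäss ElCom pro Kategorie"}

# Default call: ensure_ascii=True, so "ä" is written as the escape \u00e4.
stored = json.dumps(value)
print(stored)
# {"de": "Stromtarif Tarifanteil KEV Standardprodukt gem\u00e4ss ElCom pro Kategorie"}
```

The escaped form is still valid JSON and decodes back to the original text with json.loads; the difference is only in the text that ends up in the database.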

This issue affects not only how the data is stored but also search in CKAN, because Solr fails to match these escape sequences against the actual characters in search queries.
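
A rough way to see why matching fails is to compare a query term against the raw stored text. This is only an illustration with a plain substring check; it does not model how Solr tokenizes or how CKAN builds its index, and the helper name raw_text_contains is made up for this sketch:

```python
import json

def raw_text_contains(stored_json: str, term: str) -> bool:
    """Naive stand-in for a text match: look for the term in the raw stored string."""
    return term in stored_json

# Hypothetical fluent-style value (assumed shape).
value = {"de": "Standardprodukt gemäss ElCom"}

escaped = json.dumps(value)                       # contains gem\u00e4ss
literal = json.dumps(value, ensure_ascii=False)   # contains gemäss

print(raw_text_contains(escaped, "gemäss"))  # False
print(raw_text_contains(literal, "gemäss"))  # True
```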

To resolve this, I propose updating the above lines to:

data[key] = json.dumps(value, ensure_ascii=False)

and

data[key] = json.dumps(output, ensure_ascii=False)

This modification will ensure that non-ASCII characters are stored as they are, without being converted to their Unicode escape sequences, thus preserving the original characters and facilitating accurate search results.
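
As a quick check that the change is safe for existing readers of the data, the sketch below (again with an assumed dict shape and "de" key) shows that both encodings decode to the same Python value, so only the stored text differs:

```python
import json

# Hypothetical fluent-style value (assumed shape).
value = {"de": "Stromtarif Tarifanteil KEV Standardprodukt gemäss ElCom pro Kategorie"}

escaped = json.dumps(value)                       # current behaviour
literal = json.dumps(value, ensure_ascii=False)   # proposed behaviour

print(literal)
# {"de": "Stromtarif Tarifanteil KEV Standardprodukt gemäss ElCom pro Kategorie"}

# Both forms round-trip to the same data.
assert json.loads(escaped) == json.loads(literal) == value
```

Records already stored in escaped form would keep decoding correctly; they would only be rewritten with literal characters the next time they pass through the validator.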

I also noticed that other extensions use a validator called "unicode_safe" to handle non-ASCII characters gracefully. I tried using it, but the fluent_text validator does not seem to recognize it. It would therefore be helpful if fluent_text could integrate or recognize the "unicode_safe" validator so that non-ASCII characters are handled properly.

I look forward to hearing your thoughts on this and would greatly appreciate any guidance or support in this regard.

Thank you.

wardi commented 1 year ago

Sure, I'd accept a PR that makes these changes.

khashashin commented 6 months ago

Fixed in https://github.com/ckan/ckanext-fluent/pull/50

Support for non-ASCII Characters in fluent_text Validator #47