denshoproject / ddr-cmdln

Command-line tools for automating the Densho Digital Repository's various processes.

Unicode support in all ddr tools #101

Open gjost opened 5 years ago

gjost commented 5 years ago

Be able to ingest and index Unicode text, in particular Japanese-language text.

gjost commented 5 years ago

We can definitely work on this now; Python 3's `str` type is Unicode-only, so we'll have to solve it as part of that process.

We need to think about how to handle non-Unicode text in two situations:

In the UI I'd like for forms to identify fields with non-Unicode text as errors, and display the offending code along with context so the user can fix it. This UI would apply to new text entered in forms as well. Display templates should prominently flag bad text and invite the user to fix it.
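As a rough sketch of what "display the offending code along with context" could look like, the helper below reports where a byte string stops being valid and what surrounds that point. It assumes "non-Unicode text" means bytes that are not valid UTF-8. The name `describe_bad_bytes` and the shape of the returned dict are hypothetical and not part of ddr-cmdln.

```python
def describe_bad_bytes(raw, context=20):
    """Return details about the first invalid UTF-8 sequence, or None if raw is clean.

    Hypothetical helper: assumes stored text is expected to be UTF-8.
    """
    try:
        raw.decode("utf-8")
    except UnicodeDecodeError as err:
        # err.start and err.end are byte offsets of the offending sequence.
        before = raw[max(0, err.start - context):err.start]
        after = raw[err.end:err.end + context]
        return {
            "offset": err.start,
            "bad_bytes": raw[err.start:err.end].hex(),
            "reason": err.reason,
            # The context window may cut a multi-byte character in half,
            # so decode it leniently for display purposes only.
            "before": before.decode("utf-8", errors="replace"),
            "after": after.decode("utf-8", errors="replace"),
        }
    return None
```

A form validator could call this on the submitted bytes and show `before`, the hex of `bad_bytes`, and `after` next to the field so the user can see exactly what needs fixing.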

ddrimport should flag non-Unicode text as errors that must be fixed before records can be imported.

A while back I put together fileio.read_text and .write_text functions that were designed to work with Unicode. The plan was to have all code that reads or writes text use those two functions, but most code doesn't yet. My idea was that fileio.read_text would have several modes. In strict mode it would simply raise an exception for bad text. In permissive mode it would return text with bad chars marked, along with the original text. This would allow higher-level code to display the raw code if necessary.
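The two modes described above might look something like the following. This is only a sketch, not the existing `fileio.read_text`: the function name `read_text_sketch`, the `strict` parameter, the assumption that files are UTF-8, and the choice to mark bad characters with U+FFFD are all mine.

```python
from pathlib import Path


def read_text_sketch(path, strict=True):
    """Read a file as UTF-8 and return (text, raw_bytes).

    Hypothetical stand-in for the proposed modes:
    - strict: raise UnicodeDecodeError on any invalid byte sequence.
    - permissive: replace each invalid sequence with U+FFFD and hand back
      the original bytes so callers can display the raw data.
    """
    raw = Path(path).read_bytes()
    if strict:
        return raw.decode("utf-8"), raw
    return raw.decode("utf-8", errors="replace"), raw
```

Returning the raw bytes in both modes keeps the return type the same regardless of mode; a real implementation might prefer to return only the text in strict mode.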

We'll need a script that goes through all the text in the system and finds bad text.
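A first pass at such a script could simply walk a directory tree and try to decode every text file. The sketch below assumes text is stored in files under a single root, that valid text means valid UTF-8, and that the file extensions listed are the relevant ones; the function name and those defaults are guesses, not ddr-cmdln conventions.

```python
import os


def find_bad_text_files(root, extensions=(".json", ".csv", ".txt")):
    """Return a list of (path, byte_offset) for files that are not valid UTF-8.

    Hypothetical scanner: the extensions tuple is an assumption about which
    files hold text.
    """
    bad = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in sorted(filenames):
            if not name.endswith(extensions):
                continue
            path = os.path.join(dirpath, name)
            with open(path, "rb") as f:
                raw = f.read()
            try:
                raw.decode("utf-8")
            except UnicodeDecodeError as err:
                # Record only the first bad offset per file.
                bad.append((path, err.start))
    return bad
```

The result could be printed as a report or fed into whatever UI ends up flagging bad text for correction.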

Ultimately, once we get to Python 3, no non-Unicode text should enter the system at all.