DigitalCommons / open-data


DotCoop data has mojibake text #98

Open wu-lee opened 2 years ago

wu-lee commented 2 years ago

I'm seeing mojibake text like this in incoming CSV data: "Alliance Coopératives Cameroun", which propagates into the map dialogs. This is not the only case. Something upstream is mangling the encoding.

ColmMassey commented 2 years ago

Where is the best place to catch that?

wu-lee commented 2 years ago

Ideally they would give us non-mangled data so that we don't have to fix it, but as you say this might not happen very quickly. Otherwise we might demangle it ourselves, but it's not as simple as you might think.

Searching, I see there's a mojibake decoder here, which can helpfully identify how to decode the case I spotted.

https://www.linestarve.com/tools/mojibake/?mojibake=Alliance+Coop%C3%83%C2%A9ratives+Cameroun&unescape_html=auto&remove_terminal_escapes=True&fix_encoding=True&restore_byte_a0=True&replace_lossy_sequences=True&decode_inconsistent_utf8=True&fix_c1_controls=True&fix_latin_ligatures=True&fix_character_width=True&uncurl_quotes=True&fix_surrogates=True&remove_control_chars=True&normalization=NFC

However, some experiments show it's not simply a matter of reading the file with the right encoding - there are actually extra bytes injected, so I get mojibake in any case. It got very fiddly and I gave up for the moment.
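For illustration, the case above looks like one extra round of encoding: the decoder URL contains `%C3%83%C2%A9`, i.e. four bytes where a UTF-8 "é" needs only two (`C3 A9`). Below is a minimal Python sketch, assuming the UTF-8 bytes were misread as Latin-1 and then re-encoded as UTF-8 exactly once. `demangle_once` is a hypothetical helper, not existing code, and other mangled values in the feed may well need different handling.

```python
def demangle_once(text: str) -> str:
    """Undo one round of 'UTF-8 bytes read as Latin-1' mojibake.

    Assumption: the text was mangled exactly once in this way.
    If the round trip fails, the text is returned unchanged.
    """
    try:
        # Map each character back to the single byte it was misread from,
        # then decode those bytes as the UTF-8 they originally were.
        return text.encode("latin-1").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        return text


# The value from the issue, written with escapes to avoid further mangling:
# U+00C3 (A-tilde) followed by U+00A9 (copyright sign) where "é" should be.
mangled = "Alliance Coop\u00c3\u00a9ratives Cameroun"

# Reading the file as UTF-8 cannot help: the stored bytes are already
# the double-encoded form.
raw = mangled.encode("utf-8")

fixed = demangle_once(mangled)
print(raw)
print(fixed)
```

Text that is already correct usually fails the round trip and is left alone, but a genuine value that happens to contain such a character pair would be altered, so this is not safe to apply blindly to every field.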

ColmDC commented 1 year ago

Is this being addressed in the new system? @wu-lee

wu-lee commented 1 year ago

Nope.

Currently data is trusted throughout Mykomaps and the sausage factory. Mykomaps should have some protection in any case, but maybe the best place to gatekeep this is when data comes in from third-party sources?
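As a rough sketch of what gatekeeping at the point of import could look like, the Python below scans CSV text and reports cells that look double-encoded. Everything in it is an assumption for illustration: `find_suspect_cells` and the `SUSPECT` pattern are hypothetical names, the sample data is made up apart from the value from this issue, and the heuristic only covers the "Ã"/"Â" followed by a U+0080 to U+00BF character pattern seen in this case.

```python
import csv
import io
import re

# Heuristic (assumption): U+00C2 or U+00C3 followed by a character in
# U+0080..U+00BF is what a two-byte UTF-8 sequence looks like after
# being misread as Latin-1. Other kinds of mojibake are not detected.
SUSPECT = re.compile(r"[\u00c2\u00c3][\u0080-\u00bf]")


def find_suspect_cells(csv_text: str) -> list[tuple[int, int, str]]:
    """Return (row, column, value) for each cell matching SUSPECT.

    Row and column numbers are 1-based; the header row counts as row 1.
    """
    hits = []
    reader = csv.reader(io.StringIO(csv_text))
    for row_no, row in enumerate(reader, start=1):
        for col_no, cell in enumerate(row, start=1):
            if SUSPECT.search(cell):
                hits.append((row_no, col_no, cell))
    return hits


# Made-up sample input containing the mangled value from this issue.
sample = (
    "Name,Country\n"
    "Alliance Coop\u00c3\u00a9ratives Cameroun,CM\n"
    "Plain Co-op,GB\n"
)

hits = find_suspect_cells(sample)
for row_no, col_no, value in hits:
    print(f"row {row_no}, column {col_no}: {value!r}")
```

Reporting rather than silently rewriting keeps the decision about whether to repair, reject, or pass a value through with whoever runs the import.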