enso-org / enso

Enso Analytics is a self-service data prep and analysis platform designed for data teams.
https://ensoanalytics.com
Apache License 2.0

Fallback to Windows-1252 encoding with `Encoding.Default` if invalid UTF-8 characters are encountered #10148

Closed by radeusgd 5 months ago

radeusgd commented 5 months ago

PR #10130 introduces `Encoding.Default`, which uses some basic heuristics to guess the text encoding. It can detect UTF-8 and UTF-16 (LE and BE) based on the BOM. If no BOM is found, it falls back to UTF-8, which is a very common file encoding.
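A minimal sketch of that BOM heuristic, written in Python for illustration only. The names `detect_bom` and `guess_encoding` are hypothetical and are not part of the Enso API; the actual implementation in PR #10130 may differ.

```python
import codecs

# BOMs mentioned in this issue, paired with Python codec names.
_BOMS = [
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]


def detect_bom(data: bytes):
    """Return (encoding_name, bom_length), or (None, 0) if no BOM is present."""
    for bom, name in _BOMS:
        if data.startswith(bom):
            return name, len(bom)
    return None, 0


def guess_encoding(data: bytes) -> str:
    """Guess the encoding from the BOM; assume UTF-8 when there is none."""
    name, _ = detect_bom(data)
    return name or "utf-8"
```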

Still, some files are not valid UTF-8. Windows-1252 was a very popular encoding in which every byte has a 'valid' interpretation. So if we did not find a BOM and are only guessing UTF-8, and we then find invalid characters, we can try falling back to Windows-1252. It may still be a bad guess and produce garbage characters, but it will always successfully decode to something.
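To illustrate with Python's built-in codecs (an illustration, not Enso code): the bytes below are the name "Radosław" saved as ISO-8859-2. They are not valid UTF-8, and Windows-1252 decodes them without error but guesses one character wrong. One caveat of this sketch: Python's `cp1252` codec leaves five byte values unassigned (0x81, 0x8D, 0x8F, 0x90, 0x9D), so `errors="replace"` is passed to guarantee that decoding never fails.

```python
raw = b"Rados\xb3aw"  # "Radosław" encoded as ISO-8859-2


def try_utf8(data: bytes):
    """Return the decoded text, or None if the bytes are not valid UTF-8."""
    try:
        return data.decode("utf-8")
    except UnicodeDecodeError:
        return None


utf8_result = try_utf8(raw)                             # None: 0xB3 is a stray continuation byte
cp1252_result = raw.decode("cp1252", errors="replace")  # "Rados³aw": decodes, but the guess is wrong
latin2_result = raw.decode("iso-8859-2")                # "Radosław": the encoding the file really used
```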

Thus we want to modify the logic for `Encoding.Default` so that if UTF-8 was chosen only as a fallback (no BOM was found) and invalid UTF-8 characters are encountered, we restart the parsing process with the Windows-1252 encoding, in which all characters can be decoded.

This does not apply to any other encoding: if a UTF encoding was explicitly chosen, we never change it. Moreover, if we detected a UTF-8 BOM in the 'automatic' mode, we also stay with UTF-8 and report the invalid characters. Although the bytes of the UTF-8 BOM are also valid Windows-1252 characters, I think it is extremely unlikely for a valid Windows-1252 file to begin with the characters `ï»¿`, and the user can always override the encoding explicitly.
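The rules above can be sketched as follows, in Python for illustration only. The function `decode_text` and its return shape are assumptions made for this example, not the Enso implementation. The sketch also works on a whole in-memory buffer, whereas the real logic has to restart a stream.

```python
import codecs
from typing import List, Optional, Tuple

_BOMS = [
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]


def _decode_reporting(payload: bytes, encoding: str) -> Tuple[str, List[str]]:
    """Decode with a fixed encoding; report invalid data instead of switching."""
    try:
        return payload.decode(encoding), []
    except UnicodeDecodeError as e:
        problem = f"invalid {encoding} data at byte offset {e.start}"
        return payload.decode(encoding, errors="replace"), [problem]


def decode_text(data: bytes, encoding: Optional[str] = None) -> Tuple[str, str, List[str]]:
    """Return (text, encoding_used, problems)."""
    # Rule 1: an explicitly chosen encoding is never changed.
    if encoding is not None:
        text, problems = _decode_reporting(data, encoding)
        return text, encoding, problems
    # Rule 2: a detected BOM commits us to that encoding, even if errors follow.
    for bom, name in _BOMS:
        if data.startswith(bom):
            text, problems = _decode_reporting(data[len(bom):], name)
            return text, name, problems
    # Rule 3: no BOM, so UTF-8 is only a guess; on failure, start over as Windows-1252.
    try:
        return data.decode("utf-8"), "utf-8", []
    except UnicodeDecodeError:
        # errors="replace" covers the few bytes Python's cp1252 codec leaves unassigned.
        return data.decode("cp1252", errors="replace"), "windows-1252", []
```

For example, `decode_text(b"caf\xe9")` falls back to Windows-1252 and returns the text `café`, while the same bytes preceded by a UTF-8 BOM stay UTF-8 and come back with one reported problem.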

enso-bot[bot] commented 5 months ago

Radosław Waśko reports a new STANDUP for yesterday (2024-06-03):

Progress: Fixed Delimited Write in the old PR, which had been overlooked before. Made the changes requested in review. Started work on the follow-up: Win1252 fallback. It should be finished by 2024-06-07.

Next Day: Next day I will be working on the same task. Continue work on the fallback logic - restarting the stream etc.

enso-bot[bot] commented 5 months ago

Radosław Waśko reports a new STANDUP for yesterday (2024-06-04):

Progress: Previous PR finally merged. Ongoing work on the fallback mechanism - moving parts of logic from Java to Enso to be able to use some more information about the stream's source. It should be finished by 2024-06-07.

Next Day: Next day I will be working on the same task. Continue work on the fallback logic.

enso-bot[bot] commented 5 months ago

Radosław Waśko reports a new STANDUP for today (2024-06-05):

Progress: Implemented the fallback logic; tests are passing. Investigating a weird stack overflow issue. It should be finished by 2024-06-07.

Next Day: Next day I will be working on the same task. Add tests for some added helpers. Test a real use case. Try creating a repro for the stack overflow bug and report it. Try merging the Persistance / Types PRs.

enso-bot[bot] commented 5 months ago

Radosław Waśko reports a new STANDUP for the provided date (2024-06-07):

Progress: Created a repro for the bug I found. Added tests for the helpers, clarifying the requirements I want from them. Tested a real-world use case in the GUI with the fallback logic. Figured out a fix for the Persistance PR test failures. It should be finished by 2024-06-07.

Next Day: Next day I will be working on the #9980 task. Getting the pending PRs merged. Start work on next task.