enso-org / enso

Hybrid visual and textual functional programming.
https://enso.org
Apache License 2.0

Optimize Windows-1252 fallback logic in `Encoding.Default` to do only one pass on the happy path #10220

Open radeusgd opened 1 month ago

radeusgd commented 1 month ago

PR #10190 adds logic for falling back to the Windows-1252 encoding when invalid UTF-8 characters are encountered.

Current implementation

It is implemented by making two passes through the file:

  1. The first pass decodes with UTF-8 purely to check that all characters are valid; the decoded data is discarded. It exits early as soon as an invalid character is encountered.
  2. The first pass thus determines the effective encoding: either all characters were valid UTF-8 and we can use it, or we have to fall back to Windows-1252. The second pass then knows the encoding and can actually decode the data and pass it further downstream, e.g. for parsing.

This has the advantage of simplicity: downstream consumers of the decoder always work with a known encoding and need no rollback.
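The current scheme can be illustrated with a minimal Java sketch (the JVM's `CharsetDecoder` reports malformed input by default, so the validity check falls out naturally; the class and method names here are illustrative, not the actual Enso implementation):

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class TwoPassDecode {
    /** Pass 1: validate the whole input as UTF-8, discarding the decoded text. */
    static boolean isValidUtf8(byte[] data) {
        try {
            // Charset.newDecoder() uses CodingErrorAction.REPORT by default,
            // so decode() throws on the first invalid byte sequence.
            StandardCharsets.UTF_8.newDecoder().decode(ByteBuffer.wrap(data));
            return true;
        } catch (CharacterCodingException e) {
            return false; // early exit on the first invalid character
        }
    }

    /** Pass 2: decode with the encoding determined by pass 1. */
    static String decodeWithFallback(byte[] data) {
        Charset effective = isValidUtf8(data)
            ? StandardCharsets.UTF_8
            : Charset.forName("windows-1252");
        return new String(data, effective);
    }

    public static void main(String[] args) {
        byte[] validUtf8 = "żółw".getBytes(StandardCharsets.UTF_8);
        byte[] latin = {(byte) 0xFF, 'a'}; // 0xFF is never valid in UTF-8
        System.out.println(decodeWithFallback(validUtf8)); // żółw
        System.out.println(decodeWithFallback(latin));     // ÿa (Windows-1252)
    }
}
```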

Possible new implementation

The disadvantage is that the file is always processed twice, even in the 'happy' case (valid UTF-8). If we assume that most encountered files will be valid UTF-8 (or plain ASCII, which is a subset of it), always performing two passes and discarding the results of the first one seems wasteful. It might be more efficient to 'optimistically' process the data during the first pass (assuming UTF-8) and, only if that fails, restart and process again with Windows-1252.

To do so we'd need to modify how the processing is done. First of all, we need to ensure that the processing has no side-effects, so that it can be restarted from the beginning if the fallback mechanism is triggered mid-stream. Currently, the relevant processing logic is decoding to text (`decode_bytes_to_text`) and Delimited parsing (`Delimited_Reader`).

The new strategy would work as follows:

  1. Capture the closure that wants to process the decoded data - it receives a reader and returns a result.
  2. First construct a reader assuming UTF-8 encoding, and run the closure with it.
  3. If the decoding succeeds - return the result.
  4. If the decoding fails at some point - throw a panic that interrupts the execution of the closure and brings control back to the decoder code. Then restart the stream, create a new decoder with Windows-1252 encoding, and call the closure again.
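The steps above could be sketched in Java along these lines (assuming an in-memory, restartable stream; in Java the 'panic' role is played by `MalformedInputException` thrown from `read()`, and `withFallback` is a hypothetical name, not an existing API):

```java
import java.io.*;
import java.nio.charset.*;

public class OptimisticDecode {
    /** The captured closure: consumes a reader and returns a result. */
    interface ReaderFn<R> { R apply(Reader reader) throws IOException; }

    static <R> R withFallback(byte[] data, ReaderFn<R> process) throws IOException {
        // Optimistic pass: decode as UTF-8; REPORT makes read() throw on bad bytes.
        CharsetDecoder utf8 = StandardCharsets.UTF_8.newDecoder()
            .onMalformedInput(CodingErrorAction.REPORT)
            .onUnmappableCharacter(CodingErrorAction.REPORT);
        try (Reader r = new InputStreamReader(new ByteArrayInputStream(data), utf8)) {
            return process.apply(r);
        } catch (MalformedInputException e) {
            // Invalid UTF-8 hit mid-stream: the exception unwinds the closure,
            // so the closure must be side-effect free. Restart with Windows-1252.
            try (Reader r = new InputStreamReader(
                    new ByteArrayInputStream(data), Charset.forName("windows-1252"))) {
                return process.apply(r);
            }
        }
    }

    public static void main(String[] args) throws IOException {
        ReaderFn<String> readAll = r -> {
            StringBuilder sb = new StringBuilder();
            int c;
            while ((c = r.read()) != -1) sb.append((char) c);
            return sb.toString();
        };
        System.out.println(withFallback(new byte[]{'o', 'k'}, readAll));         // ok
        System.out.println(withFallback(new byte[]{(byte) 0xFF, 'a'}, readAll)); // ÿa
    }
}
```

On the happy path the file is read exactly once; the second pass only happens when invalid UTF-8 actually appears.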

Edge cases

One edge case to keep in mind where the new and old behaviour could differ: what happens if the closure (user code) fails with some panic mid-decoding? In theory, the place where the fallback mechanism would be triggered could be as far as 10000 bytes into the file, so if the user code fails during the first 500 bytes, we never 'reach' the fallback 'threshold'. This can cause different behaviour: if the user code works fine on Windows-1252 and only fails in UTF-8 mode, it would succeed under the current behaviour but fail under the new one.

If we want to keep the old behaviour, we could modify steps (2-3) above as follows: if the user code fails, we do not stop the decoding. Instead, we continue reading the file to the end - ignoring the user code - to determine whether the fallback mechanism should be triggered. If no invalid UTF-8 is encountered, we propagate the earlier user-code error. If an invalid character is encountered, we discard the user error and re-run the closure with the fallback, ensuring that an early error does not prevent us from entering the fallback logic.
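A sketch of this variant, extending the optimistic Java sketch above (again hypothetical names; the key addition is draining the reader to the end after a user-code failure to decide between propagating the error and falling back):

```java
import java.io.*;
import java.nio.charset.*;

public class EdgeCaseDecode {
    interface ReaderFn<R> { R apply(Reader reader) throws IOException; }

    static <R> R withFallbackKeepingOldBehaviour(byte[] data, ReaderFn<R> process)
            throws IOException {
        CharsetDecoder utf8 = StandardCharsets.UTF_8.newDecoder()
            .onMalformedInput(CodingErrorAction.REPORT)
            .onUnmappableCharacter(CodingErrorAction.REPORT);
        try (Reader r = new InputStreamReader(new ByteArrayInputStream(data), utf8)) {
            try {
                return process.apply(r);
            } catch (MalformedInputException bad) {
                // The closure itself hit invalid UTF-8: fall back below.
            } catch (RuntimeException userError) {
                // User code failed mid-decoding. Keep reading to the end of the
                // stream (ignoring the data) to check whether the fallback would
                // have been triggered further into the file.
                try {
                    while (r.read() != -1) { /* discard, validity check only */ }
                    throw userError; // rest was valid UTF-8: propagate user error
                } catch (MalformedInputException bad) {
                    // Invalid UTF-8 after the failure point: discard the user
                    // error and fall back below.
                }
            }
        }
        // Fallback pass: restart from the beginning with Windows-1252.
        try (Reader r = new InputStreamReader(
                new ByteArrayInputStream(data), Charset.forName("windows-1252"))) {
            return process.apply(r);
        }
    }
}
```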

On the other hand, this may be overly complicated for a very rare edge case (I'm not sure such a situation is even practically possible). So it may be better to keep the more efficient approach of simply failing early.

Comparison of implementations