PR #10190 adds logic for falling back to the Windows-1252 encoding when invalid UTF-8 sequences are encountered.
Current implementation
It is implemented by making two passes through the file:

1. The first pass decodes with UTF-8 to check that all characters are valid. This pass discards the data and only checks validity; it exits early if an invalid character is encountered.
2. The first pass has determined the effective encoding: either all characters were valid UTF-8 and we can use that, or we have to fall back to Windows-1252. The second pass now knows the encoding, so it can actually decode the data and pass it further downstream, e.g. for parsing.
This has the nice simplicity that the downstream usage of the decoder is already working with a known encoding and it needs no rollback.
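The two-pass scheme can be sketched roughly as follows. This is a Python illustration, not the actual Enso implementation; the function names `detect_encoding` and `decode_with_fallback` are hypothetical:

```python
# Hypothetical sketch of the current two-pass approach: first validate the
# whole input as UTF-8, then decode with whichever encoding pass 1 settled on.

def detect_encoding(data: bytes) -> str:
    """First pass: only checks validity, discarding the decoded text."""
    try:
        data.decode("utf-8")
        return "utf-8"
    except UnicodeDecodeError:
        # An invalid sequence was found - fall back to Windows-1252.
        return "cp1252"

def decode_with_fallback(data: bytes) -> str:
    # Second pass: the encoding is already known up front, so downstream
    # consumers of the text never have to roll back.
    encoding = detect_encoding(data)
    return data.decode(encoding)

print(decode_with_fallback(b"hello \xe9 world"))  # prints "hello é world"
```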
Possible new implementation
The disadvantage is that the file is always processed twice, even in the 'happy' case (valid UTF-8).
If we assume that most encountered files will be valid UTF-8 (or plain ASCII, which is a subset), then it seems wasteful to always perform two passes, discarding the results of the first one. It might be more efficient to 'optimistically' process the data in the first pass (assuming UTF-8) and, only if that fails, restart and try processing again with Windows-1252.
To do so we'd need to modify how the processing is done. First of all, we need to ensure that the processing has no side-effects, so that it can be restarted in the middle if the fallback mechanism is triggered. Currently, the relevant processing logic is decoding to text (decode_bytes_to_text) and Delimited parsing (Delimited_Reader).
The new strategy would work as follows:
1. Capture the closure that wants to process the decoded data - it receives a reader and returns a result.
2. First construct a reader assuming UTF-8 encoding, and run the closure with it.
3. If the decoding succeeds, return the result.
4. If the decoding fails at some point, throw a panic that interrupts the execution of that closure and brings control back to the decoder code. Then restart the stream, create a new decoder with Windows-1252 encoding, and call the closure again.
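The steps above can be sketched in Python (again as an illustration, not the Enso code; `with_decoded_reader` is a hypothetical name). A decoding error raised while the closure reads plays the role of the interrupting panic:

```python
# Sketch of the optimistic strategy: run the user closure against a UTF-8
# reader; if an invalid byte is hit mid-read, the decoding error aborts the
# closure and we restart from the beginning with Windows-1252.
# Note: the closure may run twice, so it must be free of side-effects.
import io

def with_decoded_reader(data: bytes, closure):
    try:
        # Optimistic pass: assume UTF-8. Strict error handling raises on the
        # first invalid sequence, interrupting the closure mid-processing.
        reader = io.TextIOWrapper(io.BytesIO(data), encoding="utf-8", errors="strict")
        return closure(reader)
    except UnicodeDecodeError:
        # Restart the stream with the fallback encoding and re-run the closure.
        reader = io.TextIOWrapper(io.BytesIO(data), encoding="cp1252")
        return closure(reader)

rows = with_decoded_reader(b"a,b\n1,caf\xe9\n", lambda r: r.read().splitlines())
# rows == ["a,b", "1,café"]: the UTF-8 attempt failed on the 0xE9 byte,
# so the closure was re-run with Windows-1252.
```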
Edge cases
One edge case to keep in mind, where the new and old behaviour could differ: what happens if the closure (user code) fails with some panic mid-decoding? Theoretically, the place where the fallback mechanism is triggered could be as far as 10000 bytes into the file, so if the user code fails during the first 500 bytes, we never 'reach' the fallback 'threshold'. This can cause different behaviour: if the user code works fine on Windows-1252 and only fails in UTF-8 mode, it could succeed under the current behaviour but fail with the new one.
If we want to keep the old behaviour, we could modify steps (2-3) from above in the following way: if the user code fails, we do not stop the decoding - instead we continue, ignoring the user code and just reading the file to the end to determine whether the fallback mechanism should be triggered. If no invalid UTF-8 is encountered, we propagate the earlier user-code error. If an invalid character is encountered, we discard the user error and re-run the closure with the fallback, ensuring that an early error does not prevent us from reaching the fallback logic.
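This compatibility variant could look roughly like the following sketch (hypothetical names; for simplicity it re-validates the whole input rather than only the remaining bytes):

```python
# Sketch of the variant that preserves the old behaviour: if the user closure
# fails, keep checking whether the bytes are valid UTF-8. Only if an invalid
# sequence turns up do we discard the user error and re-run the closure with
# the fallback encoding.
import io

def with_decoded_reader_compat(data: bytes, closure):
    def is_valid_utf8(b: bytes) -> bool:
        try:
            b.decode("utf-8")
            return True
        except UnicodeDecodeError:
            return False

    try:
        reader = io.TextIOWrapper(io.BytesIO(data), encoding="utf-8")
        return closure(reader)
    except UnicodeDecodeError:
        pass  # invalid UTF-8 was hit: definitely fall back
    except Exception as user_error:
        # The user code failed before the end of the file was reached. If the
        # whole input is valid UTF-8, the fallback would never have triggered,
        # so the user error is the real result and must be propagated.
        if is_valid_utf8(data):
            raise user_error
        # Otherwise the fallback would have kicked in - discard the error.
    reader = io.TextIOWrapper(io.BytesIO(data), encoding="cp1252")
    return closure(reader)
```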
On the other hand, this may be overly complicated for a very rare edge case (I'm not sure such a situation is practically possible at all), so it may be better to keep the more efficient approach of just failing early.
Comparison of implementations
The new approach can propagate errors that could not have happened if the fallback path had been chosen first (see Edge cases).
The new approach is more efficient in the UTF-8 path - it goes through the file only once.
The new approach will be less efficient on the Windows-1252 fallback path - it runs the user code twice: first on the prefix of the file that was valid UTF-8, and then again on the whole file after restarting. If the parsing logic is relatively complicated (e.g. the Delimited parser), that can cause significant overhead.
So the two have different trade-offs - neither is definitely more efficient (unless we see that the disk overhead is MUCH larger than the CSV parser overhead); it depends on whether we expect more valid UTF-8 files or more Windows-1252 files.