Open nigelmegitt opened 6 years ago
Using codecs.open with errors='ignore' doesn't fix the issue; it still arises occasionally. I need to dig further into the content that triggers it and trace it back to the source of the error. It could be related to a specific feed and the way that feed is produced.
Needs to be re-reviewed in the context of Python3, where the issue may no longer arise.
The exception seems to be caused by characters that occupy more than a single byte in UTF-8, i.e. characters with a Unicode code point > 127 (outside the ASCII range). Examples are the German umlauts äöüÄÖÜ and the "sharp s" ß. I'm affected, too.
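To illustrate the point about code points above 127, here is a small sketch showing that each of these letters takes two bytes in UTF-8 and cannot be encoded as ASCII. The ASCII encode step is only an assumption about the kind of failure involved; the actual traceback is not shown in this thread.

```python
# -*- coding: utf-8 -*-
# Sketch only: shows why the letters mentioned above are "problematic".
problem_chars = u"äöüÄÖÜß"

# Number of bytes each character needs in UTF-8.
utf8_lengths = {ch: len(ch.encode("utf-8")) for ch in problem_chars}
for ch in problem_chars:
    print(repr(ch), ord(ch), utf8_lengths[ch])

# Assumption: somewhere in the pipeline the text is encoded with an
# ASCII-only codec, which fails for any code point > 127.
ascii_failed = False
try:
    u"Grüße".encode("ascii")
except UnicodeEncodeError as exc:
    ascii_failed = True
    print("encode failed:", exc.reason)
```

Plain ASCII letters such as "a" would need one byte and encode without error, which matches the observation that only some documents trigger the exception.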
With codecs.open and encoding='utf-8' (taken from Python 2's Unicode HOWTO), tested with the filesystem output, the exception doesn't occur.
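A minimal sketch of that approach, for reference. The file name and the sample text are made up for illustration; this is not the actual filesystem output code.

```python
# -*- coding: utf-8 -*-
import codecs
import os
import tempfile

text = u"Grüße aus München: äöüÄÖÜß"

# Hypothetical output path; the real filesystem output picks its own names.
path = os.path.join(tempfile.mkdtemp(), "document.xml")

# Writing with an explicit encoding lets the file object encode the text
# as UTF-8 instead of falling back to a default codec.
with codecs.open(path, "w", encoding="utf-8") as handle:
    handle.write(text)

# Reading back with the same encoding should return the original text.
with codecs.open(path, "r", encoding="utf-8") as handle:
    round_tripped = handle.read()

print(round_tripped == text)
```

The same call works on Python 3, although the built-in open() with an encoding argument is the more usual choice there.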
Sounds promising @spoeschel, does this mean you can generate a test case? That would be great because even if we fix it for Python2, we will also need to check that it still works in Python3 when we migrate.
@spoeschel I made a comment in #484 a long long time ago suggesting this was worth re-testing in Python3. I don't know if Python3 would work for you, but I've pushed a working Python3 build to the release/3.0 branch; if you have a repeatable test case would you be interested in trying that branch and seeing if this bug is indeed resolved by moving to Python3?
I haven't yet familiarized myself with the testing subsystem, but I will create a test case for this.
Testing with the Python 3 branch, this issue indeed no longer occurs when using one of the German letters mentioned above.
However I get an exception when using the WebSocket output with the Python 3 branch (the WS input works), regardless of using any of the problematic letters or not. The filesystem output works though. I will have a look into that and probably open a new issue.
Thank you @spoeschel !
With codecs.open and encoding='utf-8' (taken from Python 2's Unicode HOWTO), tested with the filesystem output, the exception doesn't occur.
It just turned out that this quick fix for the Python 2 branch only worked when I used the Resequencer. With the buffer-delay, the exception still occurs even though the UTF-8 encoding is set for writing the output file. So it seems that the Resequencer's processing somehow sanitizes the documents, and that received documents cannot be forwarded to the output without such further processing without triggering the exception. So the Python 3 route is probably the easiest way to go here.
I think this is a strong argument for tying up the release/2.1.2 work, releasing it as our final Python2 release and moving all future work into release/3.0.
I agree; this makes more sense than fixing a complex issue for a Python version that will be deprecated very soon anyway.
Using the EBU-TT-D Encoder I'm occasionally getting Unicode errors like:
This is annoying. I don't know what's causing it, but there's probably an easy fix (though possibly a dangerous one) - https://docs.python.org/2.7/howto/unicode.html#the-unicode-type suggests that using codecs.open and setting errors='ignore' will at least make the error go away...
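A rough sketch of why errors='ignore' is the "dangerous" kind of fix: it suppresses the exception by silently dropping whatever cannot be encoded. The ASCII target encoding and the file name are assumptions chosen to make the effect visible; they are not taken from the encoder's code.

```python
# -*- coding: utf-8 -*-
import codecs
import os
import tempfile

text = u"Grüße"

# Hypothetical output path used only for this demonstration.
path = os.path.join(tempfile.mkdtemp(), "lossy.txt")

# Assumption: the output codec is ASCII. With errors='ignore' no exception
# is raised, but the characters that cannot be encoded are discarded.
with codecs.open(path, "w", encoding="ascii", errors="ignore") as handle:
    handle.write(text)

with codecs.open(path, "r", encoding="ascii") as handle:
    written = handle.read()

print(repr(written))  # the umlaut and the sharp s are gone
```

So the error goes away, but subtitle text containing such letters would be written out incomplete.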