Occasional encoding errors

nigelmegitt commented 6 years ago

Using the EBU-TT-D Encoder I'm occasionally getting Unicode errors like:

Unhandled Error
Traceback (most recent call last):
  File "/Users/megitn02/Code/ebu/ebu-tt-live-toolkit/venv/lib/python2.7/site-packages/twisted/python/log.py", line 103, in callWithLogger
    return callWithContext({"system": lp}, func, *args, **kw)
  File "/Users/megitn02/Code/ebu/ebu-tt-live-toolkit/venv/lib/python2.7/site-packages/twisted/python/log.py", line 86, in callWithContext
    return context.call({ILogContext: newCtx}, func, *args, **kw)
  File "/Users/megitn02/Code/ebu/ebu-tt-live-toolkit/venv/lib/python2.7/site-packages/twisted/python/context.py", line 122, in callWithContext
    return self.currentContext().callWithContext(ctx, func, *args, **kw)
  File "/Users/megitn02/Code/ebu/ebu-tt-live-toolkit/venv/lib/python2.7/site-packages/twisted/python/context.py", line 85, in callWithContext
    return func(*args,**kw)
--- <exception caught here> ---
  File "/Users/megitn02/Code/ebu/ebu-tt-live-toolkit/venv/lib/python2.7/site-packages/twisted/internet/selectreactor.py", line 149, in _doReadOrWrite
    why = getattr(selectable, method)()
  File "/Users/megitn02/Code/ebu/ebu-tt-live-toolkit/venv/lib/python2.7/site-packages/twisted/internet/tcp.py", line 208, in doRead
    return self._dataReceived(data)
  File "/Users/megitn02/Code/ebu/ebu-tt-live-toolkit/venv/lib/python2.7/site-packages/twisted/internet/tcp.py", line 214, in _dataReceived
    rval = self.protocol.dataReceived(data)
  File "/Users/megitn02/Code/ebu/ebu-tt-live-toolkit/venv/lib/python2.7/site-packages/autobahn/twisted/websocket.py", line 131, in dataReceived
    self._dataReceived(data)
  File "/Users/megitn02/Code/ebu/ebu-tt-live-toolkit/venv/lib/python2.7/site-packages/autobahn/websocket/protocol.py", line 1175, in _dataReceived
    self.consumeData()
  File "/Users/megitn02/Code/ebu/ebu-tt-live-toolkit/venv/lib/python2.7/site-packages/autobahn/websocket/protocol.py", line 1187, in consumeData
    while self.processData() and self.state != WebSocketProtocol.STATE_CLOSED:
  File "/Users/megitn02/Code/ebu/ebu-tt-live-toolkit/venv/lib/python2.7/site-packages/autobahn/websocket/protocol.py", line 1553, in processData
    fr = self.onFrameEnd()
  File "/Users/megitn02/Code/ebu/ebu-tt-live-toolkit/venv/lib/python2.7/site-packages/autobahn/websocket/protocol.py", line 1674, in onFrameEnd
    self._onMessageEnd()
  File "/Users/megitn02/Code/ebu/ebu-tt-live-toolkit/venv/lib/python2.7/site-packages/autobahn/twisted/websocket.py", line 159, in _onMessageEnd
    self.onMessageEnd()
  File "/Users/megitn02/Code/ebu/ebu-tt-live-toolkit/venv/lib/python2.7/site-packages/autobahn/websocket/protocol.py", line 627, in onMessageEnd
    self._onMessage(payload, self.message_is_binary)
  File "/Users/megitn02/Code/ebu/ebu-tt-live-toolkit/venv/lib/python2.7/site-packages/autobahn/twisted/websocket.py", line 162, in _onMessage
    self.onMessage(payload, isBinary)
  File "/Users/megitn02/Code/ebu/ebu-tt-live-toolkit/ebu_tt_live/twisted/websocket.py", line 362, in onMessage
    self._write_to_consumer(payload, sequence_identifier=self._sequence_identifier)
  File "/Users/megitn02/Code/ebu/ebu-tt-live-toolkit/ebu_tt_live/twisted/websocket.py", line 111, in _write_to_consumer
    self.consumer.write(data, **kwargs)
  File "/Users/megitn02/Code/ebu/ebu-tt-live-toolkit/ebu_tt_live/twisted/websocket.py", line 208, in write
    self._custom_consumer.on_new_data(data, **kwargs)
  File "/Users/megitn02/Code/ebu/ebu-tt-live-toolkit/ebu_tt_live/carriage/websocket.py", line 32, in on_new_data
    self.consumer_node.process_document(data, **kwargs)
  File "/Users/megitn02/Code/ebu/ebu-tt-live-toolkit/ebu_tt_live/adapters/node_carriage.py", line 174, in process_document
    self.consumer_node.process_document(conv_doc, **new_kwargs)
  File "/Users/megitn02/Code/ebu/ebu-tt-live-toolkit/ebu_tt_live/node/encoder.py", line 48, in process_document
    self.producer_carriage.emit_data(data=converted_doc, sequence_identifier='default', time_base='media', **kwargs)
  File "/Users/megitn02/Code/ebu/ebu-tt-live-toolkit/ebu_tt_live/adapters/node_carriage.py", line 116, in emit_data
    self.producer_carriage.emit_data(conv_data, **new_kwargs)
  File "/Users/megitn02/Code/ebu/ebu-tt-live-toolkit/ebu_tt_live/carriage/filesystem.py", line 158, in emit_data
    destfile.write(data)
exceptions.UnicodeEncodeError: 'ascii' codec can't encode character u'\xa3' in position 1523: ordinal not in range(128)

This is annoying. I don't know what's causing it, but there's probably an easy fix (though possibly a dangerous one) - https://docs.python.org/2.7/howto/unicode.html#the-unicode-type suggests using codecs.open and setting errors='ignore' will at least make the error go away...
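A minimal sketch of the failure mode in the last line of the traceback, and of why errors='ignore' is risky. The sample string is invented for illustration; only the character u'\xa3' comes from the traceback.

```python
# -*- coding: utf-8 -*-
# Illustrative text only: the pound sign (u'\xa3') is the character named
# in the traceback, the rest of the string is made up.
text = u'Price: \xa3' + u'5'

# Encoding non-ASCII text with the 'ascii' codec fails. Python 2 does this
# implicitly when a unicode object is written to a file opened with open().
try:
    text.encode('ascii')
    error = None
except UnicodeEncodeError as exc:
    error = exc

# errors='ignore' makes the exception go away, but only by silently
# dropping the character from the output.
lossy = text.encode('ascii', 'ignore')

# UTF-8 can represent the character, using two bytes for it.
utf8 = text.encode('utf-8')
```

With errors='ignore' the subtitle text itself would be altered, which is why this is only a way to hide the error, not to fix it.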

nigelmegitt commented 6 years ago

Using codecs.open with errors='ignore' doesn't fix the issue - it still sometimes arises. Need to do more digging into the content that triggers it and trace back to the source of the error. It could be something to do with a specific feed and the way that is made.

nigelmegitt commented 5 years ago

Needs to be re-reviewed in the context of Python3, where the issue may no longer arise.

spoeschel commented 4 years ago

The exception seems to be caused by non-ASCII characters, i.e. characters with a Unicode code point > 127, which occupy more than a single byte in UTF-8. Examples are the German umlauts äöüÄÖÜ and the "sharp s" ß, so I'm affected, too.

With codecs.open and encoding='utf-8' (taken from Python 2's Unicode HOWTO), tested with the filesystem output, the exception doesn't occur.
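A sketch of the codecs.open approach described above. The helper name write_document, the sample document and the temporary path are hypothetical; this is not the toolkit's actual filesystem carriage code.

```python
# -*- coding: utf-8 -*-
import codecs
import os
import tempfile


def write_document(path, data):
    """Hypothetical helper: write unicode text to a file as UTF-8."""
    # codecs.open wraps the file so unicode data is encoded as UTF-8
    # instead of going through the implicit ASCII codec.
    with codecs.open(path, 'w', encoding='utf-8') as destfile:
        destfile.write(data)


# Made-up document fragment containing the characters mentioned in this
# thread: a-umlaut, o-umlaut, u-umlaut, sharp s and the pound sign.
doc = u'<tt:span>\xe4\xf6\xfc\xdf \xa3</tt:span>'

path = os.path.join(tempfile.mkdtemp(), 'doc.xml')
write_document(path, doc)

# Read the raw bytes back to check what was actually written.
with open(path, 'rb') as f:
    raw = f.read()
```

Each of the five non-ASCII characters is stored as two bytes, and decoding the file as UTF-8 gives back the original text.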

nigelmegitt commented 4 years ago

Sounds promising @spoeschel , does this mean you can generate a test case? That would be great because even if we fix it for Python2, we will also need to check it still works in Python3 when we migrate.

nigelmegitt commented 4 years ago

@spoeschel I made a comment in #484 a long long time ago suggesting this was worth re-testing in Python3. I don't know if Python3 would work for you, but I've pushed a working Python3 build to the release/3.0 branch; if you have a repeatable test case would you be interested in trying that branch and seeing if this bug is indeed resolved by moving to Python3?

spoeschel commented 4 years ago

I haven't yet familiarized myself with the testing subsystem, but I will create a test case for this.

Testing with the Python 3 branch this issue indeed no longer occurs when using one of the German letters mentioned above.

However I get an exception when using the WebSocket output with the Python 3 branch (the WS input works), regardless of using any of the problematic letters or not. The filesystem output works though. I will have a look into that and probably open a new issue.

nigelmegitt commented 4 years ago

Thank you @spoeschel !

spoeschel commented 4 years ago

Following up on my earlier comment: "With codecs.open and encoding='utf-8' (taken from Python 2's Unicode HOWTO), tested with the filesystem output, the exception doesn't occur."

It turns out that this quick fix for the Python 2 branch only worked when I used the Resequencer. With the buffer-delay, the exception still occurs even though UTF-8 encoding is set for writing the output file. So the Resequencer's processing seems to sanitize the documents somehow, whereas documents forwarded to the output without such processing still trigger the exception. It is therefore probably easiest to go the Python 3 way here.
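Nothing in this thread establishes which type of data the buffer-delay path hands to the file writer, so the following is only a guess at a defensive pattern, not a confirmed fix. In Python 2, mixing byte strings and unicode strings triggers implicit ASCII conversions, so one common approach is to normalise everything to unicode text at the boundary before writing. The helper name to_text is hypothetical.

```python
# -*- coding: utf-8 -*-

def to_text(data, encoding='utf-8'):
    """Hypothetical helper: return data as unicode text.

    Byte strings are decoded with the given encoding (UTF-8 is an
    assumption about the feed); text is returned unchanged.
    """
    if isinstance(data, bytes):  # 'bytes' is an alias of 'str' in Python 2
        return data.decode(encoding)
    return data


# Both inputs end up as the same unicode text.
from_bytes = to_text(b'\xc2\xa3')  # UTF-8 bytes for the pound sign
from_text = to_text(u'\xa3')
```

Whether this would help in the buffer-delay case is untested here.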

nigelmegitt commented 4 years ago

I think this is a strong argument for tying up the release/2.1.2 work, releasing it as our final Python2 release and moving all future work into release/3.0.

spoeschel commented 4 years ago

I agree; this makes more sense than fixing a complex issue for a Python version that will be deprecated very soon anyway.
