Open brighton1101 opened 3 years ago
@brighton1101, won't the leading b cause issues with decoding?
I think this will turn it into a string and we don't have to worry about bytes after that?
@vchiapaikeo I think that hack is alright, but the leading b means that it is a bytes literal. By design it has nothing to do with encoding or decoding, since the contents are raw bytes. From the Python docs:
Bytes literals are always prefixed with 'b' or 'B'; they produce an instance of the bytes type instead of the str type. They may only contain ASCII characters; bytes with a numeric value of 128 or greater must be expressed with escapes.
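The rule quoted above can be checked directly. This is only an illustrative sketch; the variable names are made up for the example.

```python
# A bytes literal produces a bytes object, not a str.
literal = b"abc"

# A non-ASCII character inside a bytes literal is rejected at compile time.
try:
    compile('b"™"', "<example>", "eval")
    non_ascii_rejected = False
except SyntaxError:
    non_ascii_rejected = True

# The same three bytes written with escapes are accepted.
escaped = b"\xe2\x84\xa2"
```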
AFAIK, calling str on a bytes object just returns its literal representation as a string, like below:
>>> tm = "™".encode("utf-8")
>>> print(tm)
b'\xe2\x84\xa2'
>>> print(str(tm)) # here is what we'd be doing
b'\xe2\x84\xa2'
>>> str(tm)
"b'\\xe2\\x84\\xa2'"
>>> tm.decode("utf-8")
'™'
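The three conversions in the session above can be compared side by side in a short script. This is a sketch for illustration; the variable names are not taken from boundary-layer.

```python
tm = "™".encode("utf-8")  # b'\xe2\x84\xa2'

# str() with no encoding argument returns the repr of the bytes object,
# including the leading b and the surrounding quotes.
as_repr = str(tm)

# decode(), or str() with an explicit encoding, interprets the bytes as text.
as_text = tm.decode("utf-8")
as_text_alt = str(tm, "utf-8")

print(as_repr)                     # b'\xe2\x84\xa2'
print(len(as_repr), len(as_text))  # 15 1
```

The length difference makes the point: the str() result is 15 characters of escaped source text, while the decoded result is the single character the bytes stand for.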
That operator is expecting bytes. I feel like it would be a regression for boundary-layer to pass a parameter of the wrong type just because it currently works.
Also, for context: this likely wasn't in boundary-layer before because Python 2.x did not have a separate bytes type (there, b'...' is just a str). Since we don't support Python 2 anymore, we don't have to worry about this.
I think the problem here is that if you wrap a bytes value in str(), it will fail to base64-decode later on. Example:
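>>> import base64
>>> encoded_bytes = base64.b64encode('{"name": "admin_threads", "version": 1, "send_to_bigquery": 1, "export_from_bigquery": 1, "copy_from_bigquery": 1, "dataflow_service_account": "dataflow-dev-pii@etsy-hadoop-sandbox-dev.iam.gserviceaccount.com", "bq_project": "etsy-data-warehouse-prod"}'.encode('utf-8'))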
>>>
>>> encoded_bytes
b'eyJuYW1lIjogImFkbWluX3RocmVhZHMiLCAidmVyc2lvbiI6IDEsICJzZW5kX3RvX2JpZ3F1ZXJ5IjogMSwgImV4cG9ydF9mcm9tX2JpZ3F1ZXJ5IjogMSwgImNvcHlfZnJvbV9iaWdxdWVyeSI6IDEsICJkYXRhZmxvd19zZXJ2aWNlX2FjY291bnQiOiAiZGF0YWZsb3ctZGV2LXBpaUBldHN5LWhhZG9vcC1zYW5kYm94LWRldi5pYW0uZ3NlcnZpY2VhY2NvdW50LmNvbSIsICJicV9wcm9qZWN0IjogImV0c3ktZGF0YS13YXJlaG91c2UtcHJvZCJ9'
>>>
>>> str(encoded_bytes)
"b'eyJuYW1lIjogImFkbWluX3RocmVhZHMiLCAidmVyc2lvbiI6IDEsICJzZW5kX3RvX2JpZ3F1ZXJ5IjogMSwgImV4cG9ydF9mcm9tX2JpZ3F1ZXJ5IjogMSwgImNvcHlfZnJvbV9iaWdxdWVyeSI6IDEsICJkYXRhZmxvd19zZXJ2aWNlX2FjY291bnQiOiAiZGF0YWZsb3ctZGV2LXBpaUBldHN5LWhhZG9vcC1zYW5kYm94LWRldi5pYW0uZ3NlcnZpY2VhY2NvdW50LmNvbSIsICJicV9wcm9qZWN0IjogImV0c3ktZGF0YS13YXJlaG91c2UtcHJvZCJ9'"
>>>
>>> # Now try to decode this
...
>>>
>>> base64.b64decode(str(encoded_bytes))
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/lib64/python3.6/base64.py", line 87, in b64decode
    return binascii.a2b_base64(s)
binascii.Error: Incorrect padding
Alternatively:
>>> encoded_bytes = base64.b64encode('{"name": "admin_threads", "version": 1, "send_to_bigquery": 1, "export_from_bigquery": 1, "copy_from_bigquery": 1, "dataflow_service_account": "dataflow-dev-pii@etsy-hadoop-sandbox-dev.iam.gserviceaccount.com", "bq_project": "etsy-data-warehouse-prod"}'.encode('utf-8'))
>>>
>>> encoded_bytes
b'eyJuYW1lIjogImFkbWluX3RocmVhZHMiLCAidmVyc2lvbiI6IDEsICJzZW5kX3RvX2JpZ3F1ZXJ5IjogMSwgImV4cG9ydF9mcm9tX2JpZ3F1ZXJ5IjogMSwgImNvcHlfZnJvbV9iaWdxdWVyeSI6IDEsICJkYXRhZmxvd19zZXJ2aWNlX2FjY291bnQiOiAiZGF0YWZsb3ctZGV2LXBpaUBldHN5LWhhZG9vcC1zYW5kYm94LWRldi5pYW0uZ3NlcnZpY2VhY2NvdW50LmNvbSIsICJicV9wcm9qZWN0IjogImV0c3ktZGF0YS13YXJlaG91c2UtcHJvZCJ9'
>>>
>>> encoded_bytes.decode('utf-8')
'eyJuYW1lIjogImFkbWluX3RocmVhZHMiLCAidmVyc2lvbiI6IDEsICJzZW5kX3RvX2JpZ3F1ZXJ5IjogMSwgImV4cG9ydF9mcm9tX2JpZ3F1ZXJ5IjogMSwgImNvcHlfZnJvbV9iaWdxdWVyeSI6IDEsICJkYXRhZmxvd19zZXJ2aWNlX2FjY291bnQiOiAiZGF0YWZsb3ctZGV2LXBpaUBldHN5LWhhZG9vcC1zYW5kYm94LWRldi5pYW0uZ3NlcnZpY2VhY2NvdW50LmNvbSIsICJicV9wcm9qZWN0IjogImV0c3ktZGF0YS13YXJlaG91c2UtcHJvZCJ9'
>>>
>>> decoded_bytes = encoded_bytes.decode('utf-8')
>>>
>>> # This will decode properly
...
>>> base64.b64decode(decoded_bytes)
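b'{"name": "admin_threads", "version": 1, "send_to_bigquery": 1, "export_from_bigquery": 1, "copy_from_bigquery": 1, "dataflow_service_account": "dataflow-dev-pii@etsy-hadoop-sandbox-dev.iam.gserviceaccount.com", "bq_project": "etsy-data-warehouse-prod"}'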
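The two sessions above suggest a simple rule: turn base64 output into text with decode(), not with str(). A minimal sketch of that round trip follows. `encode_config` and `decode_config` are hypothetical helper names, not part of boundary-layer or Airflow, and the payload is a shortened made-up example. The failing case uses `validate=True` because the default lenient mode of `b64decode` discards non-alphabet characters such as the quotes, so whether it raises can depend on the payload.

```python
import base64
import binascii
import json


def encode_config(config: dict) -> str:
    """Serialize a dict to JSON and return it as base64 text (hypothetical helper)."""
    raw = json.dumps(config).encode("utf-8")
    # b64encode returns bytes; base64 output is pure ASCII, so decode() is safe.
    return base64.b64encode(raw).decode("ascii")


def decode_config(encoded: str) -> dict:
    """Inverse of encode_config; rejects anything that is not strict base64."""
    return json.loads(base64.b64decode(encoded, validate=True))


config = {"name": "admin_threads", "version": 1}
encoded = encode_config(config)
assert decode_config(encoded) == config

# Wrapping the bytes in str() keeps the b'...' wrapper, which is not valid base64.
broken = str(base64.b64encode(json.dumps(config).encode("utf-8")))
try:
    base64.b64decode(broken, validate=True)
    str_wrapper_rejected = False
except binascii.Error:
    str_wrapper_rejected = True
```

Decoding with "ascii" rather than "utf-8" is a deliberate choice in this sketch: base64 output never contains non-ASCII bytes, so either codec gives the same text.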
Ah I think your code snippet only works this way in Py2 and not Py3
Closes #79
One of our operators has an optional field that requires bytes. If that field is populated, it will error out when it gets formatted. This should fix that.
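As a rough illustration of what formatting a bytes-valued field into generated Python source could look like, here is a sketch under stated assumptions. `render_literal` is a hypothetical name and this is not boundary-layer's actual formatting code; it only relies on the fact that repr() of a bytes object is itself a valid bytes literal.

```python
import ast


def render_literal(value):
    """Render a simple Python value as source text (hypothetical helper)."""
    if isinstance(value, (bytes, str, int, bool, type(None))):
        # repr() of bytes yields a b'...' literal, so the generated code
        # hands the operator real bytes rather than a str.
        return repr(value)
    raise TypeError("unsupported literal type: {}".format(type(value).__name__))


rendered = render_literal("™".encode("utf-8"))
print(rendered)  # b'\xe2\x84\xa2'

# Evaluating the rendered text gives back the original bytes object.
round_tripped = ast.literal_eval(rendered)
```

Keeping the leading b in the rendered text is what preserves the type here: the operator receives bytes, which is what it declares it expects.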