etsy / boundary-layer

Builds Airflow DAGs from configuration files. Powers all DAGs on the Etsy Data Platform
Apache License 2.0
262 stars 58 forks source link

[bugfix] Fix handling of bytes #124

Open brighton1101 opened 3 years ago

brighton1101 commented 3 years ago

Closes #79

One of our operators has an optional field that requires bytes. If that field is populated, it will error out when it gets formatted. This should fix that.

vchiapaikeo commented 3 years ago

@brighton1101 , won't the leading b cause issues with decoding?

vchiapaikeo commented 3 years ago

I think this will turn it into a string and we don't have to worry about bytes after that?

https://github.com/etsy/boundary-layer/pull/125/files

brighton1101 commented 3 years ago

@vchiapaikeo I think that hack is alright, but the leading b means that it is a byte literal. It has nothing to do with encoding/decoding by design, since the contents are only bytes. Python docs.

Bytes literals are always prefixed with 'b' or 'B'; they produce an instance of the bytes type instead of the str type. They may only contain ASCII characters; bytes with a numeric value of 128 or greater must be expressed with escapes.

Afaik Calling str on a byte literal just outputs the byte literal string like below:

>>> tm = "™".encode("utf-8")
>>> print(tm)
b'\xe2\x84\xa2'
>>> print(str(tm)) # here is what we'd be doing
b'\xe2\x84\xa2'
>>> str(tm)
"b'\\xe2\\x84\\xa2'"
>>> tm.decode("utf-8")
'™'

That operator is expecting bytes. I feel like it is a regression for boundary-layer to pass the incorrect parameter for a type just because it currently works.

brighton1101 commented 3 years ago

Also - for context this likely wasn't in boundary layer before because Python 2.x did not support byte literals. Since we don't support python 2 anymore, we don't have to worry about this.

vchiapaikeo commented 3 years ago

I think the problem here is that if you call the str function around a bytes type, it will fail to b64 decode later on. Example:

>>> encoded_bytes = base64.b64encode('{"name": "admin_threads", "version": 1, "send_to_bigquery": 1, "export_from_bigquery": 1, "copy_from_bigquery": 1}'.encode('utf-8'))
>>> 
>>> encoded_bytes
b'eyJuYW1lIjogImFkbWluX3RocmVhZHMiLCAidmVyc2lvbiI6IDEsICJzZW5kX3RvX2JpZ3F1ZXJ5IjogMSwgImV4cG9ydF9mcm9tX2JpZ3F1ZXJ5IjogMSwgImNvcHlfZnJvbV9iaWdxdWVyeSI6IDEsICJkYXRhZmxvd19zZXJ2aWNlX2FjY291bnQiOiAiZGF0YWZsb3ctZGV2LXBpaUBldHN5LWhhZG9vcC1zYW5kYm94LWRldi5pYW0uZ3NlcnZpY2VhY2NvdW50LmNvbSIsICJicV9wcm9qZWN0IjogImV0c3ktZGF0YS13YXJlaG91c2UtcHJvZCJ9'
>>> 
>>> str(encoded_bytes)
"b'eyJuYW1lIjogImFkbWluX3RocmVhZHMiLCAidmVyc2lvbiI6IDEsICJzZW5kX3RvX2JpZ3F1ZXJ5IjogMSwgImV4cG9ydF9mcm9tX2JpZ3F1ZXJ5IjogMSwgImNvcHlfZnJvbV9iaWdxdWVyeSI6IDEsICJkYXRhZmxvd19zZXJ2aWNlX2FjY291bnQiOiAiZGF0YWZsb3ctZGV2LXBpaUBldHN5LWhhZG9vcC1zYW5kYm94LWRldi5pYW0uZ3NlcnZpY2VhY2NvdW50LmNvbSIsICJicV9wcm9qZWN0IjogImV0c3ktZGF0YS13YXJlaG91c2UtcHJvZCJ9'"
>>> 
>>> # Now try to decode this
... 
>>> 
>>> base64.b64decode(str(encoded_bytes))
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/lib64/python3.6/base64.py", line 87, in b64decode
    return binascii.a2b_base64(s)
binascii.Error: Incorrect padding

Alternatively:

>>> encoded_bytes = base64.b64encode('{"name": "admin_threads", "version": 1, "send_to_bigquery": 1, "export_from_bigquery": 1, "copy_from_bigquery": 1, "dataflow_service_account": "dataflow-dev-pii@etsy-hadoop-sandbox-dev.iam.gserviceaccount.com", "bq_project": "etsy-data-warehouse-prod"}'.encode('utf-8'))
>>> 
>>> encoded_bytes
b'eyJuYW1lIjogImFkbWluX3RocmVhZHMiLCAidmVyc2lvbiI6IDEsICJzZW5kX3RvX2JpZ3F1ZXJ5IjogMSwgImV4cG9ydF9mcm9tX2JpZ3F1ZXJ5IjogMSwgImNvcHlfZnJvbV9iaWdxdWVyeSI6IDEsICJkYXRhZmxvd19zZXJ2aWNlX2FjY291bnQiOiAiZGF0YWZsb3ctZGV2LXBpaUBldHN5LWhhZG9vcC1zYW5kYm94LWRldi5pYW0uZ3NlcnZpY2VhY2NvdW50LmNvbSIsICJicV9wcm9qZWN0IjogImV0c3ktZGF0YS13YXJlaG91c2UtcHJvZCJ9'
>>> 
>>> encoded_bytes.decode('utf-8')
'eyJuYW1lIjogImFkbWluX3RocmVhZHMiLCAidmVyc2lvbiI6IDEsICJzZW5kX3RvX2JpZ3F1ZXJ5IjogMSwgImV4cG9ydF9mcm9tX2JpZ3F1ZXJ5IjogMSwgImNvcHlfZnJvbV9iaWdxdWVyeSI6IDEsICJkYXRhZmxvd19zZXJ2aWNlX2FjY291bnQiOiAiZGF0YWZsb3ctZGV2LXBpaUBldHN5LWhhZG9vcC1zYW5kYm94LWRldi5pYW0uZ3NlcnZpY2VhY2NvdW50LmNvbSIsICJicV9wcm9qZWN0IjogImV0c3ktZGF0YS13YXJlaG91c2UtcHJvZCJ9'
>>> 
>>> decoded_bytes = encoded_bytes.decode('utf-8')
>>> 
>>> # This will decode properly
... 
>>> base64.b64decode(decoded_bytes)
b'{"name": "admin_threads", "version": 1, "send_to_bigquery": 1, "export_from_bigquery": 1, "copy_from_bigquery": 1}'
vchiapaikeo commented 3 years ago

Ah I think your code snippet only works this way in Py2 and not Py3

>>> tm = "™".encode("utf-8")
>>> print(tm)
b'\xe2\x84\xa2'
>>> print(str(tm)) # here is what we'd be doing
b'\xe2\x84\xa2'
>>> str(tm)
"b'\\xe2\\x84\\xa2'"
>>> tm.decode("utf-8")
'™'