Azure / azure-cli

Azure Command-Line Interface
MIT License

Azure Storage Entity Query (CLI) Crashes for Binary Data #8021

Closed Andrei15193 closed 5 years ago

Andrei15193 commented 5 years ago

Describe the bug
When querying entities that have binary data properties (Edm.Binary type), the command line interface crashes for some values.

To Reproduce

  1. Create a test table
  2. Insert an entity containing all possible byte values
  3. Query the table

Expected behavior
The command is expected to run and return binary properties encoded in Base64. I expect Base64 because the value must be passed in that encoding when inserting the entity.

Environment summary
Install method: MSI (for Windows)
CLI version: azure-cli (2.0.52)
OS: Windows 10
Shell Type: PowerShell

Additional context
Below is the script I used to test how binary data is stored. The documentation is not clear; I guessed it required Base64 from the JSON payload sample in "Payload Format for Table Service Operations - JSON Format".

$connectionString = "storage emulator or actual storage account connection string"
$tableName = "BinaryDataTestTable"

$binaryData = [byte[]](0..[byte]::MaxValue) # inserting only values 0..3 works

# Ensure test table exists
az storage table create `
    --name $tableName `
    --connection-string $connectionString

# Ensure the test entity exists
# Make sure the property type is specified otherwise it is Edm.String
az storage entity insert `
    --table-name $tableName `
    --entity `
        "PartitionKey=partitionKey" `
        "RowKey=RowKey" `
        "BinaryProperty=$([Convert]::ToBase64String($binaryData))" `
        "BinaryProperty@odata.type=Edm.Binary" `
    --if-exists replace `
    --connection-string $connectionString

# Query the table, should work but it fails for binary data
az storage entity query `
    --table-name $tableName `
    --connection-string $connectionString

Andrei15193 commented 5 years ago

I have checked the REST API for binary data and the response uses Base64 strings as well. The issue is in the command line tool itself.

williexu commented 5 years ago

@Andrei15193 what error are you seeing?

Andrei15193 commented 5 years ago

@williexu here's the full error message I get:

'utf-8' codec can't decode byte 0x80 in position 128: invalid start byte
Traceback (most recent call last):
  File "C:\Users\VSSADM~1\AppData\Local\Temp\pip-install-e5ce5s7r\knack\knack\cli.py", line 212, in invoke
  File "C:\Users\VSSADM~1\AppData\Local\Temp\pip-install-e5ce5s7r\knack\knack\output.py", line 132, in out
  File "C:\Users\VSSADM~1\AppData\Local\Temp\pip-install-e5ce5s7r\knack\knack\output.py", line 40, in format_json
  File "C:\Program Files (x86)\Microsoft SDKs\Azure\CLI2\lib\json\__init__.py", line 238, in dumps
    **kw).encode(obj)
  File "C:\Program Files (x86)\Microsoft SDKs\Azure\CLI2\lib\json\encoder.py", line 201, in encode
    chunks = list(chunks)
  File "C:\Program Files (x86)\Microsoft SDKs\Azure\CLI2\lib\json\encoder.py", line 430, in _iterencode
    yield from _iterencode_dict(o, _current_indent_level)
  File "C:\Program Files (x86)\Microsoft SDKs\Azure\CLI2\lib\json\encoder.py", line 404, in _iterencode_dict
    yield from chunks
  File "C:\Program Files (x86)\Microsoft SDKs\Azure\CLI2\lib\json\encoder.py", line 325, in _iterencode_list
    yield from chunks
  File "C:\Program Files (x86)\Microsoft SDKs\Azure\CLI2\lib\json\encoder.py", line 404, in _iterencode_dict
    yield from chunks
  File "C:\Program Files (x86)\Microsoft SDKs\Azure\CLI2\lib\json\encoder.py", line 404, in _iterencode_dict
    yield from chunks
  File "C:\Program Files (x86)\Microsoft SDKs\Azure\CLI2\lib\json\encoder.py", line 437, in _iterencode
    o = _default(o)
  File "C:\Users\VSSADM~1\AppData\Local\Temp\pip-install-e5ce5s7r\knack\knack\output.py", line 31, in default
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x80 in position 128: invalid start byte

williexu commented 5 years ago

@Andrei15193 I was able to repro your error.

@zezha-msft the binary info returned using the script above can't be decoded using utf-8. From what I've tried, some base64 strings can be decoded properly, while others cannot. Is this behavior expected and if so, how should the data be consumed?
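The "some base64 strings can be decoded properly, while others cannot" behaviour can be reproduced in plain Python: once the Base64 string is decoded to raw bytes, a subsequent UTF-8 decode only succeeds when every byte happens to form valid UTF-8. That is why the short 0..3 payload survives while the full 0..255 range crashes on byte 0x80:

```python
import base64

# Bytes 0..3 are all in the ASCII range, so they survive a UTF-8 decode.
small = bytes(range(4))
assert base64.b64decode(base64.b64encode(small)) == small
small.decode("utf-8")  # succeeds

# The full 0..255 range contains 0x80, an invalid UTF-8 start byte.
full = bytes(range(256))
try:
    full.decode("utf-8")
except UnicodeDecodeError as error:
    print(error)  # 'utf-8' codec can't decode byte 0x80 in position 128: ...
```

This matches the traceback above exactly: 0x80 sits at index 128 of the 0..255 sequence.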

zezha-msft commented 5 years ago

@williexu I do not maintain the table SDK.

Adding @srinathnarayanan to help you with this issue.

williexu commented 5 years ago

@srinathnarayanan ping

Andrei15193 commented 5 years ago

Hey @williexu, @srinathnarayanan, are there any updates on this?

Andrei15193 commented 5 years ago

@williexu looks like there's not much focus on this, I figure you guys are busy with other stuff as well, it's quite an extensive SDK. Is there a guideline for contributing? I may be able to find the source of the issue and fix it, but for that I need a confirmation whether the result for binary data should indeed be a Base64 string, if not then what should be expected?

williexu commented 5 years ago

@Andrei15193, since the entity insert for a base64 string succeeds, I think the response should indeed also be a base64 string as well.

The table sdk is not giving back data in the expected format, and the CLI is unable to decode it properly. I've tried to get ahold of the sdk owner @srinathnarayanan but I believe he is on vacation at the moment. If you are able to figure out the source of the issue and open a PR on the right repo, that would be great.

Andrei15193 commented 5 years ago

@williexu I have done some digging and found the source(s) of the bug. There are multiple places where we could intervene to fix it.

Starting from the server response, the request goes through azure-multiapi-storage-python: an entity is deserialized at _deserialization.py:117, the property types from OData are applied at _deserialization.py:142, and a property converter is looked up at _deserialization.py:193. For binary and encrypted properties the converter is defined at _deserialization.py:69, which ultimately leads to _common_conversion.py:87, where Base64-encoded strings are explicitly converted to byte strings containing the actual raw binary data. This is one place where I could make a fix, but it might affect other areas where the library is used.
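The conversion traced above boils down to the following sketch (the function name here is illustrative, not the SDK's actual API): the deserializer turns the Base64 string from the OData payload into raw bytes, and it is those raw bytes that later break the UTF-8-based JSON output path.

```python
import base64

def decode_binary_property(odata_value: str) -> bytes:
    # The SDK converts the Base64 string from the OData payload into
    # raw bytes; non-ASCII bytes then cannot be re-decoded as UTF-8.
    return base64.b64decode(odata_value)

raw = decode_binary_property("AAECgA==")  # Base64 of bytes 00 01 02 80
print(raw)  # b'\x00\x01\x02\x80'
```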

A second place a fix could be made is in knack: when an object is written to the output, it ultimately reaches output.py:113. In the case of the CLI the default JSON formatter is used; byte strings are treated separately at output.py:30, and judging by the existing tests (test_output.py:61) the byte string is expected to be Base64-encoded rather than contain the raw binary data. A change here may affect other tools or libraries.
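A fix at this layer could look like the following sketch (a plain json.JSONEncoder default hook; the function name is illustrative and this is not knack's actual code): instead of attempting a UTF-8 decode, re-encode any raw bytes as a Base64 string before JSON serialization.

```python
import base64
import json

def bytes_safe_default(obj):
    # Re-encode raw bytes as a Base64 string instead of assuming UTF-8,
    # so arbitrary binary property values survive JSON serialization.
    if isinstance(obj, (bytes, bytearray)):
        return base64.b64encode(bytes(obj)).decode("ascii")
    raise TypeError(f"Object of type {type(obj).__name__} is not JSON serializable")

payload = {"BinaryProperty": bytes(range(256))}
print(json.dumps(payload, default=bytes_safe_default)[:40])
```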

A third place where I could make a fix is in the CLI itself: we could override the JSON formatters at _output.py:12, but this would be mostly a copy/paste solution (copy from knack and change the bit that causes the bug, i.e. encode back to a Base64 string).

There might be other possible solutions, but I think these illustrate the nature of the issue. From the first and second findings it looks like this is intended behaviour, and there's a discrepancy between what is expected and what is actually provided.

Where do you think the fix should be done?

williexu commented 5 years ago

@Andrei15193 can you do this fix through a transformer? i.e. example. This will transform the result from the sdk before it arrives at the knack out() method, and will not have any effects on the rest of the commands.
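A transformer along those lines might look like this sketch (the entity/result shape and function name here are assumptions for illustration, not the CLI's verified API): it re-encodes any raw-bytes property values as Base64 strings before the result reaches knack's out() method, leaving all other commands untouched.

```python
import base64

def transform_entity_result(result):
    # Re-encode raw-bytes property values as Base64 strings so the
    # default JSON formatter never sees undecodable binary data.
    def encode(entity):
        return {
            key: base64.b64encode(bytes(value)).decode("ascii")
            if isinstance(value, (bytes, bytearray)) else value
            for key, value in entity.items()
        }
    if isinstance(result, dict) and "items" in result:
        return {**result, "items": [encode(entity) for entity in result["items"]]}
    return result
```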

Andrei15193 commented 5 years ago

@williexu yep, I think that's possible. I'll have to check the --query option as well; I'm not sure when it is applied. If it runs after the transformer we're good, otherwise the issue may still occur if the transformer changes the structure of the response. I'll check and let you know.