Azure / azure-storage-cpp

Microsoft Azure Storage Client Library for C++
http://azure.github.io/azure-storage-cpp
Apache License 2.0

Question: blob name and container name max length limitation #291

Open yxiang92128 opened 5 years ago

yxiang92128 commented 5 years ago

From this doc: https://docs.microsoft.com/en-us/rest/api/storageservices/Naming-and-Referencing-Containers--Blobs--and-Metadata?redirectedfrom=MSDN

"A blob name must be at least one character long and cannot be more than 1,024 characters long, for blobs in Azure Storage."

Can you clarify what "characters" means here? It seems we can create a blob whose name is 1024 wide characters (e.g. kanji or emoji), but when we then use the SDK to access that blob with the name specified as 1024 wide characters, it throws an exception:

with http code=<400>

We wonder whether this is a limitation on the SDK side.

Thanks,
Yang

Jinming-Hu commented 5 years ago

Hi Yang,

The azure-storage-cpp SDK itself doesn't have any limitation on the length of blob name. It just encodes it and sends it out in HTTP request.

I ran some tests on the issue you mentioned. (The following conclusions are mostly speculation; they may be wrong, and the behavior may change in the future.)

The emoji character 😊 takes 4 bytes in UTF-16 (two 2-byte units) and 4 bytes in UTF-8. As far as I've tested, a blob name can hold at most 512 😊. That is 1024 wide chars. Any more than that results in a 400 Bad Request.

The Chinese character 阿 takes 2 bytes in UTF-16 and 3 bytes in UTF-8. A blob name can hold at most 1024 阿. That is also 1024 wide chars.

So I guess that in the doc you quoted, a "character" means one 2-byte UTF-16 code unit.
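To make that guess concrete, here is a minimal sketch of how a Linux client holding a UTF-8 name could count the UTF-16 units it would occupy. The function name `utf16_code_units` is hypothetical (it is not part of azure-storage-cpp), and the sketch assumes the input is well-formed UTF-8.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper, not part of the SDK.
// Counts the UTF-16 code units needed to represent a UTF-8 string.
// Assumes well-formed UTF-8; no validation is performed.
std::size_t utf16_code_units(const std::string& utf8) {
    std::size_t units = 0;
    for (unsigned char byte : utf8) {
        if ((byte & 0xC0) == 0x80) {
            continue;  // continuation byte: part of the previous code point
        }
        // A 4-byte UTF-8 lead byte (0xF0 and up) encodes a code point above
        // U+FFFF, which needs a surrogate pair (2 code units) in UTF-16.
        units += (byte >= 0xF0) ? 2 : 1;
    }
    return units;
}
```

With this, `"\xF0\x9F\x98\x8A"` (😊) counts as 2 units and `"\xE9\x98\xBF"` (阿) as 1, which matches the 512 and 1024 maxima observed above.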

Since you are working on Linux, where the default encoding is usually UTF-8, the length of a string (in bytes or in characters) is usually not the same as its length in UTF-16. Plus, the blob name always appears in the URL. Any non-ASCII characters in it are percent-encoded, which makes the URL even longer. Some browsers may not be able to handle the URL if it exceeds some limit.
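As a rough illustration of how much longer the URL gets, the sketch below estimates the percent-encoded length of a UTF-8 name. It assumes only the RFC 3986 unreserved characters are left unescaped; the SDK's actual encoder may leave more characters (for example `/`) as-is, so treat the result as an upper-bound estimate. Both function names are hypothetical.

```cpp
#include <cstddef>
#include <string>

// Assumption: only RFC 3986 "unreserved" characters stay unescaped.
bool is_unreserved(unsigned char c) {
    return (c >= 'A' && c <= 'Z') || (c >= 'a' && c <= 'z') ||
           (c >= '0' && c <= '9') ||
           c == '-' || c == '.' || c == '_' || c == '~';
}

// Hypothetical helper: length of the name once every other byte
// is written as "%XX" (3 characters per byte).
std::size_t percent_encoded_length(const std::string& utf8) {
    std::size_t length = 0;
    for (unsigned char byte : utf8) {
        length += is_unreserved(byte) ? 1 : 3;
    }
    return length;
}
```

Under that assumption a single 阿 (3 UTF-8 bytes) becomes 9 URL characters, so a name of 1024 阿 would occupy 9216 characters of the request URL.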

yxiang92128 commented 5 years ago

That actually makes perfect sense. Thanks.

yxiang92128 commented 5 years ago

@JinmingHu-MSFT There seems to be a subtle issue. See the code segment below:

int j;
objname[0] = 'A';
objname[1] = 'A';
objname[2] = 'A';
objname[3] = 'A';
for (j=4; j < 1059; j=j+4) {
    objname[j] = 0xf0;
    objname[j+1] = 0x9f;
    objname[j+2] = 0x98;
    objname[j+3] = 0x8a;
    objname[j+4] = '\0';
}
objname[j] = 'A';
objname[j+1] = '\0';

azure::storage::cloud_block_blob block_blob = container.get_block_blob_reference(objname);

...
block_blob.upload_from_stream(m_istream, m_len, azure::storage::access_condition(), reqOptions,
    azure::storage::operation_context());

The code above actually worked and created a blob with a long objname: four letter "A"s in front, 264 smiley-face emoji in between, and an "A" at the end. It went beyond the 1024 wchar limit! I verified the blob name with the az CLI. If I replace the emoji with the letter "A", it throws an "out of range" exception and no blob is created.

Also, if I use the az CLI with a long name (emoji, ASCII, or kanji) of more than 1024 wchars, it generates the OutOfRange error as expected:

One of the request inputs is out of range.
ErrorCode: OutOfRangeInput
<?xml version="1.0" encoding="utf-8"?>OutOfRangeInputOne of the request inputs is out of range. RequestId:d466f9b6-a01e-012b-576b-6ed45e000000

Any thoughts?

I can't figure out why.

Jinming-Hu commented 5 years ago

@yxiang92128 "AAAA" + 264 smiley-face emoji + "A" is 533 wchars (4 + 264 × 2 + 1), which isn't beyond the 1024 limit. So it works as expected.
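The count can be double-checked by building the same name twice, once as UTF-8 bytes and once as UTF-16 code units, and comparing sizes. This is only a sketch; the two function names are made up for illustration and the code does not touch the SDK.

```cpp
#include <cstddef>
#include <string>

// The name from the snippet above as UTF-8 bytes:
// "AAAA" + 264 x U+1F60A + "A".
std::string build_name_utf8() {
    std::string name = "AAAA";
    for (int i = 0; i < 264; ++i) {
        name += "\xF0\x9F\x98\x8A";  // U+1F60A, 4 bytes in UTF-8
    }
    name += 'A';
    return name;
}

// The same name as UTF-16; size() counts 16-bit code units.
std::u16string build_name_utf16() {
    std::u16string name = u"AAAA";
    for (int i = 0; i < 264; ++i) {
        name += u"\U0001F60A";  // stored as a surrogate pair (2 units)
    }
    name += u'A';
    return name;
}
```

`build_name_utf8().size()` is 1061 bytes, while `build_name_utf16().size()` is 533 code units, so the name looks longer than 1024 only when measured in UTF-8 bytes.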

What do you mean by "replace the emoji with letter A"? Since an emoji is 4 bytes in UTF-8, do you mean replacing one emoji with four As or with only one A?

yxiang92128 commented 5 years ago

@JinmingHu-MSFT What I meant is that the following objname throws an OutOfRange exception, and I wonder why. It is 1025 bytes in UTF-8, which I would assume is still below the 1024 UTF-16 max:

char objname[PATH_MAX];
for (j = 0; j < 1025; j++)
    objname[j] = 'A';
objname[j] = '\0';

I might be doing something wrong here but I couldn't figure out why yet.

Jinming-Hu commented 5 years ago

@yxiang92128 The character A takes 1 byte in UTF-8, but it takes 2 bytes in UTF-16. Every character takes at least two bytes in UTF-16. It may sound weird to you, because for ASCII chars 1 byte is enough and the other byte is wasted. But this is just the way it is.

In your situation, objname is 1025 As. Encoded in UTF-16, it's still 1025 characters, but it takes 2050 bytes. That is beyond the 1024-character limit, which counts UTF-16 characters, not bytes.
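A client could catch this case before sending the request with a check along the following lines. This is a sketch only: `kMaxBlobNameUnits` and `fits_blob_name_limit` are hypothetical names, the SDK performs no such check itself, and the reading of the limit as 1024 UTF-16 code units is the speculation from earlier in this thread rather than documented behavior.

```cpp
#include <cstddef>
#include <string>

// Assumption from this thread: the service limit is 1024 UTF-16 code units.
constexpr std::size_t kMaxBlobNameUnits = 1024;

// Hypothetical pre-flight check for a UTF-8 blob name.
// Assumes well-formed UTF-8; no validation is performed.
bool fits_blob_name_limit(const std::string& utf8_name) {
    std::size_t units = 0;
    for (unsigned char byte : utf8_name) {
        if ((byte & 0xC0) == 0x80) {
            continue;  // UTF-8 continuation byte, already counted
        }
        units += (byte >= 0xF0) ? 2 : 1;  // 4-byte sequence => surrogate pair
    }
    // The documentation also requires at least one character.
    return units >= 1 && units <= kMaxBlobNameUnits;
}
```

For example, `fits_blob_name_limit(std::string(1025, 'A'))` is false even though every A is a single byte, while `std::string(1024, 'A')` passes.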

Jinming-Hu commented 4 years ago

We're going to close this issue because of inactivity. Feel free to reopen it if you have any further questions.