[C++] Filesystem implementation for Azure Blob Storage

asfimport commented 6 years ago

Subissues:

GH-35903
GH-36886
GH-37511
GH-38330
GH-38333
GH-38335
GH-38597
GH-38598
- GH-39262
- GH-39318
- GH-39320
- GH-39343
- GH-39344
- GH-39345
- GH-39449
GH-38699
GH-38700
GH-38701
GH-38702
GH-38703
GH-40074
GH-38704
GH-38705
GH-38758
GH-38772
GH-38999
GH-39069
- GH-41034
GH-39119
GH-39292
GH-39297
GH-40025
GH-40026
GH-40028
GH-40035
GH-40036
GH-40037
GH-40052
GH-40057
...

Reporter: Wes McKinney / @wesm Assignee: Shefali Singh

Related issues:

[C++] Filesystem implementation for Azure Data Lake (is cloned by)
[R] Expose Azure Blob Storage filesystem (is depended upon by)
Original Issue Attachments:
azfs.cc
azfs.h
PRs and other links:
GitHub Pull Request #12914

_{Note: This issue was originally created as ARROW-2034. Please see the migration documentation for further details.}

asfimport commented 4 years ago

Wes McKinney / @wesm: I see that TileDB (MIT license) has built a C++ wrapper for Azure

https://github.com/TileDB-Inc/TileDB/blob/dev/tiledb/sm/filesystem/azure.cc

No this is not moved to fsspec, this is a C++ ticket

asfimport commented 3 years ago

Antoine Pitrou / @pitrou: I'm not sure I understand the relationship between Blob Store and Data Lake. Is Data Lake a higher-level layer above Blob Store? Or are they two different services that would need separate filesystem implementations?

asfimport commented 3 years ago

Antoine Pitrou / @pitrou: According to https://docs.microsoft.com/en-us/azure/storage/blobs/data-lake-storage-introduction ,

Data Lake Storage Gen2 converges the capabilities of Azure Data Lake Storage Gen1 with Azure Blob storage. For example, Data Lake Storage Gen2 provides file system semantics, file-level security, and scale. Because these capabilities are built on Blob storage, you'll also get low-cost, tiered storage, with high availability/disaster recovery capabilities. I'm not sure this means the same C++ API can be used to access both, though.

asfimport commented 3 years ago

Uwe Korn / @xhochy: It is as confusing in reality, here is what they all are (I'm though already 1 year outdated on this):

Blob Store: Like S3, simple but limited API
Data Lake Gen 1: HDFS-like deployment with different but more user-friendly API / attributes
Data Lake Gen 2: Some improvements were made to the Blob Store so that there is no need for a special (more expensive) Data Lake service anymore, everything is now on the Blob Store. A new set of APIs was though released that exposes some nice features that the initial Blob Store API didn't have, probably for marketing purposes this was named Data Lake Gen 2 although technically Blob Store Gen 2 would have been more appropriate.

asfimport commented 3 years ago

Antoine Pitrou / @pitrou: Ha, and here are some interesting resources:

https://docs.microsoft.com/en-us/azure/storage/blobs/data-lake-storage-known-issues : "Blob APIs and Data Lake Storage Gen2 APIs can operate on the same data." but "You cannot use blob API and Data Lake Storage APIs to write to the same instance of a file" (sounds ok?)
https://docs.microsoft.com/en-us/rest/api/storageservices/understanding-block-blobs--append-blobs--and-page-blobs : three different kinds of blobs. "Block blobs" sound most useful, though "append blobs" may work too.
https://docs.microsoft.com/en-us/azure/storage/blobs/data-lake-storage-namespace : a Data Lake admin can configure a it with a hierarchical filesystem semantics. Apparently it may enable different APIs (??).

asfimport commented 3 years ago

Yesh: There is also https://github.com/Azure/azure-sdk-for-cpp which I’ve tested against adl gen2 .

asfimport commented 3 years ago

Antoine Pitrou / @pitrou: Various pointers about the Azure C++ SDK:

https://devblogs.microsoft.com/azure-sdk/cppintro/#example-code-using-the-c-storage-blob-client-libraryhttps://devblogs.microsoft.com/azure-sdk/cppintro/

https://github.com/Azure/azure-sdk-for-cpp

https://github.com/Azure/azure-sdk-for-cpp/blob/master/sdk/storage/azure-storage-blobs/sample/blob_getting_started.cpp

https://azure.github.io/azure-sdk-for-cpp/ (note the API docs are segregated per component, with separate land pages for Core, Storage, Blog Storage, etc.)

Note that the SDK requires C++14.