apache / arrow

Apache Arrow is the universal columnar format and multi-language toolbox for fast data interchange and in-memory analytics
https://arrow.apache.org/
Apache License 2.0

[C++] Filesystem implementation for Azure Blob Storage #18014

Closed (asfimport closed this issue 7 months ago)

asfimport commented 6 years ago

Subissues:


Reporter: Wes McKinney / @wesm Assignee: Shefali Singh

Related issues:

Note: This issue was originally created as ARROW-2034. Please see the migration documentation for further details.

asfimport commented 4 years ago

Wes McKinney / @wesm: I see that TileDB (MIT license) has built a C++ wrapper for Azure

https://github.com/TileDB-Inc/TileDB/blob/dev/tiledb/sm/filesystem/azure.cc

No, this is not moved to fsspec; this is a C++ ticket.

asfimport commented 3 years ago

Antoine Pitrou / @pitrou: I'm not sure I understand the relationship between Blob Store and Data Lake. Is Data Lake a higher-level layer above Blob Store? Or are they two different services that would need separate filesystem implementations?

asfimport commented 3 years ago

Antoine Pitrou / @pitrou: According to https://docs.microsoft.com/en-us/azure/storage/blobs/data-lake-storage-introduction ,

Data Lake Storage Gen2 converges the capabilities of Azure Data Lake Storage Gen1 with Azure Blob storage. For example, Data Lake Storage Gen2 provides file system semantics, file-level security, and scale. Because these capabilities are built on Blob storage, you'll also get low-cost, tiered storage, with high availability/disaster recovery capabilities.

I'm not sure this means the same C++ API can be used to access both, though.

asfimport commented 3 years ago

Uwe Korn / @xhochy: It is just as confusing in reality. Here is what they all are (though I'm already a year out of date on this):

asfimport commented 3 years ago

Antoine Pitrou / @pitrou: Ha, and here are some interesting resources:

asfimport commented 3 years ago

Yesh: There is also https://github.com/Azure/azure-sdk-for-cpp, which I've tested against ADLS Gen2.

asfimport commented 3 years ago

Antoine Pitrou / @pitrou: Various pointers about the Azure C++ SDK:

https://devblogs.microsoft.com/azure-sdk/cppintro/

https://devblogs.microsoft.com/azure-sdk/cppintro/#example-code-using-the-c-storage-blob-client-library

https://github.com/Azure/azure-sdk-for-cpp

https://github.com/Azure/azure-sdk-for-cpp/blob/master/sdk/storage/azure-storage-blobs/sample/blob_getting_started.cpp

https://azure.github.io/azure-sdk-for-cpp/ (note the API docs are segregated per component, with separate landing pages for Core, Storage, Blob Storage, etc.)

Note that the SDK requires C++14.
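
For illustration, here is a minimal sketch of what the C++ Storage Blobs client looks like, in the spirit of the blob_getting_started.cpp sample linked above (the connection string, container name, and blob name are placeholders, and the API may have shifted since this was written):

```cpp
#include <azure/storage/blobs.hpp>

#include <iostream>
#include <string>

int main() {
  using namespace Azure::Storage::Blobs;

  // Placeholders -- substitute a real connection string and names.
  const std::string connection_string = "<storage-account-connection-string>";
  const std::string container_name = "sample-container";
  const std::string blob_name = "hello.txt";

  // Client for a single container; create the container if it is missing.
  auto container = BlobContainerClient::CreateFromConnectionString(
      connection_string, container_name);
  container.CreateIfNotExists();

  // Upload a small blob, then read its size back from its properties.
  BlockBlobClient blob = container.GetBlockBlobClient(blob_name);
  const std::string content = "Hello, Azure Blob Storage!";
  blob.UploadFrom(reinterpret_cast<const uint8_t*>(content.data()),
                  content.size());

  std::cout << "Uploaded blob size: " << blob.GetProperties().Value.BlobSize
            << std::endl;
  return 0;
}
```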

 

asfimport commented 2 years ago

Tom Augspurger / @TomAugspurger: Does Arrow support C++14 features now (or, more specifically, is the SDK being C++14 a problem)? From https://issues.apache.org/jira/browse/ARROW-13744 it seems like C++14 is at least tested, but https://github.com/apache/arrow/blame/master/docs/source/developers/cpp/building.rst#L40 says "A C++11-enabled compiler" is required.

asfimport commented 2 years ago

Neal Richardson / @nealrichardson: I'm not an expert here, but I think we could use C++14 if required and if the compiler supports it. If the compiler doesn't support C++14, we wouldn't be able to build the Azure SDK. So the line would be that Arrow requires C++11 at a minimum, and some features are only available with C++14.

asfimport commented 2 years ago

Antoine Pitrou / @pitrou: We require only C++11 in the codebase. We might add C++14-requiring optional components if desired, but that will add complication to the build setup.

asfimport commented 2 years ago

Shashanka Balakuntala Srinivasa: Hi @pitrou, we were looking into implementing this feature from our side. I did try to compile the whole Arrow code base with C++14 and ran the unit tests as well. Everything is passing locally, and as mentioned before, the [ARROW-13744] "[CI] C++14 and 17 nightly job fails" ASF JIRA ticket shows we have daily build runs for validation, which are passing.

Since the Azure SDKs depend on C++14 features, and since we have the code compiling with C++14, can we look into upgrading the C++ version to 14? Let me know if there are any issues; I will be happy to take those and work on them.

asfimport commented 2 years ago

Antoine Pitrou / @pitrou: [~balakuntala] We can probably require C++14 for the Azure filesystem implementation only, but the rest of Arrow should remain C++11-compatible.

asfimport commented 2 years ago

Dean MacGregor: If someone wants to work on this but doesn't have an Azure account, let me know. I can make a storage account for this development/testing.

asfimport commented 2 years ago

Bipin Mathew: I have a rudimentary implementation of this that supports import and export to Azure. I tried to align as closely as possible with the S3 implementation; however, I needed it only for a specific use case and have not had a chance to implement all the same methods. Depending on bandwidth, I could probably build out the implementation further. I have never contributed to this project. Do all endpoints need to be implemented before it can be merged into the code base? Can it be built out over time? I attach what I have so far here.

azfs.h azfs.cc

asfimport commented 2 years ago

Antoine Pitrou / @pitrou: [~mathewb001] Well, there's a GitHub PR open already; did you take a look?

asfimport commented 2 years ago

Bipin Mathew: Oh this is much further along than what I have offered. Looking forward to its release so I can cut over to it.

av8or1 commented 1 year ago

We too are in need of this feature. Any word as to whether it will be included in the version 11 release? Seems like the deadline is kinda tight at this point.

pitrou commented 1 year ago

@h-vetinari Do you know if there's a conda-forge package for the Azure Blob Storage C++ library? I couldn't find one.

h-vetinari commented 1 year ago

@h-vetinari Do you know if there's a conda-forge package for the Azure Blob Storage C++ library? I couldn't find one.

I'm not aware of any either, but I don't know. What would be the sources of that C++ lib? If it's open source we could bring it to conda-forge eventually of course.

pitrou commented 1 year ago

@h-vetinari I think it's this: https://github.com/Azure/azure-sdk-for-cpp/tree/main

Tom-Newton commented 1 year ago

I think we're ready to start implementing the filesystem itself. Looking at how GCS was done, the next part was an implementation of arrow::io::InputStream and OpenInputStream.

However, it looks like arrow::io::RandomAccessFile is a superset of arrow::io::InputStream, so I think it makes sense to just implement arrow::io::RandomAccessFile. This is what https://github.com/apache/arrow/pull/12914 did.

I would propose we move forward with implementing arrow::io::RandomAccessFile, OpenInputStream and OpenInputFile as in https://github.com/apache/arrow/pull/12914. One change I would suggest is to avoid depending on whether the storage account has hierarchical namespace enabled. Hierarchical namespace is important for listing and renames if you want to make them faster but for blob reads I don't think it should matter, and it adds complexity.
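
For reference, a short sketch of why implementing only arrow::io::RandomAccessFile covers both entry points, shown against the existing LocalFileSystem since the Azure implementation doesn't exist yet (the path and read sizes are arbitrary):

```cpp
#include <arrow/buffer.h>
#include <arrow/filesystem/localfs.h>
#include <arrow/io/interfaces.h>
#include <arrow/result.h>
#include <arrow/status.h>

#include <memory>
#include <string>

// Sketch only: LocalFileSystem stands in for the future AzureFileSystem.
arrow::Status DemoRandomAccessIsAlsoAStream(const std::string& path) {
  auto fs = std::make_shared<arrow::fs::LocalFileSystem>();

  // OpenInputFile() yields a RandomAccessFile: seekable and ReadAt()-capable.
  ARROW_ASSIGN_OR_RAISE(std::shared_ptr<arrow::io::RandomAccessFile> file,
                        fs->OpenInputFile(path));
  ARROW_ASSIGN_OR_RAISE(auto header, file->ReadAt(/*position=*/0, /*nbytes=*/16));

  // RandomAccessFile derives from InputStream, so the same object also
  // satisfies callers that only need sequential reads (OpenInputStream).
  std::shared_ptr<arrow::io::InputStream> stream = file;
  ARROW_ASSIGN_OR_RAISE(auto next_chunk, stream->Read(/*nbytes=*/1024));

  return arrow::Status::OK();
}
```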

kou commented 1 year ago

It makes sense. Could you open a new issue for it and work on it? @srilman may want to help with this.

felipecrv commented 1 year ago

@Tom-Newton is there any ticket here for which you don't have work-in-progress code? I could work on them in parallel with you.

Tom-Newton commented 1 year ago

I have not started any of https://github.com/apache/arrow/issues/38330, https://github.com/apache/arrow/issues/38333 or https://github.com/apache/arrow/issues/38335 yet. I plan to start working on one of them relatively soon though.

felipecrv commented 1 year ago

I have not started any of #38330, #38333 or #38335 yet. I plan to start working on one of them relatively soon though.

Thank you! If I start working on one of them I will assign the issue to myself to let you know I'm on it.

mavam commented 1 year ago

@Tom-Newton +1 from an outside user for this feature. Our users are actively asking for it. Happy to help testing.

kou commented 1 year ago

I've added a subissue list to this description.

av8or1 commented 11 months ago

Hi, I'm looking to contribute. This would be my first foray into the process, however, so could Tom or others recommend a fairly straightforward task? I was considering #38703 ... would that be a good candidate?

Tom-Newton commented 11 months ago

I was considering #38703 ... would that be a good candidate?

I think that would be a good one to start with. Things tend to get more complicated when they deal with directories. You can assign yourself by commenting "take" on the GitHub issue.

It will be great to have more contributors working on this 🙂

Tom-Newton commented 11 months ago

Does anyone have any thoughts on starting Python bindings for this soon? I know the filesystem is not fully implemented, but a lot of it is, and I think it's enough to already be very useful. For example, my use case actually only needs random access file reads and default credential auth.

I would probably suggest adding one or two more auth methods so we know how the configuration is going to look; then I think we could create the Python bindings.
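
To make the configuration question concrete, here is a rough C++ sketch of how constructing the filesystem with different credentials could look. The AzureOptions member and method names below are my reading of the in-progress code and may not match what finally ships:

```cpp
#include <arrow/filesystem/azurefs.h>
#include <arrow/result.h>
#include <arrow/status.h>

#include <memory>
#include <string>

// Sketch only: option and method names are assumptions based on the open PRs.
arrow::Result<std::shared_ptr<arrow::fs::AzureFileSystem>> MakeAzureFs(
    const std::string& account_name, const std::string& account_key) {
  arrow::fs::AzureOptions options;
  options.account_name = account_name;

  if (!account_key.empty()) {
    // Shared-key auth, e.g. for local testing against Azurite.
    ARROW_RETURN_NOT_OK(options.ConfigureAccountKeyCredential(account_key));
  } else {
    // DefaultAzureCredential: environment variables, managed identity, etc.
    ARROW_RETURN_NOT_OK(options.ConfigureDefaultCredential());
  }

  return arrow::fs::AzureFileSystem::Make(options);
}
```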

nosterlu commented 11 months ago

I can help out with testing the Python bindings! Looking forward to seeing how this improves read speeds compared to adlfs and pyarrow.

I use Azure storage with either a token (created with adlfs and web authentication) or a connection string directly, so one of those two ways to connect would be necessary for me to help with testing.

Mostly I connect to a hive-partitioned Parquet dataset, but also to individual files.

kou commented 11 months ago

Could you open a new issue for the Python bindings and work on it? We don't need to mark the new issue as a sub-issue of this one because it's a separate task. (This issue focuses on the C++ implementation.)

raulcd commented 10 months ago

I am moving this umbrella issue to 16.0.0

felipecrv commented 10 months ago

I am moving this umbrella issue to 16.0.0

That's fine. Thanks.

av8or1 commented 8 months ago

kou/felipecrv - How close is this to being done? I see a few green (still open) items in the list above, but they seem to be completed already (at least the ones I looked at were). Is there anything else we can work on? Will this make it into version 16.0.0? Thanks

felipecrv commented 8 months ago

@av8or1 Most of it works now. s3fs doesn't even support Move, while AzureFileSystem supports Move on accounts with HNS enabled. You don't have to wait for this issue to be closed to start using what's already merged and ready to be part of the 16.0.0 release.

av8or1 commented 8 months ago

Hi felipe- Thanks. Well, the company can't utilize any library that isn't an official release. I suppose that I could begin writing the code that will utilize the ADLS stuff now, then when 16.0.0 is released (April?), I would be able to produce our product shortly thereafter. What remains to be completed, by the way? Anything I could help with? Thanks

felipecrv commented 8 months ago

@av8or1 As I said above: Move with the Blobs API (not a critical feature at all; s3fs doesn't even support Move), Python bindings (PR is open), and URI parsing (PR is open). Are any of these a dealbreaker for you? Everything else will be available in 16.0.0.

av8or1 commented 7 months ago

@felipecrv OK thank you. Work has been busy. Just now looking at this again. It appears that @kou has completed the URI parsing business (#40028). Thus I will prepare on my end to use the library when it is released. Hopefully in April. Thanks
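
As an aside, once the URI parsing from #40028 ships, opening a file should reduce to something like the sketch below. The abfss:// URI layout and the account/container/path are my assumptions for illustration:

```cpp
#include <arrow/filesystem/filesystem.h>
#include <arrow/io/interfaces.h>
#include <arrow/result.h>
#include <arrow/status.h>

#include <iostream>
#include <string>

arrow::Status ReadBlobViaUri() {
  // Placeholder URI; the scheme and layout follow the URI-parsing work in #40028.
  const std::string uri =
      "abfss://mycontainer@myaccount.dfs.core.windows.net/data/file.parquet";

  std::string path;
  ARROW_ASSIGN_OR_RAISE(auto fs, arrow::fs::FileSystemFromUri(uri, &path));
  ARROW_ASSIGN_OR_RAISE(auto file, fs->OpenInputFile(path));
  ARROW_ASSIGN_OR_RAISE(int64_t size, file->GetSize());
  std::cout << path << " is " << size << " bytes" << std::endl;
  return arrow::Status::OK();
}
```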

wirable23 commented 7 months ago

Everything else will be available in 16.0.0.

@felipecrv do you know when 16.0.0 would be available?

felipecrv commented 7 months ago

Everything else will be available in 16.0.0.

@felipecrv do you know when 16.0.0 would be available?

In April. The time it takes for the release also depends on how smoothly the packaging and publishing process goes.

Tom-Newton commented 7 months ago

I think ideally https://github.com/apache/arrow/issues/40036 would be taken care of before the 16.0.0 release. I don't have any real-world performance numbers, but I suspect write performance is currently a bit disappointing.

raulcd commented 7 months ago

Hi @Tom-Newton, there's no planned release before 16.0.0. The feature freeze for 16.0.0 is planned for the 8th of April.

felipecrv commented 7 months ago

I opened a MINOR PR expanding some of the docstrings: #40838

raulcd commented 7 months ago

There are some subtasks still open. @felipecrv should I tag this umbrella issue as 17.0.0?

felipecrv commented 7 months ago

There are some subtasks still open. @felipecrv should I tag this umbrella issue as 17.0.0?

AzureFileSystem is already usable and feature-complete even without these open issues being fixed. I'm in favor of closing this (marking it as complete) and making it part of the 16.0.0 release in the logs.

kou commented 7 months ago

Let's close this as complete. We don't need an umbrella issue for AzureFileSystem; we can just use separate issues like we do for other components.

kou commented 7 months ago

Note that the current AzureFileSystem's CopyFile() doesn't work on accounts with hierarchical namespace support enabled. See also: #41095

Some other parts of the AzureFileSystem implementation for hierarchical namespace support also have problems: #41034. I want to get the fix for this (#41068) into 16.0.0.