Wes McKinney / @wesm: I see that TileDB (MIT license) has built a C++ wrapper for Azure
https://github.com/TileDB-Inc/TileDB/blob/dev/tiledb/sm/filesystem/azure.cc
No, this is not moved to fsspec; this is a C++ ticket.
Antoine Pitrou / @pitrou: I'm not sure I understand the relationship between Blob Store and Data Lake. Is Data Lake a higher-level layer above Blob Store? Or are they two different services that would need separate filesystem implementations?
Antoine Pitrou / @pitrou: According to https://docs.microsoft.com/en-us/azure/storage/blobs/data-lake-storage-introduction ,
"Data Lake Storage Gen2 converges the capabilities of Azure Data Lake Storage Gen1 with Azure Blob storage. For example, Data Lake Storage Gen2 provides file system semantics, file-level security, and scale. Because these capabilities are built on Blob storage, you'll also get low-cost, tiered storage, with high availability/disaster recovery capabilities."
I'm not sure this means the same C++ API can be used to access both, though.
Uwe Korn / @xhochy: It is just as confusing in reality. Here is what they all are (though I'm already a year out of date on this):
Antoine Pitrou / @pitrou: Ha, and here are some interesting resources:
https://docs.microsoft.com/en-us/azure/storage/blobs/data-lake-storage-namespace : a Data Lake admin can configure it with hierarchical filesystem semantics. Apparently it may enable different APIs (??).
Yesh: There is also https://github.com/Azure/azure-sdk-for-cpp, which I've tested against ADLS Gen2.
Antoine Pitrou / @pitrou: Various pointers about the Azure C++ SDK:
https://github.com/Azure/azure-sdk-for-cpp
https://azure.github.io/azure-sdk-for-cpp/ (note the API docs are segregated per component, with separate landing pages for Core, Storage, Blob Storage, etc.)
Note that the SDK requires C++14.
Tom Augspurger / @TomAugspurger: Does Arrow support C++14 features now (or, more specifically, is the SDK being C++14 a problem)? From https://issues.apache.org/jira/browse/ARROW-13744 it seems like C++14 is at least tested, but https://github.com/apache/arrow/blame/master/docs/source/developers/cpp/building.rst#L40 says "A C++11-enabled compiler" is required.
Neal Richardson / @nealrichardson: I'm not an expert here, but I think we could use C++14 if required and if the compiler supports it. If the compiler doesn't support C++14, we wouldn't be able to build the Azure SDK. So the line would be that Arrow requires C++11 at a minimum, and some features are only available with C++14.
Antoine Pitrou / @pitrou: We require only C++11 in the codebase. We might add C++14-requiring optional components if desired, but that will add complication to the build setup.
Shashanka Balakuntala Srinivasa: Hi @pitrou, we were looking into implementing this feature on our side. I tried compiling the whole Arrow code base with C++14 and ran the unit tests as well. Everything passes locally, and as mentioned before, the ARROW-13744 ("[CI] C++14 and 17 nightly job fails") ASF JIRA ticket mentions we have daily validation builds, which are passing.
Since the Azure SDKs depend on C++14 features, and since we have the code compiling with C++14, can we look into upgrading the C++ version to 14? Let me know if there are any issues; I will be happy to take those on and work on them.
Antoine Pitrou / @pitrou:
[~balakuntala]
We can probably require C++14 for the Azure filesystem implementation only, but the rest of Arrow should remain C++11-compatible.
Dean MacGregor: If someone wants to work on this but doesn't have an Azure account, let me know. I can make a storage account for this development/testing.
Bipin Mathew: I have a rudimentary implementation of this that supports import and export to Azure. I tried to align as closely as possible with the S3 implementation; however, I needed it only for a specific use case and have not had a chance to implement all the same methods. Depending on bandwidth, I could probably build out the implementation further. I have never contributed to this project. Do all endpoints need to be implemented before it can be merged into the code base? Can it be built out over time? I attach what I have so far here.
Antoine Pitrou / @pitrou:
[~mathewb001]
Well, there's a Github PR open already, did you take a look?
Bipin Mathew: Oh this is much further along than what I have offered. Looking forward to its release so I can cut over to it.
We too are in need of this feature. Any word as to whether it will be included in the version 11 release? Seems like the deadline is kinda tight at this point.
@h-vetinari Do you know if there's a conda-forge package for the Azure Blob Storage C++ library? I couldn't find one.
I'm not aware of any either, but I don't know. What would be the source of that C++ lib? If it's open source, we could bring it to conda-forge eventually, of course.
@h-vetinari I think it's this: https://github.com/Azure/azure-sdk-for-cpp/tree/main
I think we're ready to start implementing the filesystem itself. Looking at how GCS was done, the next part was an implementation of arrow::io::InputStream and OpenInputStream. However, it looks like arrow::io::RandomAccessFile is a superset of arrow::io::InputStream, so I think it makes sense to just implement arrow::io::RandomAccessFile. This is what https://github.com/apache/arrow/pull/12914 did.
I would propose we move forward with implementing arrow::io::RandomAccessFile, OpenInputStream and OpenInputFile as in https://github.com/apache/arrow/pull/12914. One change I would suggest is to avoid depending on whether the storage account has hierarchical namespace enabled. Hierarchical namespace is important for making listing and renames faster, but for blob reads I don't think it should matter, and it adds complexity.
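A minimal sketch of what such a RandomAccessFile subclass could look like is below; the class name and stubbed method bodies are illustrative only (the real code in arrow/filesystem/azurefs.cc wires these calls to the Azure Blobs SDK), but the overridden methods are the actual arrow::io::RandomAccessFile surface:

```cpp
// Illustrative only: a skeletal RandomAccessFile over a single Azure blob.
// "AzureBlobInputFile" is a hypothetical name; the stubs mark where ranged
// blob downloads and property lookups would go.
#include <memory>
#include <string>

#include "arrow/buffer.h"
#include "arrow/io/interfaces.h"
#include "arrow/result.h"
#include "arrow/status.h"

class AzureBlobInputFile : public arrow::io::RandomAccessFile {
 public:
  explicit AzureBlobInputFile(std::string blob_path)
      : blob_path_(std::move(blob_path)) {}

  arrow::Status Close() override {
    closed_ = true;
    return arrow::Status::OK();
  }
  bool closed() const override { return closed_; }

  arrow::Result<int64_t> Tell() const override { return position_; }
  arrow::Status Seek(int64_t position) override {
    position_ = position;
    return arrow::Status::OK();
  }

  // Would come from a "get blob properties" request; no hierarchical
  // namespace is needed for this.
  arrow::Result<int64_t> GetSize() override {
    return arrow::Status::NotImplemented("fetch blob size via the Azure SDK");
  }

  // Positional reads map to ranged blob downloads of [position, position + nbytes).
  arrow::Result<int64_t> ReadAt(int64_t position, int64_t nbytes, void* out) override {
    return arrow::Status::NotImplemented("ranged blob download via the Azure SDK");
  }
  arrow::Result<std::shared_ptr<arrow::Buffer>> ReadAt(int64_t position,
                                                       int64_t nbytes) override {
    return arrow::Status::NotImplemented("ranged blob download via the Azure SDK");
  }

  // Sequential reads are just positional reads at the current cursor.
  arrow::Result<int64_t> Read(int64_t nbytes, void* out) override {
    ARROW_ASSIGN_OR_RAISE(int64_t n, ReadAt(position_, nbytes, out));
    position_ += n;
    return n;
  }
  arrow::Result<std::shared_ptr<arrow::Buffer>> Read(int64_t nbytes) override {
    ARROW_ASSIGN_OR_RAISE(auto buffer, ReadAt(position_, nbytes));
    position_ += buffer->size();
    return buffer;
  }

 private:
  std::string blob_path_;
  int64_t position_ = 0;
  bool closed_ = false;
};
```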
It makes sense. Could you open a new issue for it and work on it? @srilman may want to help with this.
@Tom-Newton is there any ticket here for which you don't have work-in-progress code? I could work on them in parallel with you.
I have not started any of https://github.com/apache/arrow/issues/38330, https://github.com/apache/arrow/issues/38333 or https://github.com/apache/arrow/issues/38335 yet. I plan to start working on one of them relatively soon though.
Thank you! If I start working on one of them I will assign the issue to myself to let you know I'm on it.
@Tom-Newton +1 from an outside user for this feature. Our users are actively asking for it. Happy to help testing.
I've added a sub-issue list to this description.
Hi, I'm looking to contribute. This would be my first foray into the process, however, so could Tom or others recommend a fairly straightforward task? I was considering #38703 ... would that be a good candidate?
I think that would be a good one to start with. Things tend to get more complicated when they deal with directories. You can assign yourself by commenting "take" on the GitHub issue.
It will be great to have more contributors working on this 🙂
Does anyone have any thoughts on starting Python bindings for this soon? I know the filesystem is not fully implemented, but a lot of it is, and I think it's enough to already be very useful. For example, my use case actually only needs random access file reads and default credential auth.
I would probably suggest adding one or two more auth methods so we know how the configuration is going to look; then I think we could create the Python bindings.
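For illustration, the configuration could end up looking roughly like this; AzureOptions and its members below are assumptions modelled on how the S3/GCS options look, not a finalized API:

```cpp
// Rough shape only: the field and method names below are assumptions about the
// eventual AzureOptions API (modelled on S3Options/GcsOptions), not a spec.
#include <memory>

#include "arrow/filesystem/azurefs.h"  // assumed header for AzureOptions/AzureFileSystem
#include "arrow/result.h"
#include "arrow/status.h"

arrow::Result<std::shared_ptr<arrow::fs::AzureFileSystem>> MakeExampleFileSystem() {
  arrow::fs::AzureOptions options;
  options.account_name = "myaccount";  // assumed field
  // Option A: pick up credentials from the environment / managed identity.
  ARROW_RETURN_NOT_OK(options.ConfigureDefaultCredential());  // assumed method
  // Option B: authenticate with the storage account key instead.
  // ARROW_RETURN_NOT_OK(options.ConfigureAccountKeyCredential("<account-key>"));
  return arrow::fs::AzureFileSystem::Make(options);  // assumed factory
}
```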
I can help out testing the Python bindings! I'm looking forward to seeing how this improves read speeds compared with adlfs and pyarrow.
I use Azure storage with either a token (created with adlfs and web authentication) or a connection string directly, so one of those two ways to connect would be necessary for me to help with testing.
Mostly I connect to a Hive-partitioned Parquet dataset, but also to individual files.
Could you open a new issue for Python bindings and work on it? We don't need to mark the new issue as a sub-issue of this issue because it's a separate task. (This issue focuses on the C++ implementation.)
I am moving this umbrella issue to 16.0.0
That's fine. Thanks.
kou/felipecrv - How close is this to being done? I see a few green-colored items in the list above, but they seem to be completed already (at least the ones I looked at were). Is there anything else we can work on? Will this make it into version 16.0.0? Thanks
@av8or1 most of it works now. s3fs doesn't even support Move. AzureFileSystem supports Move on accounts with HNS enabled. You don't have to wait for this issue to be closed to start using what's already merged and ready to be part of the 16.0.0 release.
Hi felipe- Thanks. Well, the company can't utilize any library that isn't an official release. I suppose that I could begin writing the code that will utilize the ADLS stuff now, then when 16.0.0 is released (April?), I would be able to produce our product shortly thereafter. What remains to be completed, by the way? Anything I could help with? Thanks
@av8or1 as I said above: Move with the Blobs API (not a critical feature at all; s3fs doesn't even support Move), Python bindings (PR is open), and URI parsing (PR is open). Are any of these a dealbreaker for you? Everything else will be available in 16.0.0.
@felipecrv OK thank you. Work has been busy. Just now looking at this again. It appears that @kou has completed the URI parsing business (#40028). Thus I will prepare on my end to use the library when it is released. Hopefully in April. Thanks
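For anyone in the same situation, a minimal read path through the generic filesystem layer should look roughly like the sketch below; the abfs URI form and names are illustrative (see the documentation added with #40028 for the accepted forms):

```cpp
// Minimal read-path sketch using the generic arrow::fs API; the URI, account
// and container names are placeholders.
#include <algorithm>
#include <iostream>
#include <memory>
#include <string>

#include "arrow/buffer.h"
#include "arrow/filesystem/filesystem.h"
#include "arrow/io/interfaces.h"
#include "arrow/result.h"
#include "arrow/status.h"

arrow::Status ReadHead() {
  std::string path;
  // FileSystemFromUri dispatches on the scheme; the exact abfs URI layout is
  // described in the filesystem docs (illustrative here).
  ARROW_ASSIGN_OR_RAISE(
      auto fs, arrow::fs::FileSystemFromUri(
                   "abfs://mycontainer@myaccount.blob.core.windows.net/data/file.parquet",
                   &path));
  ARROW_ASSIGN_OR_RAISE(auto file, fs->OpenInputFile(path));
  ARROW_ASSIGN_OR_RAISE(int64_t size, file->GetSize());
  // Read at most the first 4 KiB of the blob.
  ARROW_ASSIGN_OR_RAISE(auto buffer, file->ReadAt(0, std::min<int64_t>(size, 4096)));
  std::cout << "read " << buffer->size() << " of " << size << " bytes" << std::endl;
  return arrow::Status::OK();
}

int main() {
  arrow::Status st = ReadHead();
  if (!st.ok()) {
    std::cerr << st.ToString() << std::endl;
    return 1;
  }
  return 0;
}
```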
Everything else will be available in 16.0.0.
@felipecrv do you know when 16.0.0 would be available?
In April. The time it takes for the release also depends on how smoothly the packaging and publishing process goes.
I think ideally https://github.com/apache/arrow/issues/40036 would be taken care of before the 16.0.0 release. I don't have any real world performance numbers but I suspect write performance is currently a bit disappointing.
Hi @Tom-Newton, there's no planned release before 16.0.0. The feature freeze for 16.0.0 is planned for the 8th of April.
I opened a MINOR PR expanding some of the docstrings: #40838
There are some subtasks still opened. @felipecrv should I tag this umbrella issue as 17.0.0?
AzureFileSystem is already usable and feature-complete even without these open issues being fixed. I'm in favor of closing it (marking it as complete), making it part of the 16.0 release in the logs.
Let's close this as complete.
We don't need an umbrella issue for AzureFileSystem. We can just use separate issues, as with other components.
Note that the current AzureFileSystem's CopyFile() doesn't work with Azure hierarchical namespace support. See also: #41095
Some other parts of the AzureFileSystem implementation also have problems with Azure hierarchical namespace support: #41034
I want to add the fix for this (#41068) to 16.0.0.
Subissues:
Reporter: Wes McKinney / @wesm Assignee: Shefali Singh
Related issues:
Original Issue Attachments:
PRs and other links:
Note: This issue was originally created as ARROW-2034. Please see the migration documentation for further details.