Closed Tom-Newton closed 1 year ago
It seems I don't actually have permissions to assign myself, without opening a PR but I'm going to start working on this.
Initially I assumed that the correct behaviour would be to return FileType::NotFound
if the specified blob does not exist, given that blob storage does really have directories. However having looked closer at the s3 and gcs implementations it looks like they try listing the path and consider the path to be a "directory" if listing returns at least one result.
Presumably we want to implement the same thing for Azure, and probably this should follow a slightly different code path if hierarchical namespace is enabled. With HNS we can avoid the extra listing operation, so that would provide a performance advantage. This makes things more complicated but this was inevitable at some point so I guess I may as well do it now.
It seems I don't actually have permissions to assign myself, without opening a PR but I'm going to start working on this.
We can use "take" only comment for it: https://arrow.apache.org/docs/developers/bug_reports.html#issue-assignment
I think this is ready for review now, so hopefully what I've done makes sense. https://github.com/apache/arrow/pull/38505
Unfortunately I could not test all the hierarchical namespace related stuff with azurite
. I created some tests for testing against real blob storage accounts which will be automatically skipped if the required storage account details are not provided. If anyone has any ideas on how to handle this better, that would be great. Potentially CI could be configured to run these but for now at least its possible for developers to run them manually.
Describe the enhancement requested
Implement
Result<FileInfo> AzureFileSystem::GetFileInfo(const std::string& path)
.Logic I suggest we apply:
AzurePath
.path.container.empty()
then this is the root of the storage account. Its a directory and we don't really have any more info to add.path.path_to_file.empty()
then this is the root of a container. Try to fetch properties of the container. If the container exists and notIsDeleted()
then its a directory and probably we can add the last modified time.All of this only needs the simple blob endpoints - nothing exclusive to hierarchical namespace enabled accounts.
The other GetFileInfo method requires doing directory listing which will be some additional complexity so I propose implementing that in a separate PR.
Related Issues:
Component(s)
C++