apache / arrow

Apache Arrow is the universal columnar format and multi-language toolbox for fast data interchange and in-memory analytics
https://arrow.apache.org/
Apache License 2.0
14.58k stars 3.54k forks source link

[C++] Implement GetFileInfo for a single file in Azure filesystem #38335

Closed Tom-Newton closed 1 year ago

Tom-Newton commented 1 year ago

Describe the enhancement requested

Implement Result<FileInfo> AzureFileSystem::GetFileInfo(const std::string& path).

Logic I suggest we apply:

  1. Parse the string path to AzurePath.
  2. If path.container.empty() then this is the root of the storage account. Its a directory and we don't really have any more info to add.
  3. If path.path_to_file.empty() then this is the root of a container. Try to fetch properties of the container. If the container exists and not IsDeleted() then its a directory and probably we can add the last modified time.
  4. Call BlobClient::GetProperties(). If the storage account is hierarchical namespace it possible that we will get the properties for a directory, so we should always check this https://github.com/Azure/azure-sdk-for-cpp/blob/dd236311193c6a3debf3b12c47f14e49a20c72c7/sdk/storage/azure-storage-files-datalake/src/datalake_utilities.cpp#L86-L91.

All of this only needs the simple blob endpoints - nothing exclusive to hierarchical namespace enabled accounts.

The other GetFileInfo method requires doing directory listing which will be some additional complexity so I propose implementing that in a separate PR.

Related Issues:

Component(s)

C++

Tom-Newton commented 1 year ago

It seems I don't actually have permissions to assign myself, without opening a PR but I'm going to start working on this.

Tom-Newton commented 1 year ago

Initially I assumed that the correct behaviour would be to return FileType::NotFound if the specified blob does not exist, given that blob storage does really have directories. However having looked closer at the s3 and gcs implementations it looks like they try listing the path and consider the path to be a "directory" if listing returns at least one result.

Presumably we want to implement the same thing for Azure, and probably this should follow a slightly different code path if hierarchical namespace is enabled. With HNS we can avoid the extra listing operation, so that would provide a performance advantage. This makes things more complicated but this was inevitable at some point so I guess I may as well do it now.

kou commented 1 year ago

It seems I don't actually have permissions to assign myself, without opening a PR but I'm going to start working on this.

We can use "take" only comment for it: https://arrow.apache.org/docs/developers/bug_reports.html#issue-assignment

Tom-Newton commented 1 year ago

I think this is ready for review now, so hopefully what I've done makes sense. https://github.com/apache/arrow/pull/38505

Unfortunately I could not test all the hierarchical namespace related stuff with azurite. I created some tests for testing against real blob storage accounts which will be automatically skipped if the required storage account details are not provided. If anyone has any ideas on how to handle this better, that would be great. Potentially CI could be configured to run these but for now at least its possible for developers to run them manually.