apache / arrow

Apache Arrow is a multi-language toolbox for accelerated data interchange and in-memory processing
https://arrow.apache.org/
Apache License 2.0

[C++] Don't allow the inclusion of passwords (storage account keys) in Azure ABFS URLs #43197

Open sugibuchi opened 1 month ago

sugibuchi commented 1 month ago

Describe the enhancement requested

Outline

The Azure Blob File System (ABFS) support in Apache Arrow, implemented in the C++ API by #18014 and integrated into the Python API by #39968, currently allows embedding a storage account key as a password in an ABFS URL.

https://github.com/apache/arrow/blob/r-16.1.0/cpp/src/arrow/filesystem/azurefs.h#L138-L144
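
For concreteness, this lets a key ride along in the file path (a sketch with hypothetical account and container names, assuming PyArrow >= 16.0 built with Azure support):

import pyarrow.parquet as pq

# The storage account key travels in the password (userinfo) part of the URL
table = pq.read_table(
    "abfss://my-container:<storage-account-key>@myaccount.dfs.core.windows.net/data.parquet"
)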

However, I strongly recommend stopping this practice for two reasons.

Security

An access key of a storage account is practically a "root password," giving full access to the data in the storage account.

Microsoft repeatedly emphasises this point throughout its documentation and encourages protecting account keys in a secure place such as Azure Key Vault.

Protect your access keys

Storage account access keys provide full access to the storage account data and the ability to generate SAS tokens. Always be careful to protect your access keys. Use Azure Key Vault to manage and rotate your keys securely.

https://learn.microsoft.com/en-us/azure/storage/common/storage-configure-connection-string#protect-your-access-keys

Because an ABFS URL is usually not treated as confidential information, embedding a storage account key in one may lead to unexpected exposure of the key.
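
To illustrate the exposure risk, a minimal sketch (hypothetical account and container names): any code that handles the URL as an ordinary string, such as standard URL parsing or logging, sees the key in plain text.

from urllib.parse import urlsplit

# Hypothetical Arrow-style ABFS URL with an embedded account key
url = "abfss://my-container:MY_ACCOUNT_KEY@myaccount.dfs.core.windows.net/data.parquet"

parts = urlsplit(url)
print(parts.password)  # -> MY_ACCOUNT_KEY, recoverable by anything that logs or echoes the URL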

Interoperability with other file system implementations

For historical reasons, the syntax of the Azure Blob File System (ABFS) URL is inconsistent between different file system implementations.

  • Original implementation by Apache Hadoop's hadoop-azure package (link). This syntax is widely used, particularly by Apache Spark.
  • Python adlfs for fsspec (link)
  • Rust object_store::azure (link)
  • DuckDB azure extension (link)
  • Apache Arrow (link)

This syntax inconsistency already causes problems in applications that combine different frameworks, including the overhead of translating ABFS URLs between syntaxes. It may also lead to unexpected behaviours when the same URL is misinterpreted by different file system implementations.

I believe a new file system implementation should respect the existing syntax of a URL scheme and SHOULD NOT invent new ones. As far as I understand, no other ABFS file system implementation allows embedding storage account keys in ABFS URLs.
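
For illustration, the same blob addressed in the different dialects might look like this (hypothetical account and container names; adlfs takes the account name out of band via storage options or environment variables):

# Hadoop-compatible syntax (hadoop-azure, Spark, object_store::azure, ...)
hadoop_style = "abfss://my-container@myaccount.dfs.core.windows.net/path/to/file"

# fsspec/adlfs: no account in the URL; it is configured separately
adlfs_style = "abfs://my-container/path/to/file"

# Arrow additionally accepts an account key as the password
arrow_style = "abfss://my-container:<account-key>@myaccount.dfs.core.windows.net/path/to/file"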

Component(s)

C++, Python

kou commented 1 month ago

Could you clarify your suggestion?

The following?

  • We should reject URLs that have :password in the "userinfo" URI part
  • We should reduce supported URL patterns

BTW, it seems that you misinterpreted supported formats:

Apache Arrow link

* Hadoop-compatible URL schemes, and
* abfs[s]://<container>:<password>@<account>.dfs.core.windows.net/path/to/file
* abfs[s]://<account>.dfs.core.windows.net/<container>/path/to/file
* abfs[s]://<password>@<account>.dfs.core.windows.net/<container>/path/to/file

All of the formats you listed are compatible with https://learn.microsoft.com/en-us/azure/storage/blobs/data-lake-storage-introduction-abfs-uri . "3." and "4." in the Arrow source are missing from your list. "3." is for Azurite, which is helpful for development. "4." is for compatibility with adlfs, object_store::azure and DuckDB.
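
For example, pattern 3 can point at a local Azurite emulator (an illustrative sketch; "devstoreaccount1" is Azurite's well-known development account name and 10000 its default blob port):

# Illustrative only: Azurite's well-known development account and default port
azurite_url = "abfs://devstoreaccount1:<azurite-account-key>@127.0.0.1:10000/my-container/path"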

kou commented 1 month ago

@Tom-Newton Could you take a look at this?

sugibuchi commented 1 month ago

@kou Thank you very much for your answer and correction.

  • We should reject URLs that have :password in the "userinfo" URI part
  • We should reduce supported URL patterns

I suggest only the first point. Regarding the second point, I don't have a strong objection to the existing patterns as long as a file system implementation supports the Hadoop-compatible syntax, as the Hadoop-compatible syntax is the only one commonly supported by most file system implementations (except DuckDB...).

However, as embedding storage account keys in ABFS URLs is a new practice invented by Apache Arrow and raises security concerns, I suggest removing this feature.

BTW, it seems that you misinterpreted supported formats:

You are right. I have updated the issue description; let me copy-paste the original comment from the source code.

  /// 1. abfs[s]://[:<password>@]<account>.blob.core.windows.net[/<container>[/<path>]]
  /// 2. abfs[s]://<container>[:<password>]@<account>.dfs.core.windows.net[/path]
  /// 3. abfs[s]://[<account[:<password>]@]<host[.domain]>[<:port>][/<container>[/path]]
  /// 4. abfs[s]://[<account[:<password>]@]<container>[/path]

In my view, only the second pattern (without the password) is Hadoop-compatible. Patterns 1 and 3 do not include the container (file system) name in the authority part, and pattern 4 uses a different syntax for the authority part.

But as I explained above, I don't have strong objections against these patterns except for having passwords in URLs.

Tom-Newton commented 1 month ago

To be honest, I think I will struggle to find time to work on this. Personally I'm in favour of the proposed changes: removing some of this extra URL parsing functionality, including passwords. However, I'm also not especially concerned about it. Certainly an account key is more concerning, but it is common practice to use SAS tokens in blob storage URLs.

kou commented 1 month ago

I'm neutral on this proposal.

If we reject the password value, users can't use account-key-based authentication with the URI interface. That is useful for local development with Azurite. FYI: Arrow's Azure filesystem doesn't send the URI specified by a user anywhere, and the Azure SDK for C++ doesn't send the raw account key to Azure; it uses the account key to generate a hash value for authentication.
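
For reference, that is the Shared Key scheme: each request carries an HMAC-SHA256 signature derived from the account key rather than the key itself. A minimal sketch of the signing step (simplified for illustration; not the SDK's actual code):

import base64
import hashlib
import hmac

def shared_key_signature(account_key: str, string_to_sign: str) -> str:
    # The base64-decoded account key is the HMAC secret; only the resulting
    # signature is sent in the Authorization header, never the raw key.
    secret = base64.b64decode(account_key)
    digest = hmac.new(secret, string_to_sign.encode("utf-8"), hashlib.sha256).digest()
    return base64.b64encode(digest).decode("utf-8")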

Could you share the bad scenarios you have in mind?

kou commented 1 month ago

Could you start a discussion on the dev@arrow.apache.org mailing list too to collect more feedback for this? See also: https://arrow.apache.org/community/#mailing-lists

sugibuchi commented 1 month ago

If we reject the password value, users can't use account-key-based authentication with the URI interface. That is useful for local development with Azurite.

I use Apache Arrow mainly in Python code. Let me explain by using PyArrow as an example.

When working with the PyArrow API, we have two methods to specify a file system.

  1. Create a FileSystem instance and explicitly set it as an argument.
  2. Let PyArrow infer a file system from a file path URL.
import pyarrow.parquet as pq
from pyarrow import fs

# Explicitly set
s3 = fs.S3FileSystem(...)
pq.read_table("my-bucket/data.parquet", filesystem=s3)

# Infer from a URL
pq.read_table("s3://my-bucket/data.parquet")

For 1, we don't need to embed a storage account key or any other credentials for file system access in a file path URL as long as we can set them when we create a file system instance.

s3 = fs.S3FileSystem(access_key=..., secret_key=...)

For 2, many existing file system libraries provide an interface to configure credentials for file system access in a global context.

Even though it may look convenient, embedding credentials in a file path URL is generally unnecessary; other file system implementations work well without it.

In the Azure SDK, the EnvironmentCredential used by DefaultAzureCredential defines a set of environment variables for configuring credentials.

There is no standardized environment variable for setting storage account keys, but AZURE_STORAGE_ACCOUNT and AZURE_STORAGE_KEY are widely used (example).
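
As a sketch of that practice (assuming PyArrow's AzureFileSystem constructor accepts account_name and account_key, as in Arrow 16):

import os
from pyarrow import fs

# Read the widely used (de facto) environment variables instead of
# embedding the key in the file path URL.
azure = fs.AzureFileSystem(
    account_name=os.environ["AZURE_STORAGE_ACCOUNT"],
    account_key=os.environ["AZURE_STORAGE_KEY"],
)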

We should consider these common practices instead of inventing new ABFS URL syntax.

sugibuchi commented 1 month ago

Could you share the bad scenarios you have in mind?

There are several Azure Blob File System implementations in Python, and we frequently need to use multiple implementations in the same code. However,

  1. Most ABFS implementations, except Arrow's AzureFileSystem, do not assume that ABFS URLs can contain confidential information like storage account keys.
  2. It is not always clear which implementation is actually used in a library.
    • PyArrow has had a native AzureFileSystem implementation since Arrow 16.0.0. However, delta-io, built on Arrow, uses the Rust object_store.
    • Pandas initially used fsspec's ABFS implementation but silently started using Arrow's native implementation after the release of Arrow 16.0.0 (#41496).
    • DuckDB has native ABFS support, but the Rust object_store is eventually used when reading Delta Lake into DuckDB via the ibis API.

Because of 2, an Arrow-style ABFS URL containing a storage account key can accidentally be passed to a different ABFS implementation, which, as explained in 1, does not expect the URL to contain a storage account key.

This leads to rejection of the URL and an error message like the one below, which can end up in error logs, HTTP error responses, etc.

Invalid file path: abfs://my-container:(plain text of a storage account key)@mycontainer.dfs.core.windows.net/...

Having storage account keys in ABFS URLs can cause this kind of interoperability issue with other ABFS implementations and unexpected exposure of keys.

sugibuchi commented 1 month ago

Certainly an account key is more concerning, but it is common practice to use SAS tokens in blob storage URLs.

This is true, but a blob URL containing a SAS token is usually (and should be) treated as confidential information that must be handled carefully to avoid unexpected leaks.

We cannot apply the same practice to general file paths, which appear in many places in code.

kou commented 1 month ago

Could you start a discussion on the dev@arrow.apache.org mailing list too to collect more feedback for this? See also: https://arrow.apache.org/community/#mailing-lists

@sugibuchi Could you do this with the additional information?