apache / arrow

Apache Arrow is the universal columnar format and multi-language toolbox for fast data interchange and in-memory analytics
https://arrow.apache.org/
Apache License 2.0

[C++][Python] S3 tag support on write #32083

Open asfimport opened 2 years ago

asfimport commented 2 years ago

S3 allows tagging objects to better organize one's data (https://docs.aws.amazon.com/AmazonS3/latest/userguide/object-tagging.html). We use this for efficient downstream processes and inventory management.

Currently arrow/pyarrow does not allow tags to be added on write. This forces us to scan the bucket and re-apply the tags after a pyarrow-based process has run.

I looked through the code and think that it could potentially be done via the metadata mechanism.

The tags need to be added to the CreateMultipartUploadRequest here: https://github.com/apache/arrow/blob/master/cpp/src/arrow/filesystem/s3fs.cc#L1156

See also

http://sdk.amazonaws.com/cpp/api/LATEST/class_aws_1_1_s3_1_1_model_1_1_create_multipart_upload_request.html#af791f34a65dc69bd681d6995313be2da
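
As a rough illustration of what would have to be passed to that request: the S3 REST API expects the tag set as a single string encoded as URL query parameters (the `x-amz-tagging` header). The sketch below assumes the tagging field of the request takes that same string; `encode_tagging` is a hypothetical helper name, not an Arrow or AWS SDK function, and the limits are the ones stated in the S3 object-tagging guide (10 tags per object, keys up to 128 characters, values up to 256).

```python
from urllib.parse import quote, urlencode

# Limits taken from the S3 object-tagging user guide.
MAX_TAGS = 10
MAX_KEY_LEN = 128
MAX_VALUE_LEN = 256


def encode_tagging(tags: dict[str, str]) -> str:
    """Encode tags as URL query parameters, the form S3 uses for x-amz-tagging."""
    if len(tags) > MAX_TAGS:
        raise ValueError(f"S3 allows at most {MAX_TAGS} tags per object")
    for key, value in tags.items():
        if not 1 <= len(key) <= MAX_KEY_LEN:
            raise ValueError(f"invalid tag key length: {key!r}")
        if len(value) > MAX_VALUE_LEN:
            raise ValueError(f"tag value too long for key {key!r}")
    # quote_via=quote percent-encodes spaces as %20 instead of '+'.
    return urlencode(tags, quote_via=quote)


if __name__ == "__main__":
    print(encode_tagging({"team": "data eng", "stage": "raw"}))
```

Validating on the client side is a design choice for this sketch only; S3 itself rejects tag sets that exceed its limits.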

Reporter: André Kelpe

Note: This issue was originally created as ARROW-16746. Please see the migration documentation for further details.

asfimport commented 2 years ago

Antoine Pitrou / @pitrou: We should try to do this in a way that's generic enough and can be implemented in other filesystem types.

For example I see that GCS supports custom metadata: https://cloud.google.com/storage/docs/metadata#custom-metadata

Some local filesystems support extended attributes: https://en.wikipedia.org/wiki/Extended_file_attributes

cc @coryan

asfimport commented 2 years ago

Steve Loughran: Hadoop S3A maps user attributes to the filesystem XAttr APIs, and will very soon also let you set them when you create a file.

asfimport commented 2 years ago

Antoine Pitrou / @pitrou: [~stevel@apache.org] Thanks for the information. What are "user attributes" in this context? Are you talking about "User-defined object metadata" as defined in https://docs.aws.amazon.com/AmazonS3/latest/userguide/UsingMetadata.html ?

asfimport commented 2 years ago

Steve Loughran: Yes. We use them a bit in the S3A committers, to annotate a zero-byte marker file with the length the file will finally have when it is manifest at its destination. In HADOOP-17833 that's being exposed in the createFile(path) builder API, where apps can set headers at create time. Presumably GCS and Azure could be wired up differently; they both have the advantage that you can edit file attributes after creation.

asfimport commented 1 year ago

Apache Arrow JIRA Bot: This issue was last updated over 90 days ago, which may be an indication it is no longer being actively worked. To better reflect the current state, the issue is being unassigned per project policy. Please feel free to re-take assignment of the issue if it is being actively worked, or if you plan to start that work soon.

elveshoern32 commented 1 year ago

In the original request the term 'tag' is used, later the term 'metadata' is used.

I know S3 only:

In S3, 'metadata' (more precisely, 'user-defined object metadata') is considered part of the object and is thus immutable; it has to be added at creation time. 'Tags', however, are a different animal and can be added, changed, or removed at any time.

Neither 'tags' nor 'user-defined object metadata' is currently supported by Arrow; only a few 'system-defined object metadata' entries are. For my use case it would be helpful to be able to use at least one of the two. The details of the API are of minor importance to me.
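
To make the distinction concrete, here is a minimal sketch of how the two kinds of annotation travel in an S3 upload request: user-defined metadata goes out as one `x-amz-meta-*` header per entry, while tags go out as a single `x-amz-tagging` header with URL-query encoding. `build_put_headers` is a hypothetical helper written for illustration; it is not part of Arrow or of any AWS SDK, and it only builds a header dictionary without contacting S3.

```python
from urllib.parse import quote, urlencode


def build_put_headers(metadata: dict[str, str], tags: dict[str, str]) -> dict[str, str]:
    """Return the request headers that would carry metadata and tags."""
    # User-defined metadata: one header per entry, fixed once the object exists.
    headers = {f"x-amz-meta-{key.lower()}": value for key, value in metadata.items()}
    # Tags: a single header; the tag set can still be changed after the upload.
    if tags:
        headers["x-amz-tagging"] = urlencode(tags, quote_via=quote)
    return headers


if __name__ == "__main__":
    example = build_put_headers({"origin": "etl-job"}, {"retention": "90d"})
    for name, value in example.items():
        print(f"{name}: {value}")
```

An API that exposes both would need to keep them apart, since only the tag header has a later "update" counterpart on the S3 side.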

> We should try to do this in a way that's generic enough and can be implemented in other filesystem types.

Agreed. However, I feel that Arrow should not impose any further limits to which metadata are possible. Different storage technologies show different characteristics; Arrow shouldn't implement just the smallest common subset.

> This is causing us to scan the bucket and re-apply the tags after a pyarrow-based process has run.

This is exactly what I'd like to avoid, because

  1. S3 calls are actually quite costly (in terms of CPU and wall clock time) and

  2. this approach leads to a time window of unknown length during which the objects carry the wrong set of tags, which might lead the ILM (Information Lifecycle Management) to make wrong decisions.
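
For reference, the scan-and-retag workaround described in this thread can be sketched roughly as follows. The function takes an S3 client as a parameter and assumes it behaves like a boto3 S3 client (`get_paginator("list_objects_v2")` and `put_object_tagging`); `retag_prefix` and the bucket and prefix names in the usage note are hypothetical. It shows where both costs come from: one additional request per object, and a window between the write and the retagging pass.

```python
def retag_prefix(s3_client, bucket: str, prefix: str, tags: dict[str, str]) -> int:
    """Apply `tags` to every object under `prefix`; return how many were tagged."""
    tag_set = [{"Key": key, "Value": value} for key, value in tags.items()]
    paginator = s3_client.get_paginator("list_objects_v2")
    tagged = 0
    for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
        # Pages for an empty prefix carry no "Contents" entry.
        for obj in page.get("Contents", []):
            # One extra request per object; until it completes, the object
            # is visible without the intended tags.
            s3_client.put_object_tagging(
                Bucket=bucket, Key=obj["Key"], Tagging={"TagSet": tag_set}
            )
            tagged += 1
    return tagged
```

With boto3 installed and credentials configured, a call might look like `retag_prefix(boto3.client("s3"), "my-bucket", "output/", {"stage": "raw"})`. Note that `put_object_tagging` replaces the object's whole tag set rather than merging into it.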