S3 sync recursively with per-object metadata

jcmcken commented 8 years ago

I'm looking to take advantage of the aws s3 sync command, but provide per-object metadata (i.e. metadata that can change per object) rather than provide global metadata with --metadata.

Right now, I basically have three options:

Write my own syncing procedures. This would duplicate the work and testing that has gone into the sync command.
Include metadata in separate objects, e.g. /path/to/object and /path/to/object.meta. This isn't great: it means I pay for extra objects and also have to manage the metadata within my application.
Upload on a per-object basis. This isn't going to be performant for my use case.

What would be nice is if I could somehow indicate to the CLI that I want to map each object to a set of metadata, and then upload each object with that metadata. A couple of solutions come to mind:

A giant JSON file mapping each key name to a metadata hash e.g.:

{
  "path/to/object1": {"key1": "value1", "key2": "value2"},
  ...etc...
}

$ aws s3 sync /some/dir s3://somebucket --metadata-mapping /path/to/meta/mapping.json
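To make the file format a bit more concrete, here is a minimal Python sketch of how a mapping file like this might be loaded and checked. The --metadata-mapping flag is only a proposal, and load_metadata_mapping and metadata_for are hypothetical helper names, not anything in the CLI. The sketch assumes every metadata value is a string, since S3 user metadata is string-valued.

```python
import json


def load_metadata_mapping(path):
    """Load a JSON file that maps object keys to flat string metadata.

    Hypothetical helper: illustrates the proposed file format only.
    """
    with open(path, encoding="utf-8") as handle:
        mapping = json.load(handle)
    if not isinstance(mapping, dict):
        raise ValueError("top level of the mapping must be a JSON object")
    for key, metadata in mapping.items():
        if not isinstance(metadata, dict):
            raise ValueError(f"metadata for {key!r} must be a JSON object")
        for name, value in metadata.items():
            # S3 user metadata is string-valued, so reject nested or numeric values.
            if not isinstance(value, str):
                raise ValueError(f"metadata {name!r} for {key!r} must be a string")
    return mapping


def metadata_for(mapping, key):
    """Return the metadata for one key, or an empty dict if none was given."""
    return mapping.get(key, {})
```

Keys without an entry simply get no extra metadata here; whether that should be an error instead is a design choice I haven't thought through.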

Some convention for writing metadata locally into a separate file per intended object, and having the sync command read the metadata for each object prior to uploading. For example, I could have a local directory:

$ ls /path/to/local/files
file1
file1.meta
$ cat file1.meta
{
  "key1": "value1",
  "key2": "value2"
}
$ aws s3 sync /path/to/local/files s3://somebucket --object-metadata '$filename.meta'

(So when this is run, the $filename.meta files would just be read for metadata, and would not be transferred)
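Roughly, the pairing logic I have in mind would look like the Python sketch below. The .meta suffix, the META_SUFFIX constant and the pair_files_with_metadata name are all assumptions for illustration; nothing like this exists in the CLI today.

```python
import json
from pathlib import Path

META_SUFFIX = ".meta"  # assumed sidecar convention, not an existing CLI feature


def pair_files_with_metadata(root):
    """Map each data file under root (as a relative key) to its sidecar metadata.

    Sidecar files themselves are left out of the result, so they would be
    read for metadata but never transferred.
    """
    root = Path(root)
    pairs = {}
    for path in sorted(root.rglob("*")):
        if not path.is_file() or path.name.endswith(META_SUFFIX):
            continue
        sidecar = path.with_name(path.name + META_SUFFIX)
        metadata = {}
        if sidecar.is_file():
            metadata = json.loads(sidecar.read_text(encoding="utf-8"))
        pairs[path.relative_to(root).as_posix()] = metadata
    return pairs
```

One obvious downside of this convention is that a real data file whose name ends in .meta could never be uploaded, which is why the suffix would probably need to be configurable.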

A callback that takes the local filename as a parameter and spits out the metadata, e.g.

$ ls /path/to/local/files
file1
$ lookup-metadata.py /path/to/local/files/file1
{
  "key1": "value1"
}
$ aws s3 sync /path/to/local/files s3://somebucket --metadata-callback lookup-metadata.py
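On the calling side, I imagine something like this Python sketch: run the callback once per file and parse whatever it prints as JSON. The metadata_from_callback name, the 30 second timeout and the "command plus filename" calling convention are assumptions on my part.

```python
import json
import subprocess


def metadata_from_callback(command, filename, timeout=30):
    """Run `command filename` and parse its stdout as a JSON object.

    `command` is a list such as ["lookup-metadata.py"]; the local filename
    is appended as the final argument. Hypothetical helper for illustration.
    """
    result = subprocess.run(
        [*command, str(filename)],
        capture_output=True,
        text=True,
        check=True,  # a non-zero exit status raises CalledProcessError
        timeout=timeout,
    )
    metadata = json.loads(result.stdout)
    if not isinstance(metadata, dict):
        raise ValueError("callback must print a JSON object")
    return metadata
```

Spawning a process per file would add noticeable overhead on large trees, so a real implementation might prefer a long-running callback or batching.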

Alternatively, what would be really great is if the syncing functionality were available independently of the CLI from within Python (without requiring me to figure out the internals of how to properly initialize the CLI environment, etc.), so that I could subclass and customize the process. I started going down this route somewhat, but am worried that this API is not for public consumption and would break in the future.

Any thoughts?

JordonPhillips commented 8 years ago

I'm -1 on adding that. I don't think providing that kind of mapping is a very good experience. At that point you're effectively setting everything manually anyway, so it would take just as much time to perform all those requests.

As far as using our code, we don't guarantee we won't break internals. However, it is MIT licensed so feel free to vendor or copy it.

jcmcken commented 8 years ago

In my use case, the metadata is precomputed against the objects I'm trying to store and placed in a storage backend (details not important, but e.g. MongoDB). All I'm doing is retrieving the data from that backend and storing it with the objects. If I do this object-by-object, then I need to recreate threaded uploads, multipart handling, sync strategies, etc. -- all of the things that the sync command normally would do for me. I then need to hook in my logic to make sure the correct metadata is stored with each object. If the CLI supports a mapping or callback, then I just need to translate the data into the correct format (which I can stage ahead of time), and then run the sync.
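For reference, the object-by-object workaround looks roughly like the Python sketch below. It assumes a client with a boto3-style upload_file(Filename, Bucket, Key, ExtraArgs=...) method, and upload_with_metadata is a hypothetical name. It uploads every file unconditionally, so it has none of the sync command's change detection, which is exactly the part I'd rather not reimplement.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed
from pathlib import Path


def upload_with_metadata(client, bucket, root, mapping, max_workers=8):
    """Upload every file under root, attaching per-key metadata from mapping.

    `client` is assumed to expose a boto3-style upload_file method.
    This is NOT a sync: nothing is compared against the bucket first.
    """
    root = Path(root)
    files = [p for p in sorted(root.rglob("*")) if p.is_file()]
    uploaded = []
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {}
        for path in files:
            key = path.relative_to(root).as_posix()
            metadata = mapping.get(key)
            # Only pass ExtraArgs when there is metadata for this key.
            extra_args = {"Metadata": metadata} if metadata else None
            future = pool.submit(
                client.upload_file, str(path), bucket, key, ExtraArgs=extra_args
            )
            futures[future] = key
        for future in as_completed(futures):
            future.result()  # re-raise any upload error here
            uploaded.append(futures[future])
    return sorted(uploaded)
```

With boto3 installed this would be called as upload_with_metadata(boto3.client("s3"), "somebucket", "/path/to/local/files", mapping), where mapping is whatever I pulled out of my backend.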

jamesls commented 8 years ago

+1 for me. I think this is a reasonable request. Out of all the proposed solutions, I like the metadata JSON file the best. I'm inclined to mark this as a feature request.

@jcmcken One other thing worth considering is the work @kyleknap's been doing for s3transfer. It's still under active development so I wouldn't recommend it for general use just yet, but the idea is to create a good python API for the functionality that's currently exposed in the AWS CLI.

rmharrison commented 8 years ago

@jamesls Since s3transfer is still very much in active development, do you have a recommendation for syncing with per file metadata?

node-s3-client is the most promising library I've come across, but the project seems to be having problems with the underlying AWS SDK, see https://github.com/andrewrk/node-s3-client/issues/129

ASayre commented 6 years ago

Good Morning!

We're closing this issue here on GitHub, as part of our migration to UserVoice for feature requests involving the AWS CLI.

This will let us get the most important features to you, by making it easier to search for and show support for the features you care the most about, without diluting the conversation with bug reports.

As a quick UserVoice primer (if not already familiar): after an idea is posted, people can vote on the ideas, and the product team will be responding directly to the most popular suggestions.

We’ve imported existing feature requests from GitHub, so search for this issue there!

And don't worry, this issue will still exist on GitHub for posterity's sake. As it’s a text-only import of the original post into UserVoice, we’ll still be keeping in mind the comments and discussion that already exist here on the GitHub issue.

GitHub will remain the channel for reporting bugs.

Once again, this issue can now be found by searching for the title on: https://aws.uservoice.com/forums/598381-aws-command-line-interface

-The AWS SDKs & Tools Team

jamesls commented 6 years ago

Based on community feedback, we have decided to return feature requests to GitHub issues.

pgriess commented 3 years ago

I'd like this to exist and am willing to spend some time building it.

What is the best way to proceed here? I can jump right to submitting a PR for the single metadata JSON file, but would it be helpful to discuss design / implementation strategy first? I've never committed to this repo before, so if there are any pointers to related code / suggested supporting infrastructure, I'm all ears.

tim-finnigan commented 3 years ago

Hi @pgriess, thanks for your willingness to contribute. If you want to create a PR then I recommend reading the contributing guide here: https://github.com/aws/aws-cli/blob/master/CONTRIBUTING.md

You can expand on your proposed implementation here or in a PR. I think looking through these s3 sync customizations is a good place to start: https://github.com/aws/aws-cli/tree/develop/awscli/customizations/s3/syncstrategy

aws / aws-cli

S3 sync recursively with per-object metadata #2045