boto / boto3

AWS SDK for Python
https://aws.amazon.com/sdk-for-python/
Apache License 2.0

Directory upload/download with boto3 #358

Open dduleep opened 8 years ago

dduleep commented 8 years ago

The PHP SDK has functions for downloading and uploading a directory (http://docs.aws.amazon.com/aws-sdk-php/v2/guide/service-s3.html#uploading-a-directory-to-a-bucket). Is there a similar function available in boto3?

If there is no such function, what methods are best suited for downloading/uploading a directory?

Note: my ultimate goal is to create a sync function like the AWS CLI's.

Right now I download/upload individual files using https://boto3.readthedocs.org/en/latest/reference/customizations/s3.html?highlight=upload_file#module-boto3.s3.transfer
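
For reference, a minimal sketch of the per-file transfers that customization provides (the bucket and key names here are illustrative):

import boto3

s3 = boto3.client("s3")

# Upload one file, then download it back; both calls come from the
# boto3.s3.transfer customization linked above.
s3.upload_file(Filename="local/photo.jpg", Bucket="my-bucket", Key="photos/photo.jpg")
s3.download_file(Bucket="my-bucket", Key="photos/photo.jpg", Filename="local/photo-copy.jpg")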

shinichi-takayanagi commented 4 years ago

We definitely want this functionality and have been waiting for almost five years. You are all very deliberate!

antgel commented 4 years ago

Lots of pointless comments here, and everyone subscribed gets a notification whenever someone adds one. Please use the thumbs-up reaction rather than adding "+1" or similar. Also, in case you're new to the open-source community: the software is worked on by volunteers and comes without warranty, so feel free to submit a patch or pay a developer to do so. Saying "we really need this" over and over comes across as entitled and unconstructive.

jbouse commented 4 years ago

@adaranutsa

Just make sure you include the asterisk in the function signature:

def aws_cli(*cmd):

And also on the call to main(*cmd).

I actually found that with Python 3.7 I needed to use main([*cmd]) in order to call aws_cli('s3', 'sync', ...) without having to enclose the arguments in [] or (). I got there by first calling aws_cli() with the arguments enclosed, then switching to main([*cmd]) and removing the enclosing brackets from the aws_cli() call. It took some trial and error to work out the solution.
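
For anyone landing here, a minimal sketch of the wrapper being discussed, assuming the awscli package is installed (aws_cli is the helper from the earlier comments, not a boto3 API; the bucket name is illustrative):

from awscli.clidriver import create_clidriver

def aws_cli(*cmd):
    # Run an AWS CLI command in-process and fail loudly on a non-zero exit code.
    exit_code = create_clidriver().main([*cmd])
    if exit_code != 0:
        raise RuntimeError("AWS CLI exited with code {}".format(exit_code))

aws_cli("s3", "sync", "./local-dir", "s3://my-bucket/prefix")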

JavierClearImageAI commented 4 years ago

Here's another implementation that parallelizes the upload.

import os
from concurrent import futures
import boto3

def upload_directory(directory, bucket, prefix):
    s3 = boto3.client("s3")

    def error(e):
        # Propagate os.walk errors instead of silently skipping directories.
        raise e

    def walk_directory(directory):
        # Yield the full path of every file under the directory tree.
        for root, _, files in os.walk(directory, onerror=error):
            for f in files:
                yield os.path.join(root, f)

    def upload_file(filename):
        # The object key is the prefix plus the path relative to the root directory.
        s3.upload_file(Filename=filename, Bucket=bucket, Key=prefix + os.path.relpath(filename, directory))

    with futures.ThreadPoolExecutor() as executor:
        # Upload concurrently; stop waiting as soon as any upload raises.
        futures.wait(
            [executor.submit(upload_file, filename) for filename in walk_directory(directory)],
            return_when=futures.FIRST_EXCEPTION,
        )

This works great... it took only 1.5 seconds to upload 150 images from an EC2 machine.
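
A call would look something like this (bucket and paths illustrative); note that futures.wait(..., return_when=futures.FIRST_EXCEPTION) stops waiting on the first failure but does not re-raise it, so call .result() on the futures if you need the exception, as a later comment in this thread does:

upload_directory("./images", "my-bucket", "images/")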

vishnu-dev commented 3 years ago

The AWS CLI has this feature, so why not boto3? This issue was opened in 2015; it's been 5 years. If this feature is not going to be implemented, please close the issue and explain why. At least people will stop following it and stop getting emails.

rams3sh commented 3 years ago

I used @rectalogic's code and modified it a little, since I needed to know which exceptions were encountered; the provided code just stopped without indicating the type of exception (if any). Below is my version:

from concurrent import futures
import boto3
import os

def upload_directory(directory, bucket, prefix, boto3_session=None):
    # Create the session inside the function; a boto3.Session() default argument
    # would be evaluated once at definition time and shared across all calls.
    s3 = (boto3_session or boto3.Session()).client("s3")

    def error(e):
        # Propagate os.walk errors instead of silently skipping directories.
        raise e

    def walk_directory(directory):
        # Yield the full path of every file under the directory tree.
        for root, _, files in os.walk(directory, onerror=error):
            for f in files:
                yield os.path.join(root, f)

    def upload_file(filename):
        s3.upload_file(Filename=filename, Bucket=bucket, Key=prefix + "/" + os.path.relpath(filename, directory))

    with futures.ThreadPoolExecutor() as executor:
        upload_task = {}

        for filename in walk_directory(directory):
            upload_task[executor.submit(upload_file, filename)] = filename

        # Report every failure instead of stopping silently on the first one.
        for task in futures.as_completed(upload_task):
            try:
                task.result()
            except Exception as e:
                print("Exception {} encountered while uploading file {}".format(e, upload_task[task]))

shawngmc commented 2 years ago

The way I see it, there are two ways this could be implemented:

(1) Directly adapt the aws-cli source, replacing its AWS calls with boto3 calls. It appears to be implemented in https://github.com/aws/aws-cli/blob/awscli/customizations/s3/, but my Python is rusty. subcommands.py parses the aws s3 command and builds a command dictionary. For example, the sync strategies are the different syncing rules, which are invoked by comparator.py. This could produce an implementation with behavior identical to the CLI with potentially less work. However, it may also create an expectation that it will always behave like the aws cli.

(2) Implement a new sync based on the principles of tools like rsync. This is faster to get to a proof of concept, but requires more work because you have to write the sync logic yourself, and that logic is tricky since S3 is not quite a normal filesystem. (A rough sketch of this approach follows.)
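
To make option (2) concrete, here is a rough, hedged sketch of an rsync-style one-way sync (sync_up is a hypothetical helper, not a boto3 or aws-cli API; it compares only sizes, whereas a real implementation would also compare timestamps or checksums):

import os
import boto3

def sync_up(directory, bucket, prefix=""):
    s3 = boto3.client("s3")

    # Map existing object keys to their sizes using the list_objects_v2 paginator.
    existing = {}
    for page in s3.get_paginator("list_objects_v2").paginate(Bucket=bucket, Prefix=prefix):
        for obj in page.get("Contents", []):
            existing[obj["Key"]] = obj["Size"]

    for root, _, files in os.walk(directory):
        for name in files:
            path = os.path.join(root, name)
            key = prefix + os.path.relpath(path, directory).replace(os.sep, "/")
            # Skip files that already exist remotely with the same size.
            if existing.get(key) == os.path.getsize(path):
                continue
            s3.upload_file(Filename=path, Bucket=bucket, Key=key)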

I'd also like to see this functionality. It is definitely a weird middle ground: it's not a direct call to an S3 API, but it is a very common use case. Shelling out to the aws cli for this from a Python script is acceptable, but a pain.

bdrx312 commented 2 years ago

S3 recently added support for checksums (https://aws.amazon.com/blogs/aws/new-additional-checksum-algorithms-for-amazon-s3/). A sync feature that could compare checksums the way rsync does would be really useful.
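
For context, those checksums are already exposed through boto3; a hedged sketch of storing and reading one (bucket and key illustrative; note that for multipart uploads the stored value is a checksum of part checksums, not of the whole object):

import boto3

s3 = boto3.client("s3")

# Ask S3 to compute and store a SHA-256 checksum on upload.
s3.put_object(
    Bucket="my-bucket",
    Key="data/report.csv",
    Body=b"hello world",
    ChecksumAlgorithm="SHA256",
)

# Read the stored checksum back for an rsync-style comparison.
head = s3.head_object(Bucket="my-bucket", Key="data/report.csv", ChecksumMode="ENABLED")
print(head.get("ChecksumSHA256"))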

VannTen commented 2 years ago

Hi.

I recently implemented an S3 client sync (syncing two different buckets on different providers) using boto3. I was wondering if upstream (i.e., boto3) would be interested in a PR adding this feature directly to boto3.

My implementation is basically: two sorted generators fed into a set difference, producing a generator of the keys to be synced, which feeds a sync function.

s3_generator: https://github.com/VannTen/document-sync-job/blob/work_avoidance_with_generators/app.py#L169-L178
set_difference: https://github.com/VannTen/document-sync-job/blob/work_avoidance_with_generators/lazy_set_ops.py
sync_function: https://github.com/VannTen/document-sync-job/blob/work_avoidance_with_generators/app.py#L144-L148

A default + overridable design would provide:
- sync between a local dir (or another source): override one generator + the sync function
- rsync-like behavior: override the set difference / provide a custom equality

One of my guiding principles was O(1) space complexity, to be able to handle very large collections. (The linked PR does not have that, because concurrent.futures.as_completed eagerly consumes its argument, but that should be fixable.)
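
The sorted-streams approach can be illustrated with a small sketch (sorted_difference is a hypothetical helper; it keeps the O(1) space property mentioned above by doing a single merge-style pass):

def sorted_difference(left, right):
    # Yield items of the sorted iterable `left` that are absent from the
    # sorted iterable `right`, holding only one item of each in memory.
    right = iter(right)
    sentinel = object()
    r = next(right, sentinel)
    for item in left:
        # Advance `right` past everything smaller than the current left key.
        while r is not sentinel and r < item:
            r = next(right, sentinel)
        if r is sentinel or item < r:
            yield item  # key exists only on the left side, so it needs syncing

# Example: keys present locally but missing from the bucket.
print(list(sorted_difference(["a", "b", "d"], ["b", "c"])))  # ['a', 'd']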

Would the project maintainers be interested in a PR implementing something like this, and do they have pointers on what the API should look like / where it should live?

Thanks.

bdrx312 commented 2 years ago

Why is there no response from the maintainers of this library, when there is a community willing to help support and improve the tool to suit its needs?

a-canela commented 1 year ago

Sorry for not going through the contributing guidelines as I should, but I'll leave this quick draft implementation here in case it helps someone in the meantime:

https://gist.github.com/a-canela/4cbbe20b08ce1fa92ff373d5b60ac9ef