aws / aws-sdk

Landing page for the AWS SDKs on GitHub
https://aws.amazon.com/tools/

ListObjectsV2 API: enable sorting by LastModified time (with support for over 1000 objects) #539

Closed AnujSingh12 closed 1 year ago

AnujSingh12 commented 1 year ago

This is with reference to ticket https://github.com/aws/aws-sdk/issues/11

I have a use case where I need to backfill data from files in an S3 bucket into a database. I am running a cron job that is supposed to process the files in chronological order, so that the latest information is persisted in the database, but there is no feature to sort the files by their last-modified attribute. It would be great if we could have that feature.

tim-finnigan commented 1 year ago

Thanks @AnujSingh12 for reaching out. I updated your title to add a bit more detail. In that issue you linked I just added an example which I think addresses your use case:

import boto3

s3 = boto3.client('s3')

# ListObjectsV2 returns at most 1000 objects per request, so use a
# paginator to walk the full listing.
paginator = s3.get_paginator('list_objects_v2')
response_iterator = paginator.paginate(Bucket='bucket-name')

objects = []

for response in response_iterator:
    # 'Contents' is absent for an empty page/bucket.
    objects.extend(response.get('Contents', []))

# Sort newest first by the LastModified timestamp.
sorted_objects = sorted(objects, key=lambda x: x['LastModified'], reverse=True)

Also as mentioned there:

...the ListObjectsV2 API returns a maximum of 1000 results per request. But you can use the paginator to get the full results and then sort by last modified time. (For more info on boto3 paginators please see this documentation).

I can reach out to the S3 team and see if they could add support to the API for sorting by last modified time, and for requests returning over 1000 objects. But for now I think the workaround shared above should help. Please let me know if you have any other questions.

P89236193

AnujSingh12 commented 1 year ago

Thank you @tim-finnigan for your response. In my use case, I have a large number of files, and each file has multiple records in it. Something like this:

[screenshot: top-level dateStr folders in the bucket]

Each dateStr folder can contain subfolders as follows:

[screenshot: subfolders inside a dateStr folder]

And each of these subfolders could have any number of files (in the screenshot added below, we can see one file):

[screenshot: a single file inside one subfolder]

So to avoid the cron job running out of memory, I am making use of MaxKeys and Prefix, and processing each dateStr folder one by one. I am wondering: if I make use of the paginator to get the entire result set and then sort it, would I run into a memory shortage?
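One way to keep the per-folder workflow while still sorting chronologically (a minimal sketch; the `dateStr=.../sub/file` key layout and the `group_by_folder` helper are illustrative assumptions, not part of the SDK) is to group the listed objects by their top-level folder and sort each group on its own, so only one folder's objects need to be held at a time:

```python
from collections import defaultdict

def top_level_prefix(key):
    # Hypothetical helper: everything up to the first '/'
    # (e.g. 'dateStr=2023-01-01' for 'dateStr=2023-01-01/sub/file.json').
    return key.split('/', 1)[0]

def group_by_folder(objects):
    """Group ListObjectsV2-style object dicts by their top-level folder,
    sorting each folder's objects chronologically (oldest first)."""
    groups = defaultdict(list)
    for obj in objects:
        groups[top_level_prefix(obj['Key'])].append(obj)
    for folder in groups:
        groups[folder].sort(key=lambda o: o['LastModified'])
    return dict(groups)
```

Each group can then be processed and discarded before the next folder is listed, keeping memory roughly proportional to the largest single folder.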

tim-finnigan commented 1 year ago

If I make use of the paginator to get the entire result set and then sort it, would I run into a memory shortage?

It's possible, but that would depend on how much memory is required and how much you have available to allocate.

tim-finnigan commented 1 year ago

The S3 team has this request in their backlog for further review and prioritization. For updates going forward we recommend reaching out through AWS Support if you have a support plan or feel free to check back in here as well.

github-actions[bot] commented 1 year ago

This issue is now closed.

Comments on closed issues are hard for our team to see. If you need more assistance, please either tag a team member or open a new issue that references this one. If you wish to keep having a conversation with other community members under this issue feel free to do so.

mgreiner79 commented 4 months ago

I'm baffled that this feature does not already exist. It seems like a pretty standard thing for any API used for data retrieval that, at the very least, you can pass parameters for sorting the results. Without that, if the client wants the latest result and there are thousands of objects, the client needs to make many requests to gather all the objects and do the sorting itself. That can be a lot of traffic and delay for a pretty routine operation. Please, AWS team, do something about this.