aws-samples / aws-big-data-blog-dmscdc-walkthrough

MIT No Attribution

list_objects 1000 limit #5

Closed heyhughes closed 2 years ago

heyhughes commented 3 years ago

Any thoughts on the 1000-object limit of `list_objects`? We're finding that we have more than 1000 objects with a particular prefix (DMS syncing a volatile table). The Controller Python script uses `s3conn.list_objects` to identify whether any new files need to be incrementally loaded, but `list_objects` returns at most 1000 keys per call, so if you have over 1000 `2*.parquet` files the incremental job will never kick off.
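For context, a minimal sketch of the single-call pattern being described (hypothetical helper name; the actual Controller script differs): S3 returns at most 1000 keys per `list_objects` response, and anything beyond that first page is invisible unless the caller follows `IsTruncated`/`Marker`.

```python
def first_page_keys(s3, bucket, prefix):
    """Mimic a single-call list_objects pattern.

    S3 returns at most 1000 keys per response; when more objects share
    the prefix, IsTruncated is True and the remaining keys are silently
    missed unless the caller issues follow-up requests with Marker.
    """
    resp = s3.list_objects(Bucket=bucket, Prefix=prefix)
    keys = [obj["Key"] for obj in resp.get("Contents", [])]
    return keys, resp.get("IsTruncated", False)

# In the real script, s3 would be a boto3 client: s3 = boto3.client("s3")
```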

sheridan06 commented 3 years ago

@heyhughes you can switch to the newer `list_objects_v2` and use its `StartAfter` parameter. Look up the last processed file in the DynamoDB controller table and have `list_objects_v2` start after that key in S3; that way you get the next 1000 objects that haven't been processed yet. Worked like a charm for me.
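A sketch of that suggestion, with hypothetical bucket/prefix names and assuming the last processed key has already been fetched from the DynamoDB controller table; pairing `StartAfter` with a paginator also covers the case where more than 1000 unprocessed files accumulate between runs.

```python
def list_unprocessed_keys(s3, bucket, prefix, last_processed_key):
    """Yield keys under `prefix` that sort strictly after `last_processed_key`.

    StartAfter makes S3 skip already-processed files server-side, and the
    paginator transparently follows continuation tokens, so runs with more
    than 1000 new files are handled as well.
    """
    paginator = s3.get_paginator("list_objects_v2")
    pages = paginator.paginate(Bucket=bucket, Prefix=prefix,
                               StartAfter=last_processed_key)
    for page in pages:
        for obj in page.get("Contents", []):
            yield obj["Key"]

# Usage with a real client (names hypothetical):
#   s3 = boto3.client("s3")
#   for key in list_unprocessed_keys(s3, "my-dms-bucket", "schema/table/", last_key):
#       load_incrementally(key)
```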

heyhughes commented 3 years ago

That did the trick! Thank you @sheridan06

rjvgupta commented 2 years ago

I just updated the code with this suggestion.