Open timmeinerzhagen opened 1 year ago
Hi @timmeinerzhagen - thanks for reaching out.
I wanted to clarify a few things before further investigation here.

When you mentioned that the command takes about 16 mins, does it actually accomplish the sync or just hang with errors or outputs?

If it just hangs, would it be possible for you to provide debug logs for the period of time leading up to where it hangs? Just need a bit more information in order to troubleshoot/reproduce.

Can you also confirm that the sync is between local -> s3?

I attempted to repro the issue with the command below and didn't see similar behavior. Please note that I probably didn't have as many objects as you have.
aws s3 sync . s3://test-bucket/test --exclude "*test/*" --delete
Best, John
@aBurmeseDev Thanks for following up!
When you mentioned that the command takes about 16 mins, does it actually accomplish the sync or just hang with errors or outputs?
The sync itself concluded successfully and did upload / delete files accordingly. Without --debug there was no output for most of the time; only the few actual uploads it performed were printed.
If it just hangs, would it be possible for you to provide debug logs for the period of time leading up to where it hangs? Just need a bit more information in order to troubleshoot/reproduce.
I can't provide the logs here since it is from a private company project, but I can attach them to my existing AWS Support Case (Case ID 11932289381). Let me know if something needs to be done on my side to grant you access to that.
Sadly, I do not have the full logs from when it spent 16 minutes, since GitHub Actions does not seem to save rerun logs. I have attached a more recent run from after I pruned many of the 1.800.000 objects to only 350.000 left - it now only spends about 5 minutes. But it would just be a repeat of X did not match exclude filter: <filter> and then a List command every so often.
Can you also confirm that the sync is between local -> s3?
That is right, it is copying from local to S3 - or rather, from a GitHub Actions runner to S3, as can be seen in the logs in the AWS Case.
I attempted to repro the issue with the command below and didn't see similar behavior. Please note that I probably didn't have as many objects as you have.
aws s3 sync . s3://test-bucket/test --exclude "*test/*" --delete
Ah, my command example did not accurately represent what the text above it described. It was supposed to be a sync to the root of the bucket while excluding the test folder, but my example synced to dest while excluding test - sorry for the incorrect example.

A more accurate test command would be aws s3 sync --delete --debug . s3://test-bucket --exclude "test/*", where test has many files and we are trying to deploy something to the root of the S3 bucket, on the same level as the test folder.
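In case it helps with reproduction, here is a rough sketch of that setup as a script - the bucket name, object count, and the site folder below are placeholders, not the real project:

# Sketch only: test-bucket, the object count, and the site folder are placeholders.
# 1. Populate the excluded prefix with many small objects (this itself takes a while).
mkdir -p test
for i in $(seq 1 100000); do echo "$i" > "test/file_$i.txt"; done
aws s3 sync test s3://test-bucket/test

# 2. Sync something small to the bucket root while excluding test/*.
#    The runtime grows with the number of objects under test/, even though
#    none of them are transferred or deleted.
mkdir -p site && echo "hello" > site/index.html
time aws s3 sync --delete site s3://test-bucket --exclude "test/*"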
Hello,
I experience a similar issue. The difference is that I'm copying from remote to a local directory:
aws s3 sync --size-only --exclude '*.png' s3://test-bucket/ /data/test-bucket/
The test-bucket contains almost 3M .png files in the top-level directory that are already copied to the local directory. The files are immutable and we would like to just skip them to save time. After issuing this command, it spends a lot of time preparing to transfer. Using strace, I verified that it executes stat() on all .png files:
[pid 499293] stat("/data/test-bucket/_x8h4645etdh1y.png", {st_mode=S_IFREG|0644, st_size=509374, ...}) = 0
[pid 499293] stat("/data/test-bucket/0k7lhh0pnb55_5.png", {st_mode=S_IFREG|0644, st_size=7204, ...}) = 0
[pid 499293] stat("/data/test-bucket/0k7lhh0pnb55_5.png", {st_mode=S_IFREG|0644, st_size=7204, ...}) = 0
[pid 499293] stat("/data/test-bucket/0k7lhh0pnb55_5.png", {st_mode=S_IFREG|0644, st_size=7204, ...}) = 0
And even openat():
[pid 499293] openat(AT_FDCWD, "/data/test-bucket/mdrictPCJK5-1P80w.png", O_RDONLY) = 6
[pid 499293] ioctl(6, FIOCLEX) = 0
[pid 499293] fstat(6, {st_mode=S_IFREG|0644, st_size=456, ...}) = 0
[pid 499293] ioctl(6, TCGETS, 0x7ffeeb50fd00) = -1 ENOTTY (Inappropriate ioctl for device)
[pid 499293] lseek(6, 0, SEEK_CUR) = 0
[pid 499293] close(6)
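If it is useful for anyone wanting to confirm or quantify this, the syscall counts can be captured like this (same command as above, with placeholder paths; requires strace):

# -f follows threads/children, -c prints a per-syscall count summary on exit.
strace -f -c -e trace=stat,lstat,fstat,openat \
  aws s3 sync --size-only --exclude '*.png' s3://test-bucket/ /data/test-bucket/

# Or keep the full trace in a file and count how many .png paths were stat()ed:
strace -f -o strace.log -e trace=stat \
  aws s3 sync --size-only --exclude '*.png' s3://test-bucket/ /data/test-bucket/
grep -c '\.png' strace.log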
Hi @mateuszkwiatkowski - thanks for reporting and sorry to hear that you're seeing a similar issue.

How much time does it usually take to transfer for you and how are you measuring it? Does it happen intermittently? For instances like this, we'd like to ask for debug logs, as they give us more insight into the root cause; without logs, we're very limited in how far we can see to help. You could retrieve the logs by adding --debug to your command.
Best,
John
Greetings! It looks like this issue hasn’t been active in longer than five days. We encourage you to check if this is still an issue in the latest release. In the absence of more information, we will be closing this issue soon. If you find that this is still a problem, please feel free to provide a comment or upvote with a reaction on the initial post to prevent automatic closure. If the issue is already closed, please feel free to open a new one.
The problem still persists with no changes.
My current workaround is to use two separate IAM roles: one has access to the entire bucket, and one does not have list permissions for the subfolder that contains the large number of objects. That way you can use the restricted role to change the top-level items and still access the subfolder via the normal role.
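The exact policies aren't shown in this thread, so purely as a hedged sketch (the role, policy, bucket, and prefix names below are all hypothetical): one way to take list access to a prefix away from a role is a Deny on s3:ListBucket conditioned on the s3:prefix request parameter. Note that S3 evaluates s3:prefix against the prefix supplied in the list request itself, so how much this changes what a root-level sync actually enumerates depends on how the listing is issued.

# Hypothetical names throughout - the real setup is not shown in the thread.
aws iam put-role-policy \
  --role-name restricted-sync-role \
  --policy-name deny-list-heavy-prefix \
  --policy-document '{
    "Version": "2012-10-17",
    "Statement": [{
      "Effect": "Deny",
      "Action": "s3:ListBucket",
      "Resource": "arn:aws:s3:::test-bucket",
      "Condition": {"StringLike": {"s3:prefix": ["test/*"]}}
    }]
  }'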
cc: @aBurmeseDev can you reopen the issue, please?
Checking in here, is this still an issue? If so, please provide requested info for further investigation.
How much time does it usually take to transfer for you and how are you measuring it? Does it happen intermittently? For instances like this, we'd like to ask for debug logs, as they give us more insight into the root cause; without logs, we're very limited in how far we can see to help. You could retrieve the logs by adding --debug to your command.
Greetings! It looks like this issue hasn’t been active in longer than five days. We encourage you to check if this is still an issue in the latest release. In the absence of more information, we will be closing this issue soon. If you find that this is still a problem, please feel free to provide a comment or upvote with a reaction on the initial post to prevent automatic closure. If the issue is already closed, please feel free to open a new one.
@aBurmeseDev The issue still persists when not using the workaround of adjusting permissions so that the excluded files can't be seen. (see https://github.com/aws/aws-cli/issues/7653#issuecomment-1523222650)
I just tried another run with only 100.000 objects in the excluded folder that runs for 2 minutes instead of the 4 seconds when the excluded folder is empty.
How much time does it usually take to transfer for you and how are you measuring it?
For the new attempt in a test environment with 100.000 objects in the excluded folder, it takes 1m47s instead of 4 seconds with an empty excluded folder. This is based on the GitHub Actions job runtime.
Does it happen intermittently?
This happens reliably every time.
For instances like this, we'd like to ask for debug logs, as they give us more insight into the root cause; without logs, we're very limited in how far we can see to help. You could retrieve the logs by adding --debug to your command.
Were you able to access the debug logs shared in the AWS Support Case (Case ID 11932289381), as mentioned in https://github.com/aws/aws-cli/issues/7653#issuecomment-1429372142? Since I can only upload 5MB there, I could not share the full logs, which would be over the limit even when compressed. Do you have a better way to share logs privately? The new logs are over the limit as well.

I can already see that it is pretty much a giant wall of this:
2024-08-06T14:50:02.9828947Z 2024-08-06 14:50:01,147 - MainThread - awscli.customizations.s3.filters - DEBUG - <s3_bucket>/<path>/<file_1_in_excluded_folder> did not match exclude filter: <filter_1>
2024-08-06T14:50:02.9833965Z 2024-08-06 14:50:01,147 - MainThread - awscli.customizations.s3.filters - DEBUG - <s3_bucket>/<path>/<file_1_in_excluded_folder> matched exclude filter: <filter_2>
2024-08-06T14:50:02.9838769Z 2024-08-06 14:50:01,148 - MainThread - awscli.customizations.s3.filters - DEBUG - =<s3_bucket>/<path>/<file_1_in_excluded_folder> final filtered status, should_include: False
2024-08-06T14:50:02.9843709Z 2024-08-06 14:50:01,148 - MainThread - awscli.customizations.s3.filters - DEBUG - <s3_bucket>/<path>/<file_2_in_excluded_folder> did not match exclude filter: <filter_1>
2024-08-06T14:50:02.9848865Z 2024-08-06 14:50:01,148 - MainThread - awscli.customizations.s3.filters - DEBUG - <s3_bucket>/<path>/<file_2_in_excluded_folder> matched exclude filter: <filter_2>
2024-08-06T14:50:02.9853549Z 2024-08-06 14:50:01,148 - MainThread - awscli.customizations.s3.filters - DEBUG - =<s3_bucket>/<path>/<file_2_in_excluded_folder> final filtered status, should_include: False
The line final filtered status, should_include: False appears about 100.000 times in the log file - once for every file in the ignored folder.
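For anyone checking their own logs, that count can be pulled out of a downloaded job log like this (the file name is a placeholder):

# Count how many objects the client-side filters evaluated.
grep -c 'final filtered status, should_include: False' debug.log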
Do you need any additional information?
Describe the bug
We sync files from local to the root of an S3 bucket. There is an existing folder in that same bucket which we --exclude for this specific sync - that folder has over 1.800.000 objects. The sync takes about 16 minutes - while doing nothing other than listing all objects, including those from the mentioned excluded folder, and checking them against the exclude rules.
Expected Behavior
The --exclude option should lead to the excluded paths being excluded during the call to list the S3 objects. This way, the exclude check would not need to be performed client-side, but rather server-side.
Current Behavior
The sync lists ALL objects in the entire bucket to then decide on the excludes client-side.
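For context (the bucket and prefix names below are placeholders): the only server-side narrowing that ListObjectsV2 offers is a single prefix plus a delimiter, so an arbitrary --exclude pattern has no server-side equivalent and the CLI ends up paging through every key and filtering locally. Roughly:

# A listing can be narrowed to one prefix...
aws s3api list-objects-v2 --bucket test-bucket --prefix assets/ --delimiter /
# ...but there is no "everything except test/*" parameter, so a sync to the
# bucket root has to retrieve all keys, including the excluded ones.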
Reproduction Steps
1. Create an S3 bucket test-bucket (choose a distinct name)
2. Upload a folder test with many random files (e.g. millions)
3. Sync local files to the root of the bucket while excluding the test folder, e.g. aws s3 sync --delete . s3://test-bucket --exclude "test/*"

Possible Solution
The S3 ListObjectsV2 endpoint that is used does not seem to expose filter options.

Additional Information/Context
No response
CLI version used
2.9.19
Environment details (OS name and version, etc.)
Ubuntu 22.04.1 (on GitHub Actions)