Closed MrJoy closed 1 year ago
Hi @MrJoy, thanks for reaching out. Have you tried using the --size-only parameter documented here? This parameter makes the size of each key the only criterion used to decide whether to sync from source to destination, so it should ignore all of those files that are the same size.
I attempted it just now, and it did not change anything. It's still attempting to download ~everything. Note that after submitting this ticket, I updated to 2.7.27 -- so this test was done on 2.7.27 not 2.7.26.
Thanks for the update. There is an older issue tracking problems with S3 sync here: https://github.com/aws/aws-cli/issues/599. Some users have reported anomalies where certain files sync that should not, but I wouldn't expect the problem at the scale you're describing, where it's happening with hundreds of files. I don't know if I'd be able to reproduce the issue as described but could try. If you can get the debug logs by adding --debug to the command, that might also give more insight into the problem. Some have said that using --size-only or --exact-timestamps has helped produce the expected results. There are other S3 sync-related feature requests, like https://github.com/aws/aws-cli/issues/6750, that relate to using new checksum algorithms to improve accuracy.
@tim-finnigan I'm sorry, I was unclear in my last message: When I said "I attempted it just now", I meant "I attempted to use --size-only just now". Accidental pronoun game, FTL.
If you'd like, I can temporarily give you read-only credentials for this bucket and you can see if you are able to recreate the problem from the same source. My personal AWS bill is... not data I'm terribly worried about sharing.
I'll get --debug output and add it here tomorrow. I'm about to be AFK for a while.
I've stripped tokens/signatures/key IDs from the file, but it's otherwise as produced from running:
aws-vault exec mrjoy -- aws s3 sync --debug --size-only s3://mrjoy-billing-data/ ~/personal/Finance/AWS_Billing_Data/
Thanks @MrJoy for following up and sharing your logs. I couldn't identify any anomalies after scanning through the logs. I think attempting to recreate the issue is a good idea, but for that I recommend reaching out through AWS Support to open a private communication channel. I'd also recommend trying --exact-timestamps when running the sync command to see if that addresses the issue you're seeing.
I went ahead and tried --exact-timestamps by itself and in combination with --size-only, and the behavior seems to be the same in all cases.
Going through AWS Support is not an option, as this is my personal account and I'm on the Basic plan.
Checking in on this issue again - thanks for your patience. I think this issue might actually overlap with https://github.com/aws/aws-cli/issues/5730, https://github.com/aws/aws-cli/issues/648 and/or https://github.com/aws/aws-cli/issues/5369. Have you looked through any of those issues? Based on some of the comments it sounds like this could be due to how S3 handles timestamps.
Using --size-only, without --exact-timestamps, does not alleviate the problem. Doesn't --size-only cause aws-cli to disregard timestamps?
Hi again, thanks for your patience, I lost track of this issue. Per the s3 sync documentation, --size-only does the following:
--size-only (boolean) Makes the size of each key the only criteria used to decide whether to sync from source to destination.
And --exact-timestamps does the following:
--exact-timestamps (boolean) When syncing from S3 to local, same-sized items will be ignored only when the timestamps match exactly. The default behavior is to ignore same-sized items unless the local version is newer than the S3 version.
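The two documented rules above can be read as a small decision function. The following is only a sketch of that reading for the S3-to-local direction, not the aws-cli implementation: the function name `should_download` and its parameters are hypothetical, it assumes the object has already been matched to a local file, and letting `size_only` take precedence when both flags are set is an assumption.

```python
from datetime import datetime, timezone


def should_download(s3_size, local_size, s3_mtime, local_mtime,
                    size_only=False, exact_timestamps=False):
    """Simplified model of the documented S3 -> local sync rules.

    Hypothetical helper: assumes the S3 object and the local file have
    already been paired up under the same relative key.
    """
    # A size difference always triggers a transfer.
    if s3_size != local_size:
        return True
    # --size-only: size is the only criterion (assumed to take precedence).
    if size_only:
        return False
    # --exact-timestamps: same-sized items are skipped only on an exact match.
    if exact_timestamps:
        return s3_mtime != local_mtime
    # Documented default: skip same-sized items unless the local copy is newer.
    return local_mtime > s3_mtime


older = datetime(2021, 1, 23, 8, 26, 56, tzinfo=timezone.utc)
newer = datetime(2022, 1, 1, tzinfo=timezone.utc)
print(should_download(6458, 6458, older, newer, size_only=True))
```

Under this model, a same-sized file is never re-downloaded when `size_only` is set, so repeated downloads with --size-only point at something other than the size and timestamp comparison, such as the file not being matched to its S3 object in the first place.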
Do you have any updates on your end as far as what you've tried? I still can't reproduce the issue but invite others to share their insights here if they know what the problem could be.
The totality of my script is, at present, this:
#!/bin/bash
IFS=$'\n\t'
set -euo pipefail
(
cd ~/personal
git add .
git commit --all --allow-empty -m "AWS bill snapshot, pre-fetch..."
aws-vault exec mrjoy -- aws s3 sync --size-only s3://mrjoy-billing-data/ ~/personal/Finance/AWS_Billing_Data/
git add .
git commit --all --allow-empty -m "AWS bill snapshot, post-fetch..."
)
(
cd ~/mjbackup/aws
git add .
git commit --all --allow-empty -m "AWS log snapshot, pre-fetch..."
aws-vault exec mrjoy -- aws s3 sync --size-only s3://mrjoy-logs/ ~/mjbackup/aws/access/
aws-vault exec mrjoy -- aws s3 sync --size-only s3://mrjoy-api-logs/ ~/mjbackup/aws/api/
git add .
git commit --all --allow-empty -m "AWS log snapshot, post-fetch..."
)
echo 'Done.'
As of today, that first sync job has an issue and the other two do not. So the problem is clearly dependent upon the data in S3 and/or my local filesystem.
In the case of the first sync job, it's notable that only the cur sub-directory is affected -- and every single object under there is affected. There's about 58MB of files that sit parallel to the cur folder of the bucket and they do not get re-synced on every run. The 744.4MB of files under cur are re-synced every single time, with no changes resulting.
Currently, I'm using aws-cli version:
aws-cli/2.11.0 Python/3.11.2 Darwin/21.6.0 source/arm64 prompt/off
I'm doing a test real quick to have the first sync happen to a different folder, so I can see if it's something to do with the local FS side of things. Will post results momentarily.
I'm happy to give you temporary access to that bucket so you can see if that's helpful in reproducing the issue.
(Just to clarify: When I say no changes result, I mean I wind up with an empty commit despite aws-cli downloading 744.4MB of data.)
Ok. Re-running (twice) against a clean sub-folder produces the same behavior of the data being re-synced. So it seems to be an issue on the S3 side, not something related to how the data (originally) got stored on disk locally.
% ls -la ~/personal/Finance/AWS_Billing_Data/cur/billing_and_usage/20210101-20210201/20210123T082656Z/billing_and_usage-Manifest.json
-rw-r--r-- 1 jonathonfrisby staff 6458 Jan 23 2021 /Users/jonathonfrisby/personal/Finance/AWS_Billing_Data/cur/billing_and_usage/20210101-20210201/20210123T082656Z/billing_and_usage-Manifest.json
An example of the details of one object that's getting re-synced.
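For anyone wanting to do the same spot check programmatically, here is a minimal sketch that reads a local file's size and modification time in UTC so they can be compared against what aws s3 ls prints for the matching object. The helper name `local_size_and_mtime` is hypothetical, and the sketch only covers the local side of the comparison.

```python
import os
from datetime import datetime, timezone


def local_size_and_mtime(path):
    """Return (size in bytes, modification time as an aware UTC datetime)."""
    st = os.stat(path)
    return st.st_size, datetime.fromtimestamp(st.st_mtime, tz=timezone.utc)


if __name__ == "__main__":
    import sys

    # Usage: python check_file.py <path-to-local-file>
    for path in sys.argv[1:]:
        size, mtime = local_size_and_mtime(path)
        print(f"{mtime:%Y-%m-%d %H:%M:%S} {size:>10} {path}")
```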
@tim-finnigan Would it be helpful if I gave you access to the relevant S3 bucket?
I've identified the problem. I had AWS configured to put billing and usage reports under the prefix "/cur/". That got interpreted as a directory entry named "/" holding a directory entry named "cur" holding a directory entry named "/".
After I corrected the prefix, and moved things out of the "/" folders, the sync process shows no changes.
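To illustrate why a prefix like "/cur/" is awkward, here is a small sketch showing that such a key contains empty path segments, and that a normalized filesystem-style path cannot reproduce it. The example key is hypothetical (the real object names in the bucket are not shown in this thread), the helper names are made up for illustration, and the idea that this mismatch is what keeps sync from pairing local files with their objects is an assumption rather than something confirmed from the aws-cli source.

```python
import posixpath


def key_segments(key):
    """Split an S3 key on '/' exactly as written, keeping empty segments."""
    return key.split("/")


def has_empty_directory_segment(key):
    """True if any 'directory' part of the key (before the last '/') is empty."""
    return any(segment == "" for segment in key.split("/")[:-1])


def normalized_relative_path(key):
    """What a filesystem-style path looks like once empty parts are collapsed."""
    return posixpath.normpath(key).lstrip("/")


# Hypothetical key produced by a report prefix configured as "/cur/".
key = "/cur//billing_and_usage/report.csv"

print(key_segments(key))
print(has_empty_directory_segment(key))
print(normalized_relative_path(key))
print(normalized_relative_path(key) == key)
```

A check like `has_empty_directory_segment` could be run over a bucket listing to spot keys with leading or doubled slashes before they cause trouble.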
Describe the bug
I have a maintenance script I run to keep a local copy of billing & usage data for my personal AWS account. It's identifying almost every file as changed on every run, even though most of the files haven't been modified in years.
Expected Behavior
Only changed files -- in this case, files representing the current billing period -- should be downloaded.
Current Behavior
Of 6,279 files that do not represent the current billing period, it's consistently re-downloading 5,831 of them. The files it downloads are byte-for-byte identical to the existing ones. I spot-checked one of the files, and aws s3 ls reports the exact same size and timestamp as ls does.
Reported by aws s3 sync:
Reported by aws s3 ls:
Reported by ls:
The post-fetch commit in all cases shows diffs for the files in the current billing period (as would be expected), and no changes to any of the other files that aws s3 sync reports as being downloaded.
All told, aws s3 sync appears to be downloading around 700MB of files on each run that it shouldn't be.
Reproduction Steps
The relevant portion of my script is:
The data in the bucket is written by AWS itself.
Possible Solution
No response
Additional Information/Context
No response
CLI version used
2.7.26
Environment details (OS name and version, etc.)
macOS 12.5.1