aws / aws-cli

Universal Command Line Interface for Amazon Web Services

`aws s3 sync` downloading unchanged files. #7228

Closed by MrJoy 1 year ago

MrJoy commented 2 years ago

Describe the bug

I have a maintenance script I run to keep a local copy of billing & usage data for my personal AWS account. It identifies almost every file as changed on every run, even though most of the files haven't been modified in years.

Expected Behavior

Only changed files -- in this case, files representing the current billing period -- should be downloaded.

Current Behavior

Of 6,279 files that do not represent the current billing period, it's consistently re-downloading 5,831 of them. The files it downloads are byte-for-byte identical to the existing ones. I spot-checked one of the files, and aws s3 ls reports the exact same size and timestamp as ls does.

Reported by aws s3 sync:

download: s3://mrjoy-billing-data//cur//billing_and_usage/20210101-20210201/20210122T235314Z/billing_and_usage-00001.csv.gz to ../personal/Finance/AWS_Billing_Data/cur/billing_and_usage/20210101-20210201/20210122T235314Z/billing_and_usage-00001.csv.gz

Reported by aws s3 ls:

% aws-vault exec mrjoy -- aws s3 ls s3://mrjoy-billing-data//cur//billing_and_usage/20210101-20210201/20210122T235314Z/billing_and_usage-00001.csv.gz
2021-01-22 15:53:24     296522 billing_and_usage-00001.csv.gz

Reported by ls:

% ls -laD "%Y-%m-%d %H:%M:%S" ~/personal/Finance/AWS_Billing_Data/cur/billing_and_usage/20210101-20210201/20210122T235314Z/billing_and_usage-00001.csv.gz
-rw-r--r--  1 jonathonfrisby  staff  296522 2021-01-22 15:53:24 /Users/jonathonfrisby/personal/Finance/AWS_Billing_Data/cur/billing_and_usage/20210101-20210201/20210122T235314Z/billing_and_usage-00001.csv.gz
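
The spot check above can be repeated for more files with a small helper. This is a minimal sketch, assuming the `aws s3 ls` line has the three-column layout shown above (date, time, size, then name); the function name `same_size_as_listing` is made up for illustration and is not part of aws-cli.

```shell
#!/bin/bash
# Compare the size column of one `aws s3 ls` output line with a local file.
# Usage: same_size_as_listing "LISTING_LINE" LOCAL_FILE
# A listing line looks like: "2021-01-22 15:53:24     296522 name.csv.gz"
same_size_as_listing() {
  local listing=$1 file=$2 s3_size local_size
  # Third whitespace-separated field is the size in bytes.
  s3_size=$(printf '%s\n' "$listing" | awk '{print $3}')
  # Arithmetic expansion strips the padding some `wc` versions print.
  local_size=$(( $(wc -c <"$file") ))
  [ "$s3_size" -eq "$local_size" ]
}
```

It returns success when the sizes agree, so it can be used as `same_size_as_listing "$line" "$path" || echo "size differs: $path"`.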

The post-fetch commit in all cases shows diffs for the files in the current billing period (as would be expected), and no changes to any of the other files that aws s3 sync reports as being downloaded.

All told, aws s3 sync appears to be downloading around 700MB of files on each run that it shouldn't be.
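
One way to measure this without transferring anything is the `--dryrun` flag, which makes `aws s3 sync` print the operations it would perform instead of running them. The following is a sketch only: it assumes the dry-run lines look like the `download:` line shown earlier with a `(dryrun) ` prefix, and `count_downloads` is a hypothetical helper name.

```shell
#!/bin/bash
# Count the download operations in `aws s3 sync` output read from stdin.
# Matches "download: s3://... to ..." and, with --dryrun,
# "(dryrun) download: s3://... to ...".
count_downloads() {
  # grep -c exits non-zero when the count is 0; keep that from aborting
  # scripts that run under `set -e`.
  grep -cE '^(\(dryrun\) )?download: ' || true
}

# Example (requires credentials; bucket and path are the ones from this report):
#   aws-vault exec mrjoy -- aws s3 sync --dryrun \
#     s3://mrjoy-billing-data/ ~/personal/Finance/AWS_Billing_Data/ | count_downloads
```

Running it twice in a row should report a small number on the second run if sync is behaving as expected.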

Reproduction Steps

The relevant portion of my script is:

#!/bin/bash
IFS=$'\n\t'
set -euo pipefail

(
  cd ~/personal
  git add .
  git commit --all --allow-empty -m "AWS bill snapshot, pre-fetch..."
  aws-vault exec mrjoy -- aws s3 sync s3://mrjoy-billing-data/ ~/personal/Finance/AWS_Billing_Data/
  git add .
  git commit --all --allow-empty -m "AWS bill snapshot, post-fetch..."
)

The data in the bucket is written by AWS itself.

Possible Solution

No response

Additional Information/Context

No response

CLI version used

2.7.26

Environment details (OS name and version, etc.)

macOS 12.5.1

tim-finnigan commented 2 years ago

Hi @MrJoy thanks for reaching out. Have you tried using the --size-only parameter documented here? This parameter makes the size of each key the only criteria used to decide whether to sync from source to destination. So it should ignore all of those files that are the same size.

MrJoy commented 2 years ago

I attempted it just now, and it did not change anything. It's still attempting to download ~everything. Note that after submitting this ticket, I updated to 2.7.27 -- so this test was done on 2.7.27 not 2.7.26.

tim-finnigan commented 2 years ago

Thanks for the update. There is an older issue tracking problems with S3 sync here: https://github.com/aws/aws-cli/issues/599. Some users have reported anomalies when certain files sync that should not, but I wouldn't expect the problem at the scale you're describing where it's happening with hundreds of files. I don't know if I'd be able to reproduce the issue as described but could try. If you can get the debug logs by adding --debug to the command that might also give more insight into the problem. Some have said that using --size-only or --exact-timestamps has helped produce the expected results. There are other S3 sync-related feature requests like https://github.com/aws/aws-cli/issues/6750 that relate to using new checksum algorithms for improving the accuracy.

MrJoy commented 2 years ago

@tim-finnigan I'm sorry, I was unclear in my last message: When I said "I attempted it just now", I meant "I attempted to use --size-only just now". Accidental pronoun game, FTL.

If you'd like, I can temporarily give you read-only credentials for this bucket and you can see if you are able to recreate the problem from the same source. My personal AWS bill is... not data I'm terribly worried about sharing.

I'll get --debug output and add it here tomorrow. I'm about to be AFK for a while.

MrJoy commented 2 years ago

debug.log.cleansed.zip

I've stripped tokens/signatures/key IDs from the file, but it's otherwise as produced from running:

aws-vault exec mrjoy -- aws s3 sync --debug --size-only s3://mrjoy-billing-data/ ~/personal/Finance/AWS_Billing_Data/

tim-finnigan commented 2 years ago

Thanks @MrJoy for following up and sharing your logs. I couldn't identify any anomalies after scanning through the logs. I think attempting to recreate the issue is a good idea, but for that I recommend reaching out through AWS Support to open a private communication channel. I'd also recommend trying to use --exact-timestamps when running the sync command to see if that addresses the issue you're seeing.

MrJoy commented 2 years ago

I went ahead and tried --exact-timestamps by itself and in combination with --size-only, and the behavior seems to be the same in all cases.

Going through AWS Support is not an option, as this is my personal account and I'm on the Basic plan.

tim-finnigan commented 2 years ago

Checking in on this issue again - thanks for your patience. I think this issue might actually overlap with https://github.com/aws/aws-cli/issues/5730, https://github.com/aws/aws-cli/issues/648 and/or https://github.com/aws/aws-cli/issues/5369. Have you looked through any of those issues? Based on some of the comments it sounds like this could be due to how S3 handles timestamps.

MrJoy commented 2 years ago

Using --size-only without --exact-timestamps does not alleviate the problem. Doesn't --size-only cause aws-cli to disregard timestamps?

tim-finnigan commented 1 year ago

Hi again, thanks for your patience, I lost track of this issue. Per the s3 sync documentation --size-only does the following:

--size-only (boolean) Makes the size of each key the only criteria used to decide whether to sync from source to destination.

And --exact-timestamps does the following:

--exact-timestamps (boolean) When syncing from S3 to local, same-sized items will be ignored only when the timestamps match exactly. The default behavior is to ignore same-sized items unless the local version is newer than the S3 version.
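
Read literally, the two descriptions above amount to a per-object decision rule for S3-to-local syncs. The sketch below encodes only that documented wording, not the aws-cli source, so treat it as an interpretation; the function name `would_download` and its argument order are invented for illustration.

```shell
#!/bin/bash
# Decide whether an S3 -> local sync would download one object, following
# only the documentation text quoted above.
# Usage: would_download MODE S3_SIZE LOCAL_SIZE S3_MTIME LOCAL_MTIME
#   MODE is one of: default, size-only, exact-timestamps
#   sizes are in bytes, mtimes are Unix epoch seconds
# Returns 0 (download), 1 (skip), or 2 (unknown mode).
would_download() {
  local mode=$1 s3_size=$2 local_size=$3 s3_mtime=$4 local_mtime=$5
  # A size difference means a download in every mode.
  if [ "$s3_size" -ne "$local_size" ]; then
    return 0
  fi
  case "$mode" in
    # Same size: size is the only criterion, so skip.
    size-only) return 1 ;;
    # Same size: skipped only when the timestamps match exactly.
    exact-timestamps) [ "$s3_mtime" -ne "$local_mtime" ] ;;
    # Same size: skipped unless the local copy is newer than the S3 one.
    default) [ "$local_mtime" -gt "$s3_mtime" ] ;;
    *) echo "unknown mode: $mode" >&2; return 2 ;;
  esac
}
```

Under this reading, an object whose size and timestamp are identical on both sides, like the one spot-checked in the original report, would be skipped in all three modes, which is why the observed re-downloads are surprising.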

Do you have any updates on your end as far as what you've tried? I still can't reproduce the issue but invite others to share their insights here if they know what the problem could be.

MrJoy commented 1 year ago

The totality of my script is, at present, this:

#!/bin/bash
IFS=$'\n\t'
set -euo pipefail

(
  cd ~/personal
  git add .
  git commit --all --allow-empty -m "AWS bill snapshot, pre-fetch..."
  aws-vault exec mrjoy -- aws s3 sync --size-only s3://mrjoy-billing-data/ ~/personal/Finance/AWS_Billing_Data/
  git add .
  git commit --all --allow-empty -m "AWS bill snapshot, post-fetch..."
)

(
  cd ~/mjbackup/aws
  git add .
  git commit --all --allow-empty -m "AWS log snapshot, pre-fetch..."
  aws-vault exec mrjoy -- aws s3 sync --size-only s3://mrjoy-logs/ ~/mjbackup/aws/access/
  aws-vault exec mrjoy -- aws s3 sync --size-only s3://mrjoy-api-logs/ ~/mjbackup/aws/api/
  git add .
  git commit --all --allow-empty -m "AWS log snapshot, post-fetch..."
)

echo 'Done.'

As of today, that first sync job has an issue and the other two do not. So the problem is clearly dependent upon the data in S3 and/or my local filesystem.

In the case of the first sync job, it's notable that only the cur sub-directory is affected -- and every single object under there is affected. There's about 58MB of files that sit parallel to the cur folder of the bucket and they do not get re-synced on every run. The 744.4MB of files under cur are re-synced every single time, with no changes resulting.

Currently, I'm using aws-cli version:

aws-cli/2.11.0 Python/3.11.2 Darwin/21.6.0 source/arm64 prompt/off

I'm doing a test real quick to have the first sync happen to a different folder, so I can see if it's something to do with the local FS side of things. Will post results momentarily.

I'm happy to give you temporary access to that bucket so you can see if that's helpful in reproducing the issue.

MrJoy commented 1 year ago

(Just to clarify: When I say no changes result, I mean I wind up with an empty commit despite aws-cli downloading 744.4MB of data.)

MrJoy commented 1 year ago

Ok. Re-running (twice) against a clean sub-folder produces the same behavior of the data being re-synced. So it seems to be an issue on the S3 side, not something related to how the data (originally) got stored on disk locally.

MrJoy commented 1 year ago

% ls -la ~/personal/Finance/AWS_Billing_Data/cur/billing_and_usage/20210101-20210201/20210123T082656Z/billing_and_usage-Manifest.json
-rw-r--r--  1 jonathonfrisby  staff  6458 Jan 23  2021 /Users/jonathonfrisby/personal/Finance/AWS_Billing_Data/cur/billing_and_usage/20210101-20210201/20210123T082656Z/billing_and_usage-Manifest.json

An example of the details of one object that's getting re-synced.

MrJoy commented 1 year ago

@tim-finnigan Would it be helpful if I gave you access to the relevant S3 bucket?

MrJoy commented 1 year ago

I've identified the problem. I had AWS configured to put billing and usage reports under the prefix "/cur/". That got interpreted as a directory entry named "/" holding a directory entry named "cur" holding a directory entry named "/".

After I corrected the prefix, and moved things out of the "/" folders, the sync process shows no changes.
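
Keys of this shape can be spotted by looking for a leading "/" or a doubled "//", both of which produce an empty path segment (as in `s3://mrjoy-billing-data//cur//billing_and_usage/...` from the original report). This is a rough sketch, not an aws-cli feature: `keys_with_empty_segments` is a hypothetical helper, and it assumes keys arrive one per line and contain no tabs or newlines.

```shell
#!/bin/bash
# Read S3 object keys from stdin (one per line) and print those that start
# with "/" or contain "//", i.e. keys with an empty path segment.
keys_with_empty_segments() {
  # `|| true` keeps "no matches" from aborting scripts under `set -e`.
  grep -E '^/|//' || true
}

# Example (requires credentials; bucket name is the one from this report):
#   aws s3api list-objects-v2 --bucket mrjoy-billing-data \
#     --query 'Contents[].Key' --output text | tr '\t' '\n' | keys_with_empty_segments
```

An empty result suggests the bucket has no such keys; any output lists the objects worth moving to a corrected prefix.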
