aws / aws-cli

Universal Command Line Interface for Amazon Web Services

aws s3 ls allows listing by file prefix, but aws s3 sync does not #4240

Open benjamin-maynard opened 5 years ago

benjamin-maynard commented 5 years ago

Hi,

I have an S3 Bucket that has a range of files like this:

0000-abcdef-ghiklm-nopqrs
1000-abcdef-ghiklm-nopqrs
2000-abcdef-ghiklm-nopqrs
3000-abcdef-ghiklm-nopqrs
a000-abcdef-ghiklm-nopqrs
b000-abcdef-ghiklm-nopqrs
c000-abcdef-ghiklm-nopqrs
d000-abcdef-ghiklm-nopqrs

Each prefix has millions of files in the bucket.

If I want to list files by a prefix, I can do this really efficiently, like so:
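For example, something along these lines (bucket name illustrative):

# list only keys that start with "1" -- the prefix is applied server side
aws s3 ls s3://my-bucket/1
# likewise for keys that start with "2"
aws s3 ls s3://my-bucket/2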

This is efficient, as it doesn't return any files without the appropriate prefix. The API handles the listing, only returning files that start with 1 or 2.


The same isn't possible for the aws s3 sync command.

Neither aws s3 sync s3://my-bucket/1* s3://my-new-bucket/ nor aws s3 sync s3://my-bucket/1 s3://my-new-bucket/ works.

The only way to do it is by adding --exclude and --include parameters, for example: aws s3 sync s3://my-bucket/ s3://my-new-bucket/ --exclude "*" --include "1*"

But this is really inefficient, as it lists the entire contents of the bucket and filters each item client side to determine whether it matches. This approach makes perfect sense for normal include/exclude operations like --exclude "*" --include "*.txt", but for prefixed files it is inefficient.

I know that folder/path prefixes should be used where possible, but that isn't always an option. The aws s3 ls command is clearly able to return only files with a specific prefix, so could this behaviour be added to the sync command? Otherwise syncs between buckets take weeks instead of hours.
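In the meantime, a workaround is to drop down to the lower-level s3api commands, which expose the ListObjectsV2 Prefix parameter directly, and copy the matching keys one by one. A rough sketch (bucket names as above; this loses sync's change detection and issues one cp call per object):

aws s3api list-objects-v2 --bucket my-bucket --prefix "1" \
    --query 'Contents[].Key' --output text | tr '\t' '\n' |
while read -r key; do
    # only keys starting with "1" are listed, thanks to server-side prefix filtering
    aws s3 cp "s3://my-bucket/$key" "s3://my-new-bucket/$key"
done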

benjamin-maynard commented 5 years ago

@JordonPhillips Any views on this?

inopinatus commented 5 years ago

I have this problem also, with a CloudFront log bucket containing hundreds of thousands of files, and --include and --exclude are so slow as to be unusable.

kyleknap commented 5 years ago

Marking this as a feature request. To support that style of syncing we would need to add a flag to sync. Alternatively, it would also make sense to make the --include and --exclude flags more efficient in terms of determining what we actually list in the bucket.
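To illustrate the kind of optimization being discussed (a sketch of possible behavior, not how the CLI works today): when the only wildcard in an include pattern is a trailing *, the pattern maps directly onto the ListObjectsV2 Prefix parameter, so the filtering could happen server side instead of after a full listing:

# conceptually what --exclude "*" --include "1*" does today: list everything, filter client side
aws s3api list-objects-v2 --bucket my-bucket
# what it could be translated to when the only wildcard is a trailing one
aws s3api list-objects-v2 --bucket my-bucket --prefix 1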

benjamin-maynard commented 5 years ago

Hi @kyleknap - I'd say this should be default functionality, so it should be flagged as a bug rather than a feature request.

aws s3 ls s3://my-bucket/1 will only return files beginning with 1

I therefore think we should expect the same with aws s3 sync s3://my-bucket/1 s3://my-bucket-2/ or aws s3 cp --recursive s3://my-bucket/1 s3://my-bucket-2/

--include and --exclude should be left as-is: while inefficient, they enable more complex patterns like *.jpg.

I'd be interested to know your thoughts.

TheWildHorse commented 4 years ago

Did anyone find any solution for this? I am having huge problems with the client-side handling of this.

dongting commented 4 years ago

Not sure if this has been fixed, but mine works in what seems to be the same scenario:

$ aws --version
aws-cli/1.18.28 Python/3.6.8 Darwin/19.3.0 botocore/1.15.28
$ aws s3 sync s3://<bucket1>/<prefix1>/test_data/ s3://<bucket2>/
copy: s3://<bucket1>/<prefix1>/test_data/test-file-1.txt to s3://<bucket2>/test-file-1.txt
copy: s3://<bucket1>/<prefix1>/test_data/test-file-2.txt to s3://<bucket2>/test-file-2.txt
copy: s3://<bucket1>/<prefix1>/test_data/test_folder/test-file-3.txt to s3://<bucket2>/test_folder/test-file-3.txt

This also worked for me (note the change from test_data to test_data2):

$ aws s3 sync s3://<bucket1>/<prefix1>/test_data/ s3://<bucket2>/<prefix1>/test_data2/
copy: s3://<bucket1>/<prefix1>/test_data/test-file-1.txt to s3://<bucket2>/<prefix1>/test_data2/test-file-1.txt
copy: s3://<bucket1>/<prefix1>/test_data/test-file-2.txt to s3://<bucket2>/<prefix1>/test_data2/test-file-2.txt
copy: s3://<bucket1>/<prefix1>/test_data/test_folder/test-file-3.txt to s3://<bucket2>/<prefix1>/test_data2/test_folder/test-file-3.txt

lucasbasquerotto commented 4 years ago

@dongting It seems yours is not the same scenario, as you are syncing the entire folder, instead of using a file prefix in the folder, like aws s3 sync s3://<bucket1>/<prefix1>/test_data/test-file- s3://<bucket2>/.


Regarding the initial issue, I would prefer sync to use a prefix with a wildcard (like s3://<bucket1>/file-prefix-*) instead of just the bare prefix like ls does (s3://<bucket1>/file-prefix-), for two reasons:

1. Making the bare prefix behave like ls would be a breaking change for anyone who already uses sync and expects the current behavior.

2. It makes it explicit that a prefix is being used (sync is normally a more dangerous action than ls, and syncing all files with a prefix when you expect to sync only an exact match could have very unwanted consequences).

That said, if the efficiency of --exclude "*" --include "prefix*" were to match the efficiency of ls (in this specific case, where the only wildcard is at the end of the include pattern), that would be fine too, without having to change the current syntax.
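To make the two styles concrete (the wildcard form for sync is the proposal here, not existing syntax):

# existing ls behavior: a bare trailing prefix
aws s3 ls s3://<bucket1>/file-prefix-
# proposed sync behavior: an explicit wildcard marking the prefix match
aws s3 sync 's3://<bucket1>/file-prefix-*' s3://<bucket2>/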

bes1002t commented 4 years ago

I think we have two issues here.

  1. You want to sync files using a prefix. Solution: allow appending partial paths or prefixes to the bucket URL.

  2. There is no way to bulk sync multiple files by passing multiple paths to the sync command. You have to call sync several times to sync all the files you want, which is very inefficient. Solution: add an option such as --includePath (see the sketch below).
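A hypothetical invocation for the second point could look like this (--includePath is not a real flag today; the paths are illustrative):

# hypothetical syntax -- not currently supported by the AWS CLI
aws s3 sync s3://my-bucket/ s3://my-new-bucket/ --includePath logs/2023/ --includePath images/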

yarikoptic commented 1 year ago

and moreover -- it just silently does nothing and does not exit with a non-zero exit code... uff

❯ aws s3 ls s3://dandiarchive-logs/2023-04-24-19-15-3
2023-04-24 15:15:31      91377 2023-04-24-19-15-30-02B638249E7D6A3C
2023-04-24 15:15:33      81438 2023-04-24-19-15-32-42D8F376DAE3D09C
2023-04-24 15:15:37        579 2023-04-24-19-15-36-2715E307EF06A155
❯ aws s3 sync s3://dandiarchive-logs/2023-04-24-19-15-3 .
❯ echo $?
0

FWIW, s3cmd seems to work fine

❯ s3cmd -c ~/.s3cfg-dandi-backup sync  s3://dandiarchive-logs/2023-04-24-19-15-3 .
download: 's3://dandiarchive-logs/2023-04-24-19-15-30-02B638249E7D6A3C' -> './2023-04-24-19-15-30-02B638249E7D6A3C'  [1 of 3]
 91377 of 91377   100% in    0s   442.15 KB/s  done
download: 's3://dandiarchive-logs/2023-04-24-19-15-32-42D8F376DAE3D09C' -> './2023-04-24-19-15-32-42D8F376DAE3D09C'  [2 of 3]
 81438 of 81438   100% in    0s   694.83 KB/s  done
download: 's3://dandiarchive-logs/2023-04-24-19-15-36-2715E307EF06A155' -> './2023-04-24-19-15-36-2715E307EF06A155'  [3 of 3]
 579 of 579   100% in    0s     7.81 KB/s  done
Done. Downloaded 173394 bytes in 1.0 seconds, 169.33 KB/s.
❯ ls
2023-04-24-19-15-30-02B638249E7D6A3C*  2023-04-24-19-15-32-42D8F376DAE3D09C*  2023-04-24-19-15-36-2715E307EF06A155*

yarikoptic commented 1 year ago

The --exclude / --include dance to solve the issue "works" in that it achieves the goal, but it is VERY inefficient since I believe the filtering is done client side:

❯ time aws s3 sync --exclude=* --include 2023-04-24-19-15-3* s3://dandiarchive-logs/ .
download: s3://dandiarchive-logs/2023-04-24-19-15-36-2715E307EF06A155 to ./2023-04-24-19-15-36-2715E307EF06A155
download: s3://dandiarchive-logs/2023-04-24-19-15-30-02B638249E7D6A3C to ./2023-04-24-19-15-30-02B638249E7D6A3C
download: s3://dandiarchive-logs/2023-04-24-19-15-32-42D8F376DAE3D09C to ./2023-04-24-19-15-32-42D8F376DAE3D09C

aws s3 sync --exclude=* --include 2023-04-24-19-15-3* s3://dandiarchive-logs/  1158.25s user 9.73s system 52% cpu 36:44.04 total

so it took 36 minutes to complete this sync of 3 keys, and I saw the network being busy, so I guess it was fetching the full listing and doing the selection locally.

palmobar commented 1 month ago

Sorry for waking up this thread, but --include and --exclude seem to be extremely inefficient.