benjamin-maynard opened 5 years ago
@JordonPhillips Any views on this?
I have this problem also, with a CloudFront log bucket containing hundreds of thousands of files, and --include and --exclude are so slow as to be unusable.
Marking it as a feature request. To support that style of syncing we would need to add a flag to sync. Otherwise, it would also make sense to make the --include and --exclude flags more efficient in terms of determining what we actually list in the bucket.
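The efficiency gap comes from where the filtering happens: a prefix can be applied server-side by the S3 ListObjects API, while --include/--exclude glob patterns are applied client-side after listing. A minimal pure-Python sketch of the two filtering styles, with a hypothetical in-memory key list standing in for the bucket:

```python
from fnmatch import fnmatch

# Hypothetical bucket contents
keys = ["1-aaa.log", "1-bbb.log", "2-ccc.log", "2-ddd.log"]

def list_with_prefix(keys, prefix):
    # Server-side style: S3 only returns keys starting with the prefix,
    # so non-matching keys are never sent to the client.
    return [k for k in keys if k.startswith(prefix)]

def list_with_include(keys, pattern):
    # Client-side style: the CLI lists *every* key, then filters with
    # glob matching (the --exclude "*" --include pattern dance).
    return [k for k in keys if fnmatch(k, pattern)]

assert list_with_prefix(keys, "1") == ["1-aaa.log", "1-bbb.log"]
assert list_with_include(keys, "1*") == ["1-aaa.log", "1-bbb.log"]
```

Both return the same keys here; the difference is that the second form has to walk the entire bucket listing to find them.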
Hi @kyleknap - I'd say this should be default functionality, and instead this should be flagged as a bug.
aws s3 ls s3://my-bucket/1
will only return files beginning with 1. I therefore think we should expect the same from
aws s3 sync s3://my-bucket/1 s3://my-bucket-2/
or
aws s3 cp --recursive s3://my-bucket/1 s3://my-bucket-2/
--include and --exclude should be left as is: while inefficient, they enable more complex patterns like *.jpg.
I'd be interested to know your thoughts.
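The ls-style interpretation being proposed here can be sketched as follows (a hypothetical helper, not the CLI's actual URL parser): everything after the bucket name would be treated as a key prefix, exactly as aws s3 ls treats it.

```python
def split_s3_url(url):
    # Proposed ls-style interpretation: the part after the bucket
    # name is a key prefix, not necessarily a "folder" path.
    assert url.startswith("s3://")
    bucket, _, prefix = url[len("s3://"):].partition("/")
    return bucket, prefix

assert split_s3_url("s3://my-bucket/1") == ("my-bucket", "1")
assert split_s3_url("s3://my-bucket-2/") == ("my-bucket-2", "")
```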
Did anyone find any solution for this? I am having huge problems with the client-side handling of this.
Not sure if this has been fixed, but mine works in what seems to be the same scenario:
$ aws --version
aws-cli/1.18.28 Python/3.6.8 Darwin/19.3.0 botocore/1.15.28
$ aws s3 sync s3://<bucket1>/<prefix1>/test_data/ s3://<bucket2>/
copy: s3://<bucket1>/<prefix1>/test_data/test-file-1.txt to s3://<bucket2>/test-file-1.txt
copy: s3://<bucket1>/<prefix1>/test_data/test-file-2.txt to s3://<bucket2>/test-file-2.txt
copy: s3://<bucket1>/<prefix1>/test_data/test_folder/test-file-3.txt to s3://<bucket2>/test_folder/test-file-3.txt
This also worked for me (note the change from test_data to test_data2):
$ aws s3 sync s3://<bucket1>/<prefix1>/test_data/ s3://<bucket2>/<prefix1>/test_data2/
copy: s3://<bucket1>/<prefix1>/test_data/test-file-1.txt to s3://<bucket2>/<prefix1>/test_data2/test-file-1.txt
copy: s3://<bucket1>/<prefix1>/test_data/test-file-2.txt to s3://<bucket2>/<prefix1>/test_data2/test-file-2.txt
copy: s3://<bucket1>/<prefix1>/test_data/test_folder/test-file-3.txt to s3://<bucket2>/<prefix1>/test_data2/test_folder/test-file-3.txt
@dongting It seems yours is not the same scenario, as you are syncing the entire folder instead of using a file prefix in the folder, like aws s3 sync s3://<bucket1>/<prefix1>/test_data/test-file- s3://<bucket2>/.
Regarding the initial issue, I would prefer sync to use a prefix with a wildcard (like s3://<bucket1>/file-prefix-*) instead of just the prefix the way ls does (s3://<bucket1>/file-prefix-), for two reasons:
1. It would be a breaking change for anyone who already uses sync and expects the current behavior.
2. It makes explicit that a prefix is being used (sync is normally a more dangerous action than ls, and syncing all files with a prefix when you expected to sync only the exact match could have very unwanted consequences).
That said, if the efficiency of --exclude "*" --include "prefix*" started to match the efficiency of ls (in this specific case, where the wildcard is at the end of the include pattern), that would be fine too, without having to change the way it's done currently.
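The "wildcard only at the end" case mentioned above is exactly the case where a glob could be pushed down to a server-side prefix. A sketch of that detection (hypothetical helper; the CLI does no such translation today):

```python
def pattern_to_prefix(pattern):
    # If the only glob character is a single trailing "*", the pattern
    # is equivalent to an S3 key prefix and could be filtered
    # server-side; otherwise it needs client-side matching.
    if pattern.endswith("*") and not any(c in pattern[:-1] for c in "*?["):
        return pattern[:-1]
    return None

assert pattern_to_prefix("2023-04-24-19-15-3*") == "2023-04-24-19-15-3"
assert pattern_to_prefix("prefix*") == "prefix"
assert pattern_to_prefix("*.jpg") is None  # wildcard not at the end
```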
I think we have two issues here.
1. You want to sync files using a prefix. Solution: allow appending partial paths or prefixes to the bucket URL.
2. There is no way to bulk sync multiple files by passing multiple paths to the sync command. You have to call sync several times to sync all the files you want, which is very inefficient. Solution: add an --includePath option.
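The second point could in principle be served by a single listing pass. A sketch of the hypothetical --includePath semantics (the option does not exist; this only illustrates matching keys against several prefixes at once):

```python
def match_any_prefix(keys, prefixes):
    # Hypothetical --includePath behavior: keep a key if it starts
    # with any of the requested prefixes, in one pass over the listing.
    return [k for k in keys if any(k.startswith(p) for p in prefixes)]

keys = ["logs/a", "logs/b", "img/c", "tmp/d"]
assert match_any_prefix(keys, ["logs/", "img/"]) == ["logs/a", "logs/b", "img/c"]
```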
and moreover -- it just silently does nothing and does not exit with a non-zero exit code... uff
❯ aws s3 ls s3://dandiarchive-logs/2023-04-24-19-15-3
2023-04-24 15:15:31 91377 2023-04-24-19-15-30-02B638249E7D6A3C
2023-04-24 15:15:33 81438 2023-04-24-19-15-32-42D8F376DAE3D09C
2023-04-24 15:15:37 579 2023-04-24-19-15-36-2715E307EF06A155
❯ aws s3 sync s3://dandiarchive-logs/2023-04-24-19-15-3 .
❯ echo $?
0
FWIW, s3cmd seems to work fine:
❯ s3cmd -c ~/.s3cfg-dandi-backup sync s3://dandiarchive-logs/2023-04-24-19-15-3 .
download: 's3://dandiarchive-logs/2023-04-24-19-15-30-02B638249E7D6A3C' -> './2023-04-24-19-15-30-02B638249E7D6A3C' [1 of 3]
91377 of 91377 100% in 0s 442.15 KB/s done
download: 's3://dandiarchive-logs/2023-04-24-19-15-32-42D8F376DAE3D09C' -> './2023-04-24-19-15-32-42D8F376DAE3D09C' [2 of 3]
81438 of 81438 100% in 0s 694.83 KB/s done
download: 's3://dandiarchive-logs/2023-04-24-19-15-36-2715E307EF06A155' -> './2023-04-24-19-15-36-2715E307EF06A155' [3 of 3]
579 of 579 100% in 0s 7.81 KB/s done
Done. Downloaded 173394 bytes in 1.0 seconds, 169.33 KB/s.
❯ ls
2023-04-24-19-15-30-02B638249E7D6A3C* 2023-04-24-19-15-32-42D8F376DAE3D09C* 2023-04-24-19-15-36-2715E307EF06A155*
The --exclude / --include dance to solve the issue "works" in that it achieves the goal, but it is VERY inefficient, since I believe the filtering is done client-side:
❯ time aws s3 sync --exclude=* --include 2023-04-24-19-15-3* s3://dandiarchive-logs/ .
download: s3://dandiarchive-logs/2023-04-24-19-15-36-2715E307EF06A155 to ./2023-04-24-19-15-36-2715E307EF06A155
download: s3://dandiarchive-logs/2023-04-24-19-15-30-02B638249E7D6A3C to ./2023-04-24-19-15-30-02B638249E7D6A3C
download: s3://dandiarchive-logs/2023-04-24-19-15-32-42D8F376DAE3D09C to ./2023-04-24-19-15-32-42D8F376DAE3D09C
aws s3 sync --exclude=* --include 2023-04-24-19-15-3* s3://dandiarchive-logs/ 1158.25s user 9.73s system 52% cpu 36:44.04 total
so it took 36 minutes to complete this sync of 3 keys, and I saw the network being busy, so I guess it was fetching the full listing and doing the selection locally.
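That matches the expected cost: the client-side filter pages through every key in the bucket, while a server-side prefix listing would touch only the matching keys. A toy cost model (hypothetical key counts; ListObjectsV2 returns at most 1000 keys per request):

```python
import math

def pages_fetched(total_keys, matching_keys, use_prefix, page_size=1000):
    # With a server-side prefix, only matching keys are paged through;
    # with client-side filtering, every key in the bucket is.
    listed = matching_keys if use_prefix else total_keys
    return math.ceil(listed / page_size)

# e.g. 3 matching keys in a bucket of 5 million objects:
assert pages_fetched(5_000_000, 3, use_prefix=True) == 1
assert pages_fetched(5_000_000, 3, use_prefix=False) == 5000
```

One list request versus five thousand, for the same three downloads.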
Sorry for waking up this thread, but --include/--exclude seem to be extremely inefficient.
Hi,
I have an S3 Bucket that has a range of files like this:
Each prefix has millions of files in the bucket.
If I want to list files by a prefix, I can do this really efficiently, like so:
aws s3 ls s3://my-bucket/1 - will only list files beginning with 1
aws s3 ls s3://my-bucket/2 - will only list files beginning with 2
This is efficient, as it doesn't return any files without the appropriate prefix. The API handles the list, only returning files that start with 1 or 2.
The same isn't possible for the aws s3 sync command. aws s3 sync s3://my-bucket/1* s3://my-new-bucket/ or aws s3 sync s3://my-bucket/1 s3://my-new-bucket/ does not work. The only way to do it is by adding --exclude and --include parameters, for example aws s3 sync s3://my-bucket/ s3://my-new-bucket/ --exclude "*" --include "1*".
But this is really inefficient, as it lists the whole contents of the bucket and filters each item client-side to determine whether it matches. This method makes perfect sense for normal include/exclude operations like --exclude "*" --include "*.txt", but for prefixed files it is inefficient.
I know that folder/path prefixes should be used, but that isn't always possible. It is clear that the aws s3 ls command has the potential to return only files with a specific prefix, so could this behaviour be added into the sync command? Otherwise syncs between buckets take weeks instead of hours.