aws / aws-cli

Universal Command Line Interface for Amazon Web Services
15.35k stars · 4.09k forks

aws s3 sync --exclude (still) inconsistent #1588

Open drzraf opened 8 years ago

drzraf commented 8 years ago
$ cd ~
$ aws s3 sync / s3://bucket/test/ --exclude "bin/*"  --dryrun
$ aws s3 sync / s3://bucket/test/ --exclude "/bin/*"  --dryrun
$ aws s3 sync / s3://bucket/test/ --exclude "./bin/*"  --dryrun
$ aws s3 sync / s3://bucket/test/ --exclude "/bin/*"  --dryrun
$ cd /
$ aws s3 sync / s3://bucket/test/ --exclude "bin/*"  --dryrun
$ aws s3 sync / s3://bucket/test/ --exclude "/bin/*"  --dryrun
$ aws s3 sync / s3://bucket/test/ --exclude "./bin/*"  --dryrun
$ aws s3 sync / s3://bucket/test/ --exclude "/bin/*"  --dryrun

None of the above commands (whatever the cwd is) excludes /bin from being synced. Using "*/bin/*" would work, but it's unwanted since it would also exclude other subdirectories like home/foo/comp/bin/.

mtdowling commented 8 years ago

I played around with this, and I too see that the --exclude behavior is not working well (this possibly has to do with the fact that the root folder is being used).

thedukeness commented 8 years ago

Perhaps this is related, I am using aws-cli/1.10.0 and cannot ever get a file like this to be included or excluded: ._3E53F853-DA82-4926-AC21-5C1096FB126C.MP3

aws s3 rm s3://foo/foo2/foo3 --include "*/._*" --exclude "*" --recursive --dryrun
aws s3 rm s3://foo/foo2/foo3 --include "*._*" --exclude "*" --recursive --dryrun
aws s3 sync /foo/foo2/foo3 s3://foo/foo2/foo3 --exclude "*/._*" --delete --dryrun
kirkmadera commented 8 years ago

According to #548, it seems as though the behavior is to prepend each exclude/include pattern with the current working directory. I was able to verify this behavior with the following test.

Given the directory structure:

test1.txt
test2/
    test2.txt
    test3/
        test3.txt

aws s3 sync . s3://examplebucket --exclude "test2/test3/*"
(test3.txt is excluded)

aws s3 sync . s3://examplebucket --exclude "test3/*"
(test3.txt is not excluded)

So it seems paths are always relative to the root of the sync, which is a bit unintuitive in my opinion. I think most users would prefer rsync-style syntax, where you must prepend the pattern with "/" for it to be relative to the root of the sync.
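A minimal sketch of that matching behavior (assuming, as the tests above suggest, that each pattern is joined onto the sync root and then glob-matched against the file's full path; awscli is Python, but the exact internals may differ):

```python
import fnmatch
import os

def is_excluded(sync_root, file_path, pattern):
    # Observed behavior: the --exclude pattern is joined onto the
    # root of the sync, then glob-matched against the file's path.
    full_pattern = os.path.join(sync_root, pattern)
    return fnmatch.fnmatch(file_path, full_pattern)

# Reproduces the two results from the directory tree above:
print(is_excluded(".", "./test2/test3/test3.txt", "test2/test3/*"))  # True
print(is_excluded(".", "./test2/test3/test3.txt", "test3/*"))        # False
```

This also explains the original report: a pattern is only ever anchored at the sync root, and there is no rsync-style "/" prefix to control the anchoring.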

thedukeness commented 8 years ago

Great detective work kirkmadera. That is quite unintuitive but at least now I can understand what is happening.

elondaits commented 8 years ago

I'm using it exactly like that (except I use many "exclude" statements one after the other) and still it won't honor the exclusion.

nkadel-skyhook commented 8 years ago

It's not just "relative paths". If I have mixed data files, such as this, I cannot effectively exclude one while including the other:

       dirname/log.1
       dirname/log.2
       dirname/log.3

       dirname/log.foo.1
       dirname/log.foo.2
       dirname/log.foo.3

As soon as I use any "--include=" statement that would include both, no combination of arguments allows me to exclude the other logs. Combined with the non-intuitive behavior of "--exclude=*" having to come before "--include=[whatever]", it makes for very confusing operations.

      aws s3 sync --exclude=* --include=log.* --exclude=log.foo.* dirname/ s3://bucket/dirname/
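For what it's worth, the documented rule is that the last filter to match a file wins, with everything included by default. A small sketch of that rule (hypothetical decide() helper, using bare file names and ignoring the source-root prefixing discussed above):

```python
import fnmatch

def decide(path, filters):
    """filters is a list of ("include"|"exclude", pattern) pairs,
    applied in order; the LAST pattern that matches the path wins.
    Files are included by default."""
    included = True
    for kind, pattern in filters:
        if fnmatch.fnmatch(path, pattern):
            included = (kind == "include")
    return included

filters = [("exclude", "*"), ("include", "log.*"), ("exclude", "log.foo.*")]
print(decide("log.1", filters))      # True:  "log.foo.*" never matches
print(decide("log.foo.1", filters))  # False: "log.foo.*" matches last
```

Under that rule the ordering in the command above should exclude the log.foo.* files; in practice the source-root prefixing is what usually gets in the way.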
bithive commented 6 years ago

This has been a major flaw in aws s3 sync for years: there is no way to exclude dotfiles such as .DS_Store or directories such as .git or .DAV. I have spent hours trying different invocations, and the --exclude argument appears to have no effect at all.

As others have stated, the --exclude flag should work as it does in rsync. We need to be able to easily indicate file and directory names for exclusion anywhere they might appear in the path being synchronized.

brownjd commented 6 years ago

It would be great if you could just follow the rsync conventions. Bonus if it would allow you to specify a rsync-style filter file.

cjeanneret commented 6 years ago

"happy" to see I'm not the only one getting beaten to death by this option… It really does hurt.

pyrtsa commented 6 years ago

I'm also seeing this bug with aws s3 cp --recursive --exclude '.*' localpath s3://bucket/path. The trouble is I'm trying to exclude a file that's unreadable to the current user but regardless of the exclude patterns the file is still attempted (in a way).

The core problem is that unreadable files turn into warnings before the --exclude filter is applied.

The file generator and filter instructions are inserted one after another here. When the file generator runs to list available files in the hierarchy, it triggers warnings for unreadable files completely unaware of the further filtering stage that would skip those files altogether.

I suggest that FileGenerator should not call triggers_warning on the happy code path (as it does in should_ignore_file); instead, either there should be an extra instruction stage that checks for file warnings only after filtering, or FileGenerator and Filter should merge into a single instruction.
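A tiny sketch of the suggested reordering (hypothetical names; the real pipeline lives in awscli's FileGenerator/Filter instructions): filtering runs before the readability check, so an unreadable file that is excluded never produces a warning.

```python
import fnmatch

def upload_candidates(paths, exclude, is_readable):
    kept, warnings = [], []
    for path in paths:                      # file generator stage
        if fnmatch.fnmatch(path, exclude):  # filter stage runs first...
            continue
        if not is_readable(path):           # ...then the warning check
            warnings.append(path)
            continue
        kept.append(path)
    return kept, warnings

paths = [".secret", "notes.txt"]
kept, warnings = upload_candidates(paths, ".*", lambda p: p != ".secret")
print(kept, warnings)  # ['notes.txt'] [] — the excluded file never warns
```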

aemc commented 5 years ago

Can verify it's still an issue...

droidmonkey commented 5 years ago

This is not so much an issue in the code as a gap in the documentation. The documentation needs to explicitly state what is in this comment:

https://github.com/aws/aws-cli/blob/8b61d2d982cc43be22678948c8fdc8b4eb652631/awscli/customizations/s3/utils.py#L117-L119

Nowhere in the documentation does it state that the last filter to match a path determines whether that path is included or excluded. For example:

aws s3 cp mydir s3://mybucket --recursive --exclude "*" --exclude "*/*" --include "*.txt"

Results in copying all txt files, including those in subdirectories, since the include filter matches last and overrides the exclude of "*/*".

aws s3 cp mydir s3://mybucket --recursive --exclude "*" --include "*.txt" --exclude "*/*"

Works as intended: subdirectories are excluded, since the last filter to match them is an exclude.
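The two orderings can be reproduced with a small last-match-wins sketch (hypothetical included() helper, matching relative paths with fnmatch and ignoring the source-root prefix):

```python
import fnmatch

def included(path, filters):
    # Last matching filter wins; files are included by default.
    result = True
    for kind, pattern in filters:
        if fnmatch.fnmatch(path, pattern):
            result = (kind == "include")
    return result

bad  = [("exclude", "*"), ("exclude", "*/*"), ("include", "*.txt")]
good = [("exclude", "*"), ("include", "*.txt"), ("exclude", "*/*")]

print(included("sub/a.txt", bad))   # True:  "*.txt" matches last
print(included("sub/a.txt", good))  # False: "*/*" matches last
print(included("a.txt", good))      # True:  "*/*" doesn't match
```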

My main problem with the way the code is implemented is that there is no way to stop recursion into a directory that is excluded. This results in totally unnecessary deep recursion and long run times. There could be an option such as --excludestop that if matches false skips any further checks of the directory tree.

scrobby commented 5 years ago

Also having this issue when trying to exclude any files beginning with ._

I'm trying to use this to sync Avid projects to a server, using the following command:

for f in /Volumes/*Projects; do /usr/local/bin/aws s3 sync $f s3://[companyname]-avid.backups/${f/\/Volumes\//}/ --exclude "*.lck" --exclude "._*" --exclude "*/SearchData/*" --exclude "*/WaveformCache/*" --exclude "*/Unity Attic/*"; done

All the other filters work as expected, but it's still uploading any files starting with ._, which is irritating, as those are just dud files Avid generates when using shared storage.
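The likely culprit is the root-prefixing again: since the pattern is joined onto the sync root, "._*" only matches "._" files at the top level of the project. A sketch with Python's fnmatch (the root and file paths here are hypothetical):

```python
import fnmatch
import os

root = "/Volumes/MyProjects"                       # hypothetical sync root
path = os.path.join(root, "Avid MediaFiles/._dud") # a nested "._" file

# "._*" is anchored at the sync root, so it misses nested files:
print(fnmatch.fnmatch(path, os.path.join(root, "._*")))    # False
# "*/._*" matches "._" files in any subdirectory:
print(fnmatch.fnmatch(path, os.path.join(root, "*/._*")))  # True
```

So using both --exclude "._*" (for the top level) and --exclude "*/._*" (for subdirectories) should cover every case.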

bithive commented 5 years ago

At this point I am convinced that the reason AWS provides defective tools and does not fix them is that S3 and Glacier are more (most?) profitable when people use them to store tons of files which are smaller than the minimum billable increment. Bummer, but not too surprising.

droidmonkey commented 5 years ago

@bithive considering the tool is 100% open source, and you could modify it yourself to fix observed issues, I absolutely cannot agree with your thesis.

Hyurt commented 5 years ago

One thing we should keep in mind, mentioned here and referring to this issue:

For that, I used the --exclude="src/*" parameter (I also make sure the aws command is called from my $HOME directory, since I learned that the filters start matching from the current directory -- more details can be found in the #1588 issue).

Maybe using something like **/.* would work in your cases? I use it in a cron job and hidden files aren't sent to S3. Hoping it helps in your case.

diegojancic commented 4 years ago

This is terribly confusing, but I mostly figured it out. I recommend running with the --debug and --dryrun flags to see what's going on, for example:

aws s3 sync <source> <dest> --debug --dryrun --exclude ...

In my case: aws s3 sync D:\test\ s3://bucket/ --dryrun --debug --exclude "Baks/*"

You will see something like this:

awscli.customizations.s3.filters - DEBUG - d:\test\Baks\1.rtf matched exclude 
filter: d:\test\Baks\*

That means that for each exclude, the source path is prepended, and the resulting pattern is matched against each file's full path.

Now, if I run the same on the root of the drive: aws s3 sync D:\ s3://bucket/ --dryrun --debug --exclude "Baks*"

That's where it gets complicated. The reason is that the path being matched will have 2 backslashes, for example: d:\\Baks\.... did not match exclude filter: d:\Baks*

From my tests:

  1. If the filter is //folder*, then the path is matched against d:\\folder* and \\folder*, which will match.
  2. If the filter is //folder/* then the path is matched against \\folder\* and \\folder\* (yes, 2 times the exact same pattern), which will not match.
  3. If you don't include the \\ or // at the beginning, it won't match as the path will contain them.
  4. Another option is to specify the pattern as */folder/* which will match, but will also match that folder anywhere.

Sorry if it's not clear; the behavior is very unpredictable. Your best bet is to enable --debug and try it yourself. Also, all these tests were on Windows; the details probably differ on Linux.

sagg155 commented 3 years ago

It's not just "relative paths". If I have mixed data files, such as this, I cannot effectively exclude one while including the other:

       dirname/log.1
       dirname/log.2
       dirname/log.3

       dirname/log.foo.1
       dirname/log.foo.2
       dirname/log.foo.3

As soon as I use any "--include=" statement that would include both, no combination of arguments allows me to exclude the other logs. Combined with the non-intuitive behavior of "--exclude=*" having to come before "--include=[whatever]", it makes for very confusing operations.

      aws s3 sync --exclude=* --include=log.* --exclude=log.foo.* dirname/ s3://bucket/dirname/

--exclude='*' only partially worked for me: I was not able to exclude an entire folder using --exclude '*' while including some specific files using --include 'main.js' --include 'main.css'.