MathieuLoutre / grunt-aws-s3

Grunt plugin to interact with AWS S3 using the AWS SDK
MIT License

Any way to do an upload with a `noReplace` option? #6

Closed SimplGy closed 11 years ago

SimplGy commented 11 years ago

First off, this thing works super-great. I tried https://github.com/pifantastic/grunt-s3 first, but it's kind of flaky and uses a non-standard file specification.

Is there a no replace option I'm missing, or is that pretty far ahead in terms of milestones? The use case would be sort of like a sync, but easier to implement maybe? I build actual changes into new directories anyway :)

MathieuLoutre commented 11 years ago

Thanks! I'm happy to hear that it's helpful :)

Another user mentioned this and I need to look into it. AWS provides some data that I could maybe check against instead of having to download the files. I'll have a read in the coming week.

Just so I'm not misunderstanding, I see it as a flag that prevents re-uploading the files that haven't changed.
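For illustration, a flag like that might look something like this in a Gruntfile (the option name and file paths here are hypothetical; the thread later settles on `differential` instead):

```javascript
// Hypothetical sketch of the proposed "no replace" flag in a grunt-aws-s3 target.
// The `noReplace` option name is made up for this discussion.
grunt.initConfig({
  aws_s3: {
    production: {
      options: { noReplace: true },  // skip files already present on the bucket
      files: [
        { expand: true, cwd: 'dist/', src: ['**'], dest: '/' }
      ]
    }
  }
});
```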

SimplGy commented 11 years ago

Yes, exactly.

Seems like deployment is the most common use case for grunt-aws-s3, so a way to do a sync deploy would make the deploy faster and less expensive.

Very useful.

MathieuLoutre commented 11 years ago

Hi @SimpleAsCouldBe, the development branch has a first draft of a sync action that just does the noReplace. I'm looking at a nice way to also delete the objects on the server that have been deleted locally.

SimplGy commented 11 years ago

Hey, awesome! The delete is way less important to me and sometimes undesirable, so I'm stoked!

What is the basis for comparison of files? Timestamp? MD5? Filename?

MathieuLoutre commented 11 years ago

Yes. As for a real sync (delete on the server what's not there locally, delete locally what's not on the server, and download what's on the server but missing or outdated locally), I'm wondering if it's really desirable.

The check is based on filename first, then MD5, then timestamp. However, I'm not too sure about keeping the timestamp test. If the MD5 is different but the server timestamp for that file is newer than your local one, it means someone may have uploaded something you don't have (in that case, should it be downloaded?). That can only happen, though, if two users upload to the same bucket and don't use a version control system to exchange their files.

What do you think?

Hopefully I will have the time to do some more testing this weekend and maybe release a first version of the sync. I'm still looking at a better name considering that's not exactly what it's doing.

SimplGy commented 11 years ago

Wow @MathieuLoutre. It sounds like you are making this very robust.

I'd like to hear others' opinions but I'll try to generalize my experiences.

For the use case of deploying code with versioned source, I think you can provide 90% of the value with 10% of the features. Let's consider this use case. I think it's pretty standard, but then most people think what they do is normal :)

Use Case

Given this use case, you could provide an awesome deployment experience that doesn't re-send bunches of files with only a few rules.

The Simplest Rules That Could Possibly Work:

# For each file...
unless filename exists on target
  deploy file
else
  if timestamp is different
    deploy it

My thought is that the timestamp, if reliable, might be enough to detect a difference in this case, since MD5s can take time to generate, slowing things down. Typically I just want the deploy to mirror whatever I'm sending; skipping files is only a time- and bandwidth-saving luxury.

If two team members are trying to deploy at the same time, this would get weird, true. But I think this is a minor problem because deployment is either A: Significant -- So everyone knows about it (eg: "hey guys, we're pushing the button to go to production"), or B: Automated -- eg: Team City builds the CI branch and automatically deploys.

Collisions should be minimal and source is in version control so they're easy to recover from.

If you wanted to get fancier the rules could perhaps extend like this:

# For each file...
unless filename exists on target
  deploy file
else
  if local timestamp is newer and md5 is different
    deploy it

...I'm not sure about that timestamp though. I think S3 has some options to override file timestamps, and with timezones and what-not it makes me a little nervous :) Maybe the MD5 is the right way to check, but I wouldn't want the build to slow down because of it.

Ideas for flag names:

aws_s3: options:
  differentFilesOnly: true # only send files if they don't exist or are different

aws_s3: options:
  skip: 'filenameMatch' # skip file if the filename already exists

aws_s3: options:
  action: 'sync'
  comparator: 'timestamp' # other options: 'md5', 'timestampAndMd5'

aws_s3: options:
  action: 'differential' # as in "differential backup" http://typesofbackup.com/
  comparator: 'md5'
MathieuLoutre commented 11 years ago

Hi @SimpleAsCouldBe, that's awesome feedback, thank you very much.

I agree with you, deployment being either an important "team thing" or automated, having a clash where the version on the server is newer than the local version doesn't seem very likely.

At the moment, I'm doing exactly your second option (MD5 test + date check). Like you, I think MD5 should be enough to decide whether to upload or not; I don't really like relying on last-modified dates. Fortunately, to produce the MD5 hash I have to do the same thing as uploading the file anyway (read the file into a buffer, hash it, and compare it to the ETag given by AWS).

Therefore, doing just the MD5 check shouldn't make things slower. I'll remove the date check for now, and if someone comes up with a situation where it's useful I'll be happy to include it again.

As for the option name, I like "differential" very much. I think in the end it won't be a separate action but a mode, so you could also have a differential download (download everything that's different on the server) and a differential delete (delete everything on the server that's not here locally) in the future. I'm not sure about including those now, although with all three differential actions you could make up your own "sync".

I'm also working on a way to make robust unit tests for these. I've included a debug mode in the dev branch so I think I could compare the output with the expected one but you'd still need a bucket and some AWS credentials to run the tests...

I'll keep you posted and thanks again for the feedback, it's really appreciated :)

SimplGy commented 11 years ago

I'm glad you found it helpful, @MathieuLoutre. I'll probably be using this plugin for every front-end project I do going forward, so I am happy to help.

Wow, I guess unit testing is tricky for something this service-oriented. Depending on an S3 bucket would be an integration test, but that may still be the most useful thing. If you want tests decoupled from AWS, you could record every response from the AWS service and feed those into your tests as canned responses from mock versions of those methods.

I don't know the actual chain of events, but for example:

local files:      a.js, b.html
remote files:   a.js
aws upload     **
aws response "a.js exists with hash 2l52345lk3jlk3lk4j"
if a.js.hash == a.js.remote.hash
  aws upload b.html
else
  aws upload a.js, b.html
aws response "b.html uploaded"

assert response contains "b.html"
assert response does not contain "a.js"

I guess you could mock every method in the AWS library you call and have it return what you'd expect for each situation, then test that your code reacts to the response in the way you want. I'm sure it's not as simple as this example though :)
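That mocking idea could look something like this minimal sketch (all names here are made up for illustration, not the plugin's real internals; the fake client just records calls instead of hitting AWS):

```javascript
// A fake S3 client: looks up canned ETags and records uploads.
function makeFakeS3(remote) {                  // remote: { key: etag }
  const uploaded = [];
  return {
    headObject(key) {                          // pretend HEAD request
      return remote[key] ? { ETag: remote[key] } : null;
    },
    putObject(key) { uploaded.push(key); },    // pretend upload
    uploaded,
  };
}

// Differential-upload logic under test: skip files whose ETag already matches.
function differentialUpload(s3, localFiles) {  // localFiles: { key: etag }
  for (const [key, etag] of Object.entries(localFiles)) {
    const head = s3.headObject(key);
    if (!head || head.ETag !== etag) s3.putObject(key);
  }
}

// Mirrors the example above: a.js unchanged on the server, b.html is new.
const s3 = makeFakeS3({ 'a.js': '2l52345lk3jlk3lk4j' });
differentialUpload(s3, { 'a.js': '2l52345lk3jlk3lk4j', 'b.html': 'deadbeef' });
// s3.uploaded is now ['b.html']
```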

MathieuLoutre commented 11 years ago

Well, at the moment I'm toying with building a JSON database representing the buckets and mocking the S3 SDK interface over it (just like you mentioned). It's a lot of work, so I'll see if I can make it happen, but it would reassure me a lot about actions like delete.

Thanks again for your input!

MathieuLoutre commented 11 years ago

Hi @SimpleAsCouldBe! Good news is the dev branch now has what looks like 0.7.0. It has the differential option and has tests using a mock interface to AWS S3. If you could have a try and tell me what you think before I publish that would be awesome.

Also, some thinking on the differential option: it can later become an object or a string carrying the comparator option (as you mentioned before) if that's ever necessary. Then, later, I can make differential work with downloads and deletes, and you could virtually do your own sync.

(Try to run the tests too if you can!)

MathieuLoutre commented 11 years ago

I've updated the dev branch with differential download and delete. I've tested it as much as I could and it seems relatively solid. I'm waiting a little longer before releasing it on npm. Give it a try if you can!

MathieuLoutre commented 11 years ago

Hi @SimpleAsCouldBe! I've released the package under 0.7.0. It includes the differential upload we were talking about as well as the differential download and delete. Let me know if you run into any problems.

SimplGy commented 11 years ago

I finally had a chance to give this a shot. All tests are green. Love it.

All I had to do to leverage it is add the differential:true option. Super easy to integrate the new feature into the build.
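Based on the comment above, enabling the released 0.7.0 behaviour is a one-line option change (target name and file paths here are illustrative):

```javascript
// Differential upload in 0.7.0: only new or changed files are sent.
grunt.initConfig({
  aws_s3: {
    deploy: {
      options: { differential: true },  // skip files whose MD5 matches the S3 ETag
      files: [
        { expand: true, cwd: 'dist/', src: ['**'], dest: '/' }
      ]
    }
  }
});
```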

I LOVE the color-coded console output that shows a === sign when the file matches. Very confidence-building. It might go green instead of yellow, since it's a positive indication rather than a warning, but it doesn't matter much.

Thanks! Hope this becomes a grunt-contrib plugin, it's the ultimate grunt+aws combo.

MathieuLoutre commented 11 years ago

I'm glad that you like it! I don't know the criteria to become a "contrib" plugin but I suppose I could always ask if they need it :)

As for the yellow, I wasn't sure either but I figured I want to make sure that people realise that this file hasn't changed. It's a conflicting message: "Great, didn't have to upload this file, however you want to make sure this is what you wanted to happen". So far it works. I keep the green for the last part of the log and the upload dots, as I see it as "operation completed".

Thanks again for your help!