Upload zip instead of each file one by one?

PierrickLozach commented 4 years ago

This is a:

[x] Feature request or change
[] Bug report

Related to #81

For feature requests or changes:

Current behavior (if any)

Currently uploads each file one by one. My build folder has 500+ files and I reach the free tier limit (2000 PUT/POST S3 requests) quickly.

Proposed behavior

Can serverless-finch upload a zip file then uncompress locally (just like serverless does for Lambda functions)?

Justification

Save money :-)

fernando-mc commented 4 years ago

Hi @PierrickI3,

Happy to take a look at this if you think there are API options here that might accomplish it. I'm not sure that S3 supports this sort of approach at the moment?

PierrickLozach commented 4 years ago

Hi @fernando-mc , thanks for getting back to me.

Uploading a zip file then uncompressing it requires a lambda function that needs to be triggerred to uncompress the file in the same bucket or in a different one.

Could you maybe add a --zip option to upload one compressed file instead of uploading each file separately. This way, we could add notes in the README on creating a Lambda function that will be triggered to uncompress the files in the same bucket or in a target bucket.

fernando-mc commented 4 years ago

I guess I'm not entirely sure what this would resolve? Where would the zip file go? Lambda /tmp storage? S3 itself?

Either way, I think Lambda still has to unpack the zip file in /tmp and then upload it to S3 file by file?

Can you write out the SDK call flow you think would make sense here and showcase how it might save on cost? I know network data costs are a thing in AWS but I'm not sure how much this might actually end up saving anyone. If there's a compelling cost savings I'm open to it, but I think at some point the S3 requests still need to happen?

PierrickLozach commented 4 years ago

From my point of view (unless I'm wrong), uploading a zip file is only one PUT request. Then, AWS does not charge when copying files from bucket to bucket.

So, the flow goes like this:

Upload a zip file of ./build to a bucket
Set up a Lambda function that is triggered when .zip files are uploaded in that bucket
Lambda will then uncompress all files

You could eventually use a separate bucket to store the .zip file then uncompress onto another bucket but that would be more complex.

From a user point of view, I would see the command like this: serverless client deploy --zip. It would be up to the user to set up the monitor Lambda function.

Makes sense?

fernando-mc commented 4 years ago

But how would Lambda uncompress the zip? I don't think you can treat S3 like an attached drive to Lambda? https://forums.aws.amazon.com/thread.jspa?threadID=46575 https://www.quora.com/How-do-I-extract-a-zip-file-in-Amazon-S3

This might be an option - https://medium.com/@johnpaulhayes/how-extract-a-huge-zip-file-in-an-amazon-s3-bucket-by-using-aws-lambda-and-python-e32c6cf58f06

But I'm still unsure this would do what we need.

PierrickLozach commented 4 years ago

It will be indeed very similar to the option link you added in your comment.

Lambda functions can be set to monitor files with specific filetypes (e.g. .zip) and start when a file that matches is added. The Lambda function can then take the file and unzip it.

fernando-mc commented 4 years ago

If this were being uploaded to an EC2 instance for hosting I'd agree this was an important thing to do. But I'm not convinced this addresses your original concern?

Option 1 (current workflow)

Upload every file and pay for all the putObjects for all objects

Option 2 (zip workflow we're discussing)

Upload the zip to S3/Lambda
Unzip the contents
Then make all the same put requests that would happen in part 1 of option 1

I'm ignoring network costs here which may be more substantial at scale. But I don't see how option 2 is an improvement yet?

el2ro commented 4 years ago

Hi, I just had the same "problem" and found this. Yes free tear runs out fast (500+ items in static web page)

I was hoping that it would make it cheaper, but price might be the same?

Option1:

Well pricing is pretty cheap anyway like ~0.005$ / 1000 put items, so it might not be worth the changes.

Option2: (zip flow)

Only benefit could be that it would be faster to transfer. I assume, it would be much faster to first transfer one compressed file to S3 and unpacked files trough the lambda back to S3. At least for me it took quite long time to transfer 197MB 500+ files to S3. Tar xvzf brings it down to 50MB.

Option3 for dev: (this got out of hand)

There should be smart transfer for the files. Setup lambda, ask S3 buckets md5 sum for all the files, pack local files having different md5, transfer, unpack files to S3, remove lambda.

PierrickLozach commented 4 years ago

@fernando-mc as long as the S3 buckets (source and target) are in the same region, there is no extra cost for copying between S3 buckets.

so your Option 2 seems the valid option for me.

fernando-mc commented 4 years ago

@el2ro Would it be faster to transfer? You're essentially doing the process twice. Once with the zip, then unzipping, then running the S3 PUTs from Lambda instead of locally. Even if you stuff it in the Lambda function package (which is a serious hack) you still have to then run a copy from there to S3.

@PierrickI3 I think it might save on data transfer costs like I was saying earlier, but don't you still pay for the PUT requests?

There are several layers of pricing and the PUT requests are one level and then you add on the data transfer costs I think (except in this case if you're transferring to a Lambda in the same AWS region. https://aws.amazon.com/s3/pricing/

This sounds like a lot of hacky work for two unverified benefits:

Speed
Cost
Uploading a zip to Lambda along with an S3 zip and then unpacking it may not be faster. You're transferring the zip and still running s3 PUTs from Lambda, Also, you should probably just be using Transfer acceleration to improve speed https://docs.aws.amazon.com/AmazonS3/latest/dev/transfer-acceleration.html
For cost, it's not clear to me how much this would plausibly save. The PUT requests are still a requirement, so the data transfer costs are the real cost here. When using a Lambda model you could end up paying for Lambda (probably not if it's in the free tier usage still).

If someone wants to code the MVP of this and show me the metrics that make it a better solution go for it! But I'm still struggling to see the benefits here vs:

Transfer Acceleration
Compressing your assets in general (fonts, minified js, etc.)

el2ro commented 4 years ago

I did some quick test setup. Just modified how finch code works, to compress files and upload zip. Added lambda function to do the uncompress. Adding lambda and running lambda is manual task.

Codes: https://github.com/el2ro/serverless-finch Results: https://github.com/el2ro/serverless-finch/blob/master/test/zip_test/README.md Roughly 50% faster when zipping files (with my client/dist files and my network environment)

A bit disappointing results, I was hoping it to be much faster.

el2ro commented 4 years ago

I am still a bit tempted to try out the simplified Option3

First time it would work as is, but it would write down locally(or to S3) the .cache file having MD5 from all the transferred files.
Next time, it would first check if .cache locally (or from S3) file exists and compares files MD5 before sending. This allows to send only changed files so it would possible bring the DEV round up time and cost down.

Any comments to this approach?

fernando-mc commented 4 years ago

@el2ro Nice work here testing this out and showcasing the improvements.

As far as an MVP I think that demonstrates an improvement as far as speed goes. Pricing wise I still things this ends up being more expensive, but it could very well be a faster option as transfer acceleration might not even be relevant for many static websites (I found out recently it prohibits periods in the bucket name).

However, I think it may be possible to get a lot of the benefits in smaller incremental updates to the local version of the plugin and I'm curious about your thoughts on splitting out these ideas:

Creating md5 sums for all the files and uploading only changed files

We could add a config option in the finch config called hashfile or something that is a location of a JSON file inside the site folder. If there are no contents in the file the plugin could generate an md5 hash of all the files in the site folder when serverless client deploy is run. On subsequent deploys when the plugins sees contents in the file, it could compare the hashes and only upload new files.

After this, then the plugin could update the hashfile after each deployment so it can be used again next time. This does get pretty weird if this state gets messed up somehow though. We could also consider storing the hashfile in the bucket itself at the top level and calling it something like slsfinchhashfile.json and just pulling it down before deployments and updating it and reuploading it after deployments with the new md5 hashes.

I think this can all be done locally without the Lambda Function and makes sense to split off that way but I'm open to counter proposals.

The Lambda Function uploader

My main concern with the Lambda function is that it adds a lot of potential complexity. Even in the proposals above, it sounds like the Lambda function would then have to duplicate all the functionality of the existing plugin? For example:

Would the Lambda function mimic all the functionality of the existing local plugin?
Would it set object headers?
Respect download order?

Even supporting just a few of the options seem like an architectural overhaul to how the plugin works. I'm not opposed to that, but we'd have to bump a major version and backfill the support for all the existing functionality from inside the Lambda function.

If we did go this route here are a few things I'd want to keep in mind:

We should try to only have the user run serverless client deploy - everything should happen without requiring a separate serverless deploy or serverless invoke. We can enable this with some of the other suggestions below I think.
If we create a standard Lambda function to do the decompressing/md5 checking/uploading then we should create it from the plugin itself without having to rely on the Framework to deploy it (Not using the Framework!? Barbaric! I know! But it will mean less steps for the user and we save time on the CloudFormation side)
If we create a function that does all the unzipping/md5 checking/uploading it will need some inputs passed to it from the serverless.yml configuration to actually work. Presumably the architecture changes to the entire process mean that everything changes a LOT so here's what I see it could look like:

Local plugin handles:

Parsing where the site files actually are on the local machine
Potentially creating/delete the hashfile option mentioned above
Zipping up the contents of the site folder
Creating/configuring the S3 bucket
Uploading the site files zip file to the s3 bucket
Creating the Lambda function itself - probably we just have a bundled zip file with the plugin package that is used as a deployment artifact.
Invoking the Lambda function with configuration options from the serverless.yml file

The Lambda function then handles everything else:

The remaining upload process including all the additional configuration
Object headers
Upload order
Key prefixing
Encryption
Respecting all the other --no-* deployment flags and serverless options. (review this project's readme, there's a decent bit of it that is handled at upload time.)
Md5 checking
Updating the hashfiles
Uploading the files with all the metadata they need

fernando-mc commented 4 years ago

To summarize my above comments, I think the Creating md5 sums for all the files and uploading only changed files might be a good intermediary step that can be implemented now.

But I am wondering how the Lambda function would support all the current functionality/options/config of the plugin?

el2ro commented 4 years ago

I am pretty much following your thoughts.

I think that it would be better to leave out the Lambda functionality. It does not give enough benefits compared to all the troubles it presents.

I agree that using a hash file approach can cause some nasty syncing issues, but there should be user check weather use versioned upload or not. There is one more benefit that could be bundled into this approach. According to S3 documentation, there is a possibility to add integrity checks for the file uploads using the same file hash. https://aws.amazon.com/premiumsupport/knowledge-center/data-integrity-s3/

fernando-mc / serverless-finch