aws / aws-cli

Universal Command Line Interface for Amazon Web Services

cloudformation package is always generating a new zip #3131

Open izidorome opened 6 years ago

izidorome commented 6 years ago

I have a Golang lambda with the following template:

AWSTemplateFormatVersion : '2010-09-09'
Transform: AWS::Serverless-2016-10-31
Description: Billing Api Create Application

Resources:
  BillingCreate:
    Type: AWS::Serverless::Function
    Properties:
      FunctionName: billing-create
      Handler: main
      CodeUri: ./build
      Runtime: go1.x
      Policies: AWSLambdaDynamoDBExecutionRole

Even when the code didn't change (go build produces the same compiled binary), the aws cloudformation package command generates a new zip file.

kyleknap commented 6 years ago

Could you elaborate a little more on why it is an issue that a new zip is created for every package command? It may be difficult to avoid making a new zip every time, because the package command takes an md5 of the zip it creates and compares it against the existing object's md5 to decide whether it needs to re-upload the code to S3.

izidorome commented 6 years ago

Imagine a scenario where you have a cloudformation file with more than one Lambda declared. For now, let's call it FN1 and FN2, one is at fn1.go file and the second at fn2.go.

I build both of them, which generates two binaries fn1 and fn2.

I run cloudformation package, and it generates 2 zip files and send them to S3.

One week later, I change the fn1 function, but not the fn2. My CI builds both of them, but only the first has a different MD5 (the second has the same MD5 as before).

The problem here is that the package command will generate a new zip for the second one too, even though the file did not change, which causes every function declared in my CloudFormation template to be redeployed.

jakul commented 6 years ago

I'm having the same issue with Python code. Every time I run aws cloudformation package it creates/uploads a new zip file and changes the CloudFormation template

jakul commented 6 years ago

@rizidoro Can you download the zip files from S3, unzip them locally, and diff them? It turns out I had one file which was actually different, because it included a "generated at" date that was updated every time I built the CloudFormation script.

jakul commented 6 years ago

You also need to check for timestamp differences amongst the files

jmassara commented 6 years ago

the package command does an md5 of the zip it creates to see if it needs to reupload the code to s3 by comparing the two

That is exactly the issue. If timestamps on files are different in the zip file, even if they are the same contents, the md5 is different. In the case of scripting languages, this is probably not an issue. However with Go, each time you run go build, a new binary is created and thus a new timestamp.
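The timestamp effect is easy to demonstrate with Python's standard zipfile module (an illustration of the claim above, not aws-cli's actual code):

```python
import hashlib
import io
import zipfile

def zip_bytes(timestamp):
    # Build an in-memory zip holding one file whose contents never
    # change, but whose entry carries the given modification time.
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as zf:
        info = zipfile.ZipInfo("main", date_time=timestamp)
        zf.writestr(info, b"same compiled binary contents")
    return buf.getvalue()

a = zip_bytes((2020, 1, 1, 0, 0, 0))
b = zip_bytes((2020, 1, 2, 0, 0, 0))

# Identical payload, different archive hash: the zip entry header
# embeds the timestamp, so the md5 of the whole archive changes.
print(hashlib.md5(a).hexdigest() == hashlib.md5(b).hexdigest())  # False
```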

This is especially troublesome if you are trying to use CodePipeline and CodeBuild (see https://docs.aws.amazon.com/lambda/latest/dg/automating-deployment.html) because no matter what, package is always going to create a zip with a different md5.

Perhaps package should md5 each file in the zip instead of the zip as a whole. As it is now, it's not an accurate comparison.
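A rough sketch of that per-file approach (my own illustration, not a patch to aws-cli; it walks the tree in sorted order and hashes relative paths plus contents, so timestamps drop out entirely):

```python
import hashlib
import os

def content_checksum(root):
    # Deterministic digest over a directory tree: hash each file's
    # relative path and its bytes, in sorted order. File timestamps
    # and zip metadata play no part, so rebuilding an identical
    # binary yields an identical checksum.
    digest = hashlib.md5()
    for dirpath, dirnames, filenames in os.walk(root):
        dirnames.sort()  # make the walk order reproducible
        for name in sorted(filenames):
            path = os.path.join(dirpath, name)
            rel = os.path.relpath(path, root)
            digest.update(rel.encode("utf-8"))  # renames change the hash
            with open(path, "rb") as f:
                digest.update(f.read())
    return digest.hexdigest()
```

Using a digest like this as the S3 key, instead of the md5 of the zip, would make the upload a no-op whenever the tree's contents are unchanged.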

izidorome commented 6 years ago

@jmassara That's exactly the problem I'm facing right now. The final binary that go build generates changes the timestamp.

jmassara commented 6 years ago

@rizidoro Yes. This is a bug with package. It should probably create a temporary file that has a list of the md5 hashes of all files going into the zip. Then md5 this temporary file and use that value as the name of the S3 object.

atamgp commented 6 years ago

I have the same issue. Have a CodeCommit repo with a sam.yml containing multiple lambdas.

When I run the AWS CLI twice in a row from my VM, the first run uploads a zip for every Lambda. The second run does nothing because nothing changed, which is correct.

But doing exactly the same thing from CodePipeline/CodeBuild (aws cloudformation package ...) does not work. You can trigger the pipeline with "Release change" without needing a commit. It starts an AWS CLI Docker container for CodeBuild, fetches the input sources from S3, and unzips them. It then calls cloudformation package, which DOES re-upload unchanged code for every Lambda, causing a redeployment in the next steps.

  1. How does anyone using CodePipeline with Lambdas not run into this?
  2. It seems that fetching unchanged sources from S3, unzipping them, and running package leads to a different MD5, which is NOT OK.

Does anyone know a workaround and when this bug will be fixed?

paul-wilkinson commented 6 years ago

I am having the same issue. I'm finding reviewing CloudFormation change sets painful because they are polluted with changes to Lambda resources that didn't materially change.

rmmeans commented 6 years ago

I'm seeing the same problem as @jmassara reported above, with Node. This one is painful for us because we are trying to use CodePipeline to deploy Lambda@Edge functions with the CDN in the stack. Even if we don't touch the functions, the CLI thinks the files changed during packaging, resulting in a CDN update (a ~15-minute wait) even when nothing in the function code changed. It is far more than just an unnecessary version publish in the change set: it slows the entire CD process down unnecessarily because of how slow CloudFront updates are.

vaibhavkewl commented 6 years ago

Hi, is there any progress on this feature request? Comparing the md5sum of each file within the zip, instead of the md5sum of the zip file, sounds like a good solution to this problem. We would appreciate your thoughts and a possible fix. We have a CI/CD pipeline with many Lambda functions, and this problem causes a new version of each Lambda to be deployed unnecessarily every time.

mruckli commented 5 years ago

We are also facing this exact issue.

Umkus commented 5 years ago

@rmmeans I have exactly the same issue. This not only slows down deployments, but also rollbacks.

okovalov commented 5 years ago

Guys, my question is not 100% related to this particular bug (I bypassed it by having separate Lambdas), but there is something I really can't get past, and I'm giving up on it. I would really appreciate any help or suggestions. Please take a look at this error:

(screenshot of the error)

The package command fails when I have too many dependencies in my package.json, and unfortunately, due to the nature of the Lambda, there is no way to decrease the number of files.

So, is there any way to actually run it with zip64 support? Please help, I've already given up on this...

bjorg commented 5 years ago

The solution may depend on the programming language (and therefore, potentially not possible for some). We solved it in the λ# CLI as follows:

.NET Core has a deterministic build system, which means that if the source files and NuGet packages have not changed, the resulting compiled binaries remain identical as well. During the build phase of the package, the CLI creates a checksum of the file contents and filenames instead of the ZIP file itself; the latter contains dates & timestamps that would cause the checksum to change with every build. The result is a package filename that only changes when the underlying code changes, which in turn only updates Lambda functions (or Lambda layers) when required.

Anheurystics commented 5 years ago

Any updates on this issue?

dan-lind commented 5 years ago

I'm facing the exact same problem

wmonk commented 5 years ago

I've also been suffering this issue. I am using the sam-cli and have been trying to optimise the time to run sam package and sam deploy. So far I've got to a nice place using a node script to pre-package each of the 29 lambdas into their own directory with the required node_modules. This is important so that I can make code changes in one file, then run deployment, and it'll very quickly deploy the lambdas for which that file change was necessary. Best case it'll affect 1 lambda and my deployment will take a few seconds.

As per the rest of the conversation in this issue, the md5 of the zip is different each time. Here is a demonstration:

~/C/t/test ❯❯❯ mkdir out
~/C/t/test ❯❯❯ touch out/test
~/C/t/test ❯❯❯ echo "Hello world" > out/test
~/C/t/test ❯❯❯
~/C/t/test ❯❯❯ md5 out/test
MD5 (out/test) = f0ef7081e1539ac00ef5b761b4fb01b3

~/C/t/test ❯❯❯ zip -rqX out.zip out
~/C/t/test ❯❯❯ md5 out.zip
MD5 (out.zip) = 5f28021c0b6fc266abbfb1b36870fa1d
~/C/t/test ❯❯❯
~/C/t/test ❯❯❯ zip -rqX out2.zip out
~/C/t/test ❯❯❯ md5 out2.zip
MD5 (out2.zip) = 5f28021c0b6fc266abbfb1b36870fa1d
~/C/t/test ❯❯❯ # Same md5!

~/C/t/test ❯❯❯ echo "Hello world" > out/test
~/C/t/test ❯❯❯ md5 out/test
MD5 (out/test) = f0ef7081e1539ac00ef5b761b4fb01b3
~/C/t/test ❯❯❯ # Same md5 for file!

~/C/t/test ❯❯❯ zip -rqX out3.zip out
~/C/t/test ❯❯❯ md5 out3.zip
MD5 (out3.zip) = 1a8ec423697ce9c657b6f1c12c51476f
~/C/t/test ❯❯❯ # Different zip file md5!

Digging into the source code for the zipping + uploading functionality you can see that the code walks the file tree and adds each file to the zipfile: https://github.com/aws/aws-cli/blob/384ae0aec97a706d1ff9ca9ce206dc93c9667038/awscli/customizations/cloudformation/artifact_exporter.py#L183-L196

My proposal would be that, in this step, it also md5s each file being added to the zip, and then finally md5s the combined digests. I'm not sure what the performance impact of doing this would be, but it should make the final deployment significantly faster for this kind of workflow.


I've tested locally on a lambda with a small 😛 sized node_modules, total directory size ~20mb:

~/C/g/a/.s/Api ❯❯❯ time find . -type f -exec md5 \{\} >> ../out.md5 \;
       10.51 real         3.18 user         6.76 sys
~/C/g/a/.s/Api ❯❯❯ md5 ../out.md5
MD5 (../out.md5) = 6e6584c968e3974b60ba7b4e244a84b5

This was for 3098 files.

bjorg commented 5 years ago

Yes, that's close to how it's done in λ# for the .NET zip packages. Make sure to sort the files by their full path first, then MD5 the file contents and the file path. If you omit the latter, the MD5 doesn't change when you change capitalization of a file!

See details at https://github.com/LambdaSharp/LambdaSharpTool/blob/9767b96fda1c459f21ebf68c1dd18670970c012d/src/LambdaSharp.Tool/Internal/StringEx.cs#L164

wmonk commented 5 years ago

@stealthycoin would there be any appetite for a PR implementing this?

wmonk commented 5 years ago

@stealthycoin any update on this? I'd be happy to take a crack at a PR to implement the behaviour discussed.

hatim-heffoudhi commented 5 years ago

Hello guys, any updates please? :) I'm facing the same issue: I have multiple Lambdas in a monorepo, and once I update one Lambda, sam package generates new S3 zip files for the others even though I didn't make any changes. Is this a bug or a feature request?

gpiccinni commented 4 years ago

Hi all, I've created a pull request which seems to solve the issue we were facing: it computes the checksum over the entire function content (after installing all requirements) rather than over the resulting ZIP file (the current behavior). The main difference is that the checksum of the ZIP changes every time the file is created (it takes into account file mtime and ctime), even when there is no actual change in the file contents.

It would be great if this pull gets accepted and merged. Thanks. G

wmonk commented 4 years ago

@gpiccinni I implemented a similar solution to yours in September here https://github.com/aws/aws-cli/pull/4526, but unfortunately nothing ever came of it.

gpiccinni commented 4 years ago

@wmonk Many thanks for pointing this out. By looking at your pull request, I realized that in my case the checksum does not change when filenames change (which, in my opinion, it should), whereas your code already addresses this!

I'll look into other libraries such as dirhash, where the filename and path are included in the checksum, and possibly update my pull request.

Thanks, G

hatim-heffoudhi commented 4 years ago

@gpiccinni, awesome, and thanks! I hope your PR can be merged quickly; this could fix a lot of pipelines.

rsodha commented 4 years ago

That is exactly the issue. If timestamps on files are different in the zip file, even if they are the same contents, the md5 is different. In the case of scripting languages, this is probably not an issue. However with Go, each time you run go build, a new binary is created and thus a new timestamp.

@jmassara This problem exists for scripting languages too. I am facing the same problem with Node.js Lambdas. It looks like it is due to the zip headers. Have a look at this Stack Overflow discussion.

rehanvdm commented 4 years ago

Well, the CDK team doesn't seem to have this problem. Find out what they are doing and do the same.

wmonk commented 4 years ago

After being frustrated by this issue for a while, I've fixed it in my own deploy scripts. Hopefully this can help others, and maybe attract some optimisations! I'm not sure if this is the "right" way to do it, but it's been working fine for us. One big benefit I've found is that I can make config changes without having to redeploy every function whose code hadn't changed.

find src -type f -exec md5sum {} \; > tmp-md5
find node_modules -type f -exec md5sum {} \; >> tmp-md5
CODE_MD5=$(md5sum tmp-md5 | cut -c 1-32)

if [ ! -f "$CODE_MD5.zip" ]; then
    zip -q -r $CODE_MD5 src node_modules # more files here; zip appends the .zip extension
fi

aws s3 ls s3://bucket-name/$CODE_MD5 || aws s3 cp $CODE_MD5.zip s3://bucket-name/$CODE_MD5

sam deploy --parameter-overrides CodeUriKey=$CODE_MD5

And in the template:

Parameters:
  CodeUriKey:
    Type: String
    NoEcho: true

Lambda:
  Type: AWS::Serverless::Function
  Properties:
    CodeUri:
      Bucket: bucket-name
      Key: !Ref CodeUriKey

rsodha commented 4 years ago

I have found another workaround (may be easier for those who have many lambda functions in one pipeline) to this issue.

The key to this workaround was finding out what makes the md5 of a zip differ even when the contents of the files within it have not changed. I found the files' modified timestamps to be the culprit. So the idea is: if we can give all files a consistent modified timestamp just before the aws cloudformation package or sam package command is run, the produced zip files will have a consistent md5 across build executions.

find . -exec touch -m --date="2020-01-30" {} \; # the date does not matter as long as it is never changed
aws cloudformation package --template-file template.yml --s3-bucket <bucket> --output-template-file package-template.yml

Above trick has worked for me so far.
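For build scripts that can't rely on GNU touch, the same normalization can be sketched in Python (my own equivalent of the command above; the date is arbitrary but, as noted, must never change):

```python
import os
import time

def normalize_mtimes(root, stamp="2020-01-30"):
    # Pin every file's access and modification time to a fixed date,
    # so repeated zips of an unchanged tree come out byte-identical.
    epoch = time.mktime(time.strptime(stamp, "%Y-%m-%d"))
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            os.utime(os.path.join(dirpath, name), (epoch, epoch))
```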

Al-tekreeti commented 4 years ago

does not work for me

ShengHow95 commented 3 years ago

I have a similar issue, but with Lambda layers instead. I have my template in CodeCommit, and I created a CodePipeline with a CodeBuild step that automates the cloudformation package and deploy process. However, every time there is any change in CodeCommit, even when the Lambda layer did not change, it still creates a new Lambda layer.

Anyone here has got any alternatives?

ryancabanas commented 3 years ago

I have this same issue. While the suggestion from @rsodha does work to prevent most duplicate packages from being uploaded by the aws cloudformation package command, the AWS::Serverless::LayerVersion layer that I've created keeps getting re-uploaded, even when there are no package changes. I believe the reason is due to the CODEBUILD_SRC_DIR path, which is different every time an AWS::CodeBuild::Project is generated as part of my CodePipeline run. This CODEBUILD_SRC_DIR path is saved inside the package.json files that are created when I download the needed npm packages for my Node Lambdas (but doesn't appear to be an issue for the Python packages). Because of this, the layer hash is always different and, therefore, gets re-uploaded every time.

If there were a way we could manually set the CODEBUILD_SRC_DIR path to a static value every time the AWS::CodeBuild::Project is generated in the CodePipeline's CloudFormation template, then that might be a solution to this issue.

ryancabanas commented 3 years ago

After many attempts, I still could not prevent a new Lambda Layer from being generated during each CodePipeline run. I tried the following:

I've downloaded a couple of Lambda layer .zip files that didn't change between CodePipeline runs and checked their MD5 hashes; they are indeed different for some reason. The file sizes are different too (for example, 16,461,107 bytes vs. 16,461,114 bytes), but I can't figure out what the differences are: I've unzipped both and performed a directory comparison using the comparison tool Meld, and it reports no file differences.

So, I'm out of ideas as to why a new Lambda Layer is always generated and how to stop this from happening.

Any other ideas out there? Thanks.

bjorg commented 3 years ago

@ryancabanas The file dates are probably different. Different sizes can also mean different compression levels. I had to solve this problem for LambdaSharp.Net as well. You have to MD5 only the file paths and file contents in the ZIP file to make the process idempotent.

ryancabanas commented 3 years ago

@bjorg Thanks for helping! I am using the suggestion above from @rsodha and resetting the modified date for all the files, so they are consistent in that respect from build to build.

Any suggestions on how to go about determining what else could be different between the files from build to build? Thanks!

bjorg commented 3 years ago

@ryancabanas I'm not sure, but isn't there both a modified and a created timestamp on files? Could that be it? Do folders have timestamps? Does the zip file itself have an internal timestamp?

I'd recommend you write a little app that opens both zips and compares the metadata of all entries. If the files are the same, it must be the metadata. Most zip libraries are pretty easy to use; it's almost identical to comparing two folders. This might be frustrating, but so is guessing blindly.
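A minimal version of such an app, using Python's zipfile module (a sketch along the lines described above; the field names are taken from Python's ZipInfo):

```python
import zipfile

def diff_zip_metadata(path_a, path_b):
    # Print entries whose stored metadata differs between two
    # archives, even when the decompressed contents are identical.
    with zipfile.ZipFile(path_a) as za, zipfile.ZipFile(path_b) as zb:
        entries_a = {i.filename: i for i in za.infolist()}
        entries_b = {i.filename: i for i in zb.infolist()}
        for name in sorted(set(entries_a) | set(entries_b)):
            a, b = entries_a.get(name), entries_b.get(name)
            if a is None or b is None:
                print(f"{name}: only in one archive")
                continue
            for field in ("date_time", "compress_type", "external_attr", "CRC"):
                va, vb = getattr(a, field), getattr(b, field)
                if va != vb:
                    print(f"{name}: {field} differs ({va} vs {vb})")
```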

Sorry I couldn't be of more assistance.

ryancabanas commented 3 years ago

@bjorg Okay. I'll dig further in the ways you've mentioned. Thanks!

kyptov commented 3 years ago

@ryancabanas did you try aws-cdk? It looks like it generates the same hash for the same contents each time.

rehanvdm commented 3 years ago

CDK fanboy here. They don't have this problem; cdk-assets does things like normalizing file dates and line endings before zipping.

But @ryancabanas, what you are describing (CODEBUILD_SRC_DIR being different) has an impact on package.json. TL;DR: it is the wild wild west inside the node_modules directory; it mutates after installation, and that is the cause of non-deterministic hashing.

Some packages embed the absolute path in their package.json after installation, and because CODEBUILD_SRC_DIR is different, that forces the package.json to be different. I wrote about it here: https://www.rehanvdm.com/blog/cdk-shorts-1-consistent-asset-hashing-nodejs It is not actually a CDK or CFN problem but rather an npm one.

The solution is either to remove the package.json from every node_modules/ package so that it is excluded when the hash is calculated, or, better, to use bundling: a tool like esbuild tree-shakes and bundles all your code into a single .js file. That is then the only file in the zip, so there is no package.json anywhere.

ryancabanas commented 3 years ago

@rehanvdm Thanks for your article! Yes, what you said about the package.json metadata, namely the CODEBUILD_SRC_DIR path, is exactly what I discovered. I performed a test where, in CodeBuild, before anything else, I changed the src... folder name to a consistent name (for example, I always change it to src123456789), and this resulted in .zip file contents that were the same from build to build; but a new Lambda layer is still uploaded every time, even when it hasn't changed. I also used the suggestion above and changed the dates of all the files to a consistent date, but this hasn't solved the problem either.

I'm new to development and AWS, so I haven't used CDK before, or bundling. I will have to look into these. Thanks for the help!

ryancabanas commented 3 years ago

Got it!

So I used the folder-hash package that @rehanvdm mentioned in his article, and it helped reveal differences between my Lambda layer assets. I had already taken care of the CODEBUILD_SRC_DIR issue in the package.json files for Node, but I'm also using a couple of Python packages, and it turns out the .pyc files in the __pycache__ folders differ from build to build. So after installing the packages, I delete these .pyc files, and now no more unnecessary Lambda layers are being created and uploaded! Thanks for the help!
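For anyone hitting the same __pycache__ problem, a small cleanup pass before hashing and packaging might look like this (a sketch of the deletion step described above; the layout it assumes is a typical installed-packages tree):

```python
import pathlib

def strip_bytecode(root):
    # Delete compiled bytecode so it can't perturb the package hash;
    # Lambda regenerates .pyc files at runtime anyway.
    root = pathlib.Path(root)
    for pyc in root.rglob("*.pyc"):
        pyc.unlink()
    # Remove any __pycache__ directories left empty by the deletions.
    for cache in sorted(root.rglob("__pycache__"), reverse=True):
        if cache.is_dir() and not any(cache.iterdir()):
            cache.rmdir()
```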

KyleThen commented 3 years ago

For me, the issue was that I was creating the bundled zip using Linux's zip command. I needed to use the -X option so it didn't add all the extra attributes to the created zip. I also deleted the .pyc files and set the last modified date of all the files to the same value, so I'm not positive which combination of these is needed.

ConnorKirk commented 2 years ago

I've also encountered this issue when using CodeBuild to package Lambda functions and layers in a CloudFormation template.

As a workaround: the sam CLI does not seem to have this behaviour (anymore?), and it is included in the aws/codebuild/standard:6.0 CodeBuild image. I was able to swap aws cloudformation package for sam package in my CodeBuild buildspec to work around this issue.

jtheuer commented 1 year ago

I still have the same problem with aws cloudformation package for Lambda functions that point to a local .py file. Even setting the mtime of my source files to a fixed date didn't help: touch -a -m -t"201001010000.00".

The generated zip file always has a different checksum.

What I would like is this: when running cloudformation package and cloudformation deploy on the same source files, cloudformation must not re-deploy unchanged resources.

Are you able to implement that?