aws / ec2-image-builder-roadmap

Public Roadmap for EC2 Image Builder.

Provide WebDownload Checksum algorithms consistent with shasum algorithms #96

Closed by jrstarke 6 months ago

jrstarke commented 8 months ago

Tell us about your request What do you want us to build?

WebDownload checksum algorithms that are consistent with those of shasum.

Tell us about the problem you're trying to solve. What are you trying to do, and why is it hard? What outcome are you trying to achieve, ultimately, and why is it hard/impossible to do right now? What is the impact of not having this problem solved? The more details you can provide, the better we'll be able to understand and solve the problem.

I'm trying to verify that the artifacts I download as part of my build match what I expect and haven't been tampered with. WebDownload's SHA256 algorithm consistently yields a different result than sha256sum does for the same artifact.

Are you currently working around this issue? How are you currently solving this problem?

Run the image recipe once, wait for it to fail, get the checksum that it has for the same resource and feed it back in. This doesn't inspire trust in the resources though, as I can't verify that it's actually getting the same thing I'm getting.

Additional context Anything else we should know?

Wrote this up on Re:Post

Attachments If you think you might have additional information that you'd like to include via an attachment, please do - we'll take a look. (Remember to remove any personally-identifiable information.)

jrstarke commented 8 months ago

According to David Cuthbert on my Re:Post question, it appears that WebDownload is also unzipping the gz file before computing the checksum. The value I found in the logs was consistent with the checksum of the unzipped tar file.

jrstarke commented 8 months ago

The second thing that I downloaded with WebDownload had the correct SHA. It seems that https://www.pdflib.com/binaries/PDFlib/1001/PDFlib-10.0.1-Linux-x64-php.tar.gz is served with a Content-Encoding of x-gzip, so it gets unzipped on download. The other site doesn't send a Content-Encoding header, so its file doesn't get unzipped.
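To illustrate why the two checksums disagree, here is a minimal Python sketch. It hashes a gzip stream as stored on the server, then hashes the same stream after decoding, which is roughly what a client does when it honors Content-Encoding: x-gzip. The payload bytes and the helper name `sha256_hex` are assumptions made for the example; this is not the real PDFlib archive, and it is not how AWSTOE is implemented.

```python
import gzip
import hashlib


def sha256_hex(data: bytes) -> str:
    """Return the SHA-256 digest of data as lowercase hex, the form sha256sum prints."""
    return hashlib.sha256(data).hexdigest()


# Hypothetical stand-in for the inner .tar payload (not the real archive).
tar_bytes = b"example tar payload\n" * 64

# The bytes stored on the server, which is what sha256sum hashes after a plain download.
# mtime=0 keeps the gzip header identical across runs.
gz_bytes = gzip.compress(tar_bytes, mtime=0)

# The bytes a client ends up with if it decodes the gzip content encoding first.
decoded_bytes = gzip.decompress(gz_bytes)

digest_of_gz = sha256_hex(gz_bytes)
digest_of_decoded = sha256_hex(decoded_bytes)

print("sha256 of .tar.gz bytes:", digest_of_gz)
print("sha256 of decoded bytes:", digest_of_decoded)
print("digests match:", digest_of_gz == digest_of_decoded)
```

The decoded digest equals the digest of the original tar payload, not of the .tar.gz file, which is consistent with the value seen in the logs.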

austoonz commented 8 months ago

Thank you for reporting this!

Confirming this is definitely a bug. I've cut a ticket internally to the team and will keep you posted on the fix.

dacut commented 8 months ago

Ugh. @jrstarke, if they're sending a Content-Encoding: x-gzip header, then this is not a bug in EC2 Image Builder.

From the MDN docs on Content-Encoding:

If the original media is encoded in some way (e.g. a zip file) then this information would not be included in the Content-Encoding header.

Image Builder is acting correctly here. The original site needs to drop the Content-Encoding header, or the URL should end with .tar, not .tar.gz.

austoonz commented 6 months ago

@dacut - whether or not Image Builder is acting correctly given the content encoding of the source, we've treated this as a bug. It's painful and customers shouldn't have to deal with this.

@jrstarke - This is now resolved. The following component YAML now runs successfully with the latest versions of the AWSTOE binary (published within Image Builder and to the S3 buckets outlined in the AWSTOE Downloads section of the service documentation).

schemaVersion: 1.0
phases:
  - name: build
    steps:
      - name: download
        action: WebDownload
        inputs:
          - source: https://www.pdflib.com/binaries/PDFlib/1001/PDFlib-10.0.1-Linux-x64-php.tar.gz
            destination: /tmp/PDFlib-10.0.1-Linux-x64-php.tar.gz
            algorithm: SHA256
            checksum: 31c589c76d96965ddeec3e3d89c0bf5322513dbe3f523dcc8d2352c6167cdc71

dacut commented 6 months ago

@austoonz - The edge case you may have to deal with is potentially accepting multiple checksums (for the .tar and .tar.gz). If the original is a .tar file and the server decides to compress it on-the-fly (the intent of the Content-Encoding header), the checksums won't match.

While raw .tar files are unusual, the typical case I've seen is a file containing multiple binaries that are already compressed (or compressed + encrypted), such as firmware or media.
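A rough Python sketch of the "accept multiple checksums" idea: the check passes if the downloaded bytes match any digest in an allow-list, so it succeeds whether the body arrives gzip-encoded or already decoded. The function name `matches_any_checksum` and the sample payload are hypothetical; this is not an existing WebDownload option.

```python
import gzip
import hashlib
from typing import Iterable


def matches_any_checksum(data: bytes, accepted_hex: Iterable[str]) -> bool:
    """Return True if the SHA-256 of data equals any of the accepted hex digests."""
    actual = hashlib.sha256(data).hexdigest()
    # Normalize case and stray whitespace so pasted digests compare cleanly.
    normalized = {digest.strip().lower() for digest in accepted_hex}
    return actual in normalized


# Hypothetical payload standing in for a raw .tar file.
tar_bytes = b"firmware blob\n" * 32
# The same payload as it would look if the server compressed it on the fly.
gz_bytes = gzip.compress(tar_bytes, mtime=0)

# Allow-list holding the digest of each representation.
accepted = [
    hashlib.sha256(tar_bytes).hexdigest(),
    hashlib.sha256(gz_bytes).hexdigest(),
]

print(matches_any_checksum(tar_bytes, accepted))    # raw representation
print(matches_any_checksum(gz_bytes, accepted))     # compressed representation
print(matches_any_checksum(b"tampered", accepted))  # anything else is rejected
```

The trade-off is that the caller has to compute and pin a digest for each representation up front, so both values must come from a trusted copy of the artifact.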

austoonz commented 6 months ago

@dacut - definitely good to know, thank you. I'll note this to the team for awareness for now.

I'd imagine WebDownload is far more often used to download .tar.gz files (as they are more common), so I suspect the raw .tar case is something we'll evaluate and solve if (or, more likely, when) customers run into the issue you describe.

At least there is still the workaround of using curl or wget directly for any scenario where WebDownload isn't working as intended.