aws-solutions / serverless-image-handler

A solution to dynamically handle images on the fly, utilizing SharpJS
Apache License 2.0

Improving cache hit ratio #304

Open ddonahue99 opened 3 years ago

ddonahue99 commented 3 years ago

My company recently deployed the serverless image handler, and it was a breeze - nice work! One thing we've noticed that has been a little surprising is a lower than expected CloudFront cache hit ratio, and we'd love to be able to get the Lambda costs down. My assumption is that the serverless image handler is caching at each CloudFront edge location, so for a given image requested from several places around the globe, it will need to hit the lambda multiple times. Over time, those cached items will expire and will need to be re-hydrated again. Is that correct?

Assuming that's what's going on, a couple options come to mind for optimizing the hit ratio:

1) Cache the converted images in S3, rather than relying solely on the CloudFront cache. Storage costs would be higher, but it would need to hit the lambda exactly once for a given set of image parameters. This would obviously require some fundamental changes to the serverless-image-handler.

2) For a lighter approach, would CloudFront Origin Shield solve this problem? Would need to crunch the numbers to evaluate cost implications, but it seems like it exists for this sort of use case.

Thanks in advance for any guidance, and please let me know if there are any other options I am not considering.

gattasrikanth commented 3 years ago

Thanks for using Serverless Image Handler Solution. I have added this to our backlog items list and our dev team will look into possible solutions to optimize.

Buthrakaur commented 2 years ago

Hi @ddonahue99, do you have any experience with CloudFront Origin Shield yet? I'm thinking about using it too, to at least limit the Lambda execution count/time.

ddonahue99 commented 2 years ago

Hi @Buthrakaur - Since posting this, I've made a few changes that have greatly improved the hit ratio, including enabling Origin Shield (which resulted in a modest improvement).

The more notable impact, however, was modifying the CloudFront cache settings. I bumped the TTL up to the max (1 year) and changed the cache key to not include the origin and accept headers. From what I could tell, the accept header is part of the cache key by default for the AUTO_WEBP setting, which makes sense, because depending on the client, the response could be webp or jpeg or whatever other fallback you specify. If you are not using AUTO_WEBP, the response will always be the same, so it doesn't make sense to have roughly one cache entry per major browser:

Example of how accept headers vary by browser:

firefox = image/webp,*/*
safari = image/webp,image/png,image/svg+xml,image/*;q=0.8,video/*;q=0.8,*/*;q=0.5
chrome = image/avif,image/webp,image/apng,image/svg+xml,image/*,*/*;q=0.8
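
To make the fragmentation concrete, here is a minimal sketch. It models a cache key as the URL path plus the values of whichever headers are configured as part of the key. The `cacheKey` helper, the `HeaderMap` type, and the sample path are hypothetical and only approximate how CloudFront composes its keys.

```typescript
// Hypothetical model: cache key = path + values of the headers included in the key.
type HeaderMap = Record<string, string>;

function cacheKey(path: string, headers: HeaderMap, keyHeaders: string[]): string {
  const parts = keyHeaders.map((header) => `${header}=${headers[header] ?? ""}`);
  return [path, ...parts].join("|");
}

// Accept header values quoted in the list above.
const browserHeaders: HeaderMap[] = [
  { accept: "image/webp,*/*" }, // firefox
  { accept: "image/webp,image/png,image/svg+xml,image/*;q=0.8,video/*;q=0.8,*/*;q=0.5" }, // safari
  { accept: "image/avif,image/webp,image/apng,image/svg+xml,image/*,*/*;q=0.8" }, // chrome
];

const samplePath = "/fit-in/800x800/sample.png"; // illustrative path only

// Counts how many distinct cache entries the three browsers produce for one image.
function distinctKeys(keyHeaders: string[]): number {
  return new Set(browserHeaders.map((h) => cacheKey(samplePath, h, keyHeaders))).size;
}

const withAccept = distinctKeys(["accept"]); // one entry per browser
const withoutAccept = distinctKeys([]); // a single shared entry
console.log({ withAccept, withoutAccept });
```

With the accept header in the key, each browser gets its own entry for the same image; without it, all three share one.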

With all of these changes, my application was hovering around a 70-75% hit ratio and is now closer to 96%.

Ultimately, the better solution for optimizing the hit ratio would be to permanently cache the output in S3. I'd still love to see that as a built-in option to this template. 🙏

fvsnippets commented 2 years ago

Hi!

Noticed another possible improvement:
The current cache policy (the one provided by the current version of serverless-image-handler) enables gzip compression, which in turn adds the "accept-encoding" header to the cache key. But the origin (the resizing Lambda) won't use it (am I missing something?), which makes sense because we are working with already-compressed image formats.
Notice that CloudFront enables automatic compression only on image/svg+xml (see this), which also makes sense.
So, with the current cache policy, two requests that are otherwise "equivalent" from the perspective of CloudFront's cache key, differing only in the presence of the accept-encoding: gzip header, will generate different cache entries.

Please read this: "Cache Hit Ratio - Remove Accept-Encoding header when compression is not needed".

fvsnippets commented 2 years ago

Hi @ddonahue99, as far as I can see, the TTL is already one year when the image on S3 doesn't provide one, and when the S3 object does provide a TTL, that one is used. See this.

CloudFront will honor the cache-control header from the origin when it provides one.
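
As a rough sketch of that behavior: when the origin sends a max-age, CloudFront keeps it within the distribution's minimum and maximum TTL, and when the origin sends none, the default TTL applies. The `effectiveTtl` helper and the policy values below are illustrative assumptions, not the solution's actual settings, and the sketch ignores other directives such as s-maxage, no-cache, and no-store.

```typescript
// Hypothetical TTL settings of a cache policy, all in seconds.
interface TtlSettings {
  minTtl: number;
  defaultTtl: number;
  maxTtl: number;
}

// Extracts max-age (in seconds) from a Cache-Control header value, if present.
function parseMaxAge(cacheControl: string | undefined): number | undefined {
  if (cacheControl === undefined) return undefined;
  const match = /(?:^|,)\s*max-age=(\d+)/i.exec(cacheControl);
  return match ? Number(match[1]) : undefined;
}

// Approximates the TTL CloudFront would apply to a cached object.
function effectiveTtl(settings: TtlSettings, cacheControl?: string): number {
  const maxAge = parseMaxAge(cacheControl);
  if (maxAge === undefined) return settings.defaultTtl; // origin gave no max-age
  return Math.min(Math.max(maxAge, settings.minTtl), settings.maxTtl); // clamp to [min, max]
}

// Illustrative values only: 1 second minimum, 1 day default, 1 year maximum.
const illustrativePolicy: TtlSettings = { minTtl: 1, defaultTtl: 86400, maxTtl: 31536000 };

console.log(effectiveTtl(illustrativePolicy, "max-age=31536000,public"));
console.log(effectiveTtl(illustrativePolicy, "max-age=60"));
console.log(effectiveTtl(illustrativePolicy));
```

Under these assumptions, raising the maximum TTL only matters when the origin's max-age is larger than the previous maximum.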

Maybe I am missing something? Please give us some details (I am currently working on improving the hit ratio too).

fvsnippets commented 2 years ago

Notice: modifying the solution so that Origin Shield can be enabled optionally (via a parameter) is a little complicated with the current CDK definition (or at least I can't figure out a simple way). But a user of the solution could modify the provided template.yaml to add it. It's as simple as:

  BackEndImageHandlerCloudFrontApiGatewayLambdaCloudFrontToApiGatewayCloudFrontDistribution03AA31B2:
    Type: AWS::CloudFront::Distribution
    Properties:
      DistributionConfig:
        ...
        Origins:
          - CustomOriginConfig:
            ...
+           OriginShield:
+             Enabled: true
+             OriginShieldRegion: us-west-2
            ...

As a simple workaround, I would suggest that the maintainers add it to the documentation.

fvsnippets commented 2 years ago

One other possible optimization [0].
The following applies when we have NOT enabled a Default Fallback Image.

Thumbor requests without an id [1] (such as "/fit-in/120x120/") are forwarded to the origin.
The Lambda backend receives the request and responds with a 500 error including the message "Expected uri parameter to have length >= 1, but found \"\" for params.Key". First of all, I think this is erroneous behavior, because the error comes from the requester and not from the processing Lambda. I think it should be handled as a 400 or, arguably, a 404 status code [2]. But that's a topic for another issue.

The important thing here is that we could avoid making a request to the origin just by using a CloudFront Function (or a Lambda@Edge function) that matches the incorrect path.

I'll show an example:

+  BackEndCfnFunctionFB18E3BF:
+    Type: AWS::CloudFront::Function
+    Properties:
+      Name: fastNotFoundResponseFunction
+      AutoPublish: true
+      FunctionCode:
+        Fn::If:
+          - CommonResourcesEnableCorsConditionA0615348
+          - Fn::Join:
+              - ""
+              - - |-
+                  function handler(event) {
+                    // Notice: cannot modify body on fast responses from Cloudfront Functions. But we should be ok with that.
+
+                    if (event.request.method == 'GET') {
+                      var fastNotFoundPathsRegex = new RegExp('^/fit-in/[0-9]+x[0-9]+/?$');
+
+                      if (fastNotFoundPathsRegex.test(event.request.uri)) {
+                        return {
+                          statusCode: 404,
+                          statusDescription: 'Not Found',
+                          headers: {
+                            'content-type': { value: 'application/json' },
+                            'access-control-allow-methods': { value: 'GET' },
+                            'access-control-allow-headers': { value: 'Content-Type, Authorization' },
+                            'access-control-allow-credentials': { value: 'true' },
+                            'access-control-allow-origin': { value: '
+                - Ref: CorsOriginParameter
+                - |-
+                  ' }
+                          }
+                        };
+                      }
+                    }
+
+                    return event.request;
+                  }
+          - |-
+            function handler(event) {
+              // Notice: cannot modify body on fast responses from Cloudfront Functions. But we should be ok with that.
+
+              if (event.request.method == 'GET') {
+                var fastNotFoundPathsRegex = new RegExp('^/fit-in/[0-9]+x[0-9]+/?$');
+
+                if (fastNotFoundPathsRegex.test(event.request.uri)) {
+                  return {
+                    statusCode: 404,
+                    statusDescription: 'Not Found',
+                    headers: {
+                      'content-type': { value: 'application/json' },
+                      'access-control-allow-methods': { value: 'GET' },
+                      'access-control-allow-headers': { value: 'Content-Type, Authorization' },
+                      'access-control-allow-credentials': { value: 'true' }
+                    }
+                  };
+                }
+              }
+
+              return event.request;
+            }
+      FunctionConfig:
+        Comment: Returns not-found responses for frequently requested paths that are already known not to exist.
+        Runtime: cloudfront-js-1.0
      ...
  BackEndImageHandlerCloudFrontApiGatewayLambdaCloudFrontToApiGatewayCloudFrontDistribution03AA31B2:
    Type: AWS::CloudFront::Distribution
    Properties:
      DistributionConfig:
        ...
        DefaultCacheBehavior:
          ...
+          FunctionAssociations:
+            - EventType: viewer-request
+              FunctionARN:
+                Fn::GetAtt:
+                  - BackEndCfnFunctionFB18E3BF
+                  - FunctionARN
          ...

Notice:

Of course, this approach can also be applied to other paths that are known in advance to be always invalid or nonexistent (but frequently requested).

As with optionally enabling Origin Shield (see my previous message), enabling this code based on the EnableDefaultFallbackImageParameter value is a little complicated with the current CDK definition (or at least I can't figure out a simple way). But readers can make a custom modification of the base template.yaml.

[0] = strictly speaking, the proposal here is not a cache optimization: CloudFront Functions run before the cache is consulted. But it can avoid requests to the origin, which in the end achieves the same result.
[1] = and maybe non-Thumbor requests too; I don't know, because I don't use them and know almost nothing about them.
[2] = notice that the current CloudFront configuration caches 500 status code responses for ten minutes, whereas it caches 400/404 status code responses for only ten seconds.
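
Before deploying, the path check from the CloudFront Function above can be exercised locally. This is only a sketch: the regular expression is copied from the function, while `isFastNotFound` and the sample paths are hypothetical names and inputs made up for illustration.

```typescript
// Same pattern as in the CloudFront Function above.
const fastNotFoundPathsRegex = new RegExp("^/fit-in/[0-9]+x[0-9]+/?$");

// Hypothetical helper mirroring the function's decision:
// true means the request would be answered with a 404 at the edge.
function isFastNotFound(method: string, uri: string): boolean {
  return method === "GET" && fastNotFoundPathsRegex.test(uri);
}

// Illustrative requests: [method, uri].
const sampleRequests: Array<[string, string]> = [
  ["GET", "/fit-in/120x120/"], // no image key: short-circuited
  ["GET", "/fit-in/120x120"], // trailing slash is optional in the pattern
  ["GET", "/fit-in/120x120/photo.jpg"], // has an image key: forwarded to the origin
  ["HEAD", "/fit-in/120x120/"], // only GET is short-circuited
];

for (const [method, uri] of sampleRequests) {
  console.log(method, uri, isFastNotFound(method, uri));
}
```

Only requests whose path ends right after the dimensions are answered at the edge; anything with an image key still reaches the origin.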

fvsnippets commented 2 years ago

As a picture is worth a thousand words, these are my results (invocations of the backend Lambda) after applying all of these changes at the same time. Sorry, that is what I did, so I cannot show their effects one at a time.
That is:

[image: cloudFrontCacheHit]

The initial peak is attributable to the old cache entries becoming unusable after the cache key composition was modified.

But @ddonahue99's proposal of caching converted images on S3 would be a very important improvement, because CloudFront's caches (I understand that this applies to POPs, regional caches, and Origin Shield) will discard less popular objects (please read this).

ddonahue99 commented 2 years ago

@fvsnippets It looks like you made a really meaningful dent, nice work and thank you for sharing all of your findings! I'm going to have to investigate the Accept-Encoding and fallback image tweaks on my end as well. It's been a while since we've revisited the configuration, but our hit ratio is still hovering around the low-to-mid 90s, so there's some more room for improvement.

If the AWS team is open to allowing for permanent caching in S3, I still agree that would have the biggest impact over the long-term. This solution is not the most efficient as-is for performance/cost at scale.

asgerjensen commented 2 years ago

Maybe I'm wrong, but the main issue with serving straight from S3 in CloudFront is how to map the CloudFront cache key to a filename, especially when using things like AUTO_WEBP (and especially once AUTO_WEBP also does AUTO_AVIF ;)), without adding the runtime cost (time and money) of another Lambda@Edge call.

I suppose it could be dealt with by the image handler: once it has fully resolved all parameters, and immediately before loading the image from the SOURCE BUCKET and performing the operations, it would check a CACHE_BUCKET.

If the result is present, return it as if it had been through the entire process; if not, proceed and store the output in the CACHE_BUCKET.

It does mean the flow will not be CDN => CACHE_S3? => API-GW, but instead CDN => API-GW => CACHE_S3, so you won't save on the API Gateway calls, but you /will/ save on customer wait time for items that have already been processed once.
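
The check-then-store flow described above could look roughly like the following sketch. Everything here is a stand-in: a `Map` plays the role of the CACHE_BUCKET, `processImage` fakes the expensive transformation, and the `ImageRequest` shape and key layout are assumptions rather than the solution's real types.

```typescript
// Hypothetical shape of a fully resolved image request.
type ImageRequest = {
  key: string;
  edits: Record<string, string | number>;
  outputFormat: string;
};

const cacheBucket = new Map<string, string>(); // stands in for CACHE_BUCKET
let processCount = 0; // how many times the expensive path ran

// Deterministic key: the same resolved parameters give the same key,
// regardless of the order in which the edits were specified.
function cacheObjectKey(req: ImageRequest): string {
  const edits = Object.keys(req.edits)
    .sort()
    .map((k) => `${k}=${req.edits[k]}`)
    .join(",");
  return `${req.outputFormat}/${edits}/${req.key}`;
}

// Stand-in for loading the source image and running the transformations.
function processImage(req: ImageRequest): string {
  processCount += 1;
  return `processed:${cacheObjectKey(req)}`;
}

function handleRequest(req: ImageRequest): string {
  const objectKey = cacheObjectKey(req);
  const cached = cacheBucket.get(objectKey);
  if (cached !== undefined) return cached; // already processed once
  const output = processImage(req);
  cacheBucket.set(objectKey, output); // store for later requests
  return output;
}

const firstWebp = handleRequest({ key: "photos/cat.png", edits: { width: 800, height: 800 }, outputFormat: "webp" });
const secondWebp = handleRequest({ key: "photos/cat.png", edits: { height: 800, width: 800 }, outputFormat: "webp" });
const avifResult = handleRequest({ key: "photos/cat.png", edits: { width: 800, height: 800 }, outputFormat: "avif" });
console.log(firstWebp === secondWebp, processCount);
```

Because the output format is part of the key, an AUTO_WEBP-style decision has to be resolved before the lookup, which is why the check sits inside the handler rather than in front of it.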

fvsnippets commented 2 years ago

Maybe it could be enabled only under certain circumstances (AUTO_WEBP not enabled, etc.) and only for certain paths (e.g. Thumbor resize URLs). I understand that CloudFront allows the latter by using origin groups (but I haven't read enough about, or had enough experience with, that topic to tell for sure).

asgerjensen commented 2 years ago

I think my main concern with not storing already processed items is that, if I upload nice and juicy 10 MB PNGs as source images, it takes 5-10 seconds to turn one into an AVIF (after bumping sharp to .30 and adding it as a valid format), which is not going to be a smooth experience for the end user.

Honestly, I have no idea what number of cache evictions I would be looking at under normal circumstances (I just started playing with this lib), but my site does have a few hundred thousand images, and with 8 size variants for each, in 3 potential formats (avif, webp, jpg), it does add up, especially if it also adds a cache key per accept-header variant (which for /some/ Internet Explorer/Edge variants seems to include every Office program installed).

If anyone has/is willing to share some experience on this, that would be great.

I was wondering if maybe a cloudfront function could be used to “normalize” the accept header into, maybe, only the optimal image/ prefix the client can understand, and use that as the cache key? (although that might break hmac validation?)

asgerjensen commented 2 years ago

For what it's worth, I tried adding this to the backend-end-construct.ts


    // Assumes Function, FunctionCode and FunctionEventType are imported from the
    // CDK CloudFront module (for example 'aws-cdk-lib/aws-cloudfront' in CDK v2).

    // Add a CloudFront Function to normalize the accept header
    const normalizeAcceptHeaderFunction = new Function(this, 'Function', {
      functionName: `normalize-accept-headers-${Aws.REGION}`,
      code: FunctionCode.fromInline(`
        function handler(event) {
          if (event.request.headers && event.request.headers.accept && event.request.headers.accept.value) {
            // Collapse the header to the single best image format the client accepts
            var resultingHeader = "image/jpg";
            var acceptheadervalue = event.request.headers.accept.value;
            if (acceptheadervalue.indexOf('image/avif') > -1) {
              resultingHeader = 'image/avif';
            } else if (acceptheadervalue.indexOf('image/webp') > -1) {
              resultingHeader = 'image/webp';
            }
            event.request.headers.accept = { value: resultingHeader };
          }
          return event.request;
        }
      `),
    });

and wired it up further down

   const cloudFrontDistributionProps: DistributionProps = {
      comment: 'Image Handler Distribution for Serverless Image Handler',
      defaultBehavior: {
        origin: origin,
        compress: false,
        allowedMethods: AllowedMethods.ALLOW_GET_HEAD,
        viewerProtocolPolicy: ViewerProtocolPolicy.HTTPS_ONLY,
        originRequestPolicy: originRequestPolicy,
        cachePolicy: cachePolicy,
        functionAssociations: [{
          function: normalizeAcceptHeaderFunction,
          eventType: FunctionEventType.VIEWER_REQUEST,
        }]

And it does seem to work, for the AutoWebP scenario, where you just want to return the best possible representation the client can consume.

I.e.

curl -H "accept: image/webp,image/gif" https://xxx.cloudfront.net/fit-in/800x800/sample-10mb.png -vvv --output /dev/null

gives a cache miss on first access (and hits afterwards) but

curl -H "accept: image/webp,image/jpg,image/*" https://xxx.cloudfront.net/fit-in/800x800/sample-10mb.png -vvv --output /dev/null

gives a cache-hit because the accept header is rewritten to just image/webp

Now, I realize this will probably conflict with other features and with requests for specific formats. I.e., if you explicitly ask for a jpg in the transformations, it would be cached with the image/webp accept header, but... I suppose it will still actually RETURN content type image/jpg, and the filename/path part will already make it unique for requests that ask for transformation to jpg. Unsure if this is a problem, really...
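
The two curl requests above can also be checked without deploying anything. The sketch below restates the normalization logic as a plain function; `normalizeAccept` is a hypothetical name, and the code runs outside CloudFront, so it only demonstrates the string handling, not the caching itself.

```typescript
// Pure-function version of the normalization logic, for local experimentation only.
function normalizeAccept(accept: string): string {
  if (accept.indexOf("image/avif") > -1) return "image/avif";
  if (accept.indexOf("image/webp") > -1) return "image/webp";
  return "image/jpg"; // same fallback value as the snippet above
}

// Accept values from the two curl commands, plus the Chrome header quoted earlier in the thread.
const firstCurl = normalizeAccept("image/webp,image/gif");
const secondCurl = normalizeAccept("image/webp,image/jpg,image/*");
const chromeLike = normalizeAccept("image/avif,image/webp,image/apng,image/svg+xml,image/*,*/*;q=0.8");

// Both curl requests normalize to the same value, so they share one cache entry.
console.log(firstCurl === secondCurl);
console.log(chromeLike);
```

With this normalization there are at most three accept values in the cache key (avif, webp, and the fallback), instead of one per browser variant.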

Vadorequest commented 2 years ago

It would be super nice to have a comprehensive guide for people who are just getting started with "improving cache hit ratio". I can see that several improvements are mentioned above, but I'm not sure how they should translate into "configuration updates". Could someone clarify if/what should be done?

dougtoppin commented 2 years ago

We will evaluate adding to the Implementation Guide some information on this subject.

github-actions[bot] commented 1 year ago

This issue has not received a response in a while. If you want to keep this issue open, please leave a comment below and auto-close will be canceled.

karensg commented 1 year ago

Hi AWS team,

I am bringing this task to your attention as I think improving the cache hit ratio is an absolute must. This task has been open for two years already and no steps have been taken to improve it. We use this image handler on 50+ websites and are running up high costs because of this. In this task I have read about a lot of improvements, from small to big, and there are even many PRs ready to be checked, like this one. Could you please prioritize this?

simonkrol commented 8 months ago

Hi Folks, As an update here, we've been looking to implement some of the improvements that have been found surrounding the cache hit ratio. Here are the statuses of the improvements @fvsnippets mentioned in this comment

Planned

Potential for future

Not Planned

Thanks for your interest in SIH, Simon

wonathanjong commented 4 months ago

hi everyone! I started working on S3 caching today using a hash of image request info as an additional key. It works by checking s3 in the lambda function before performing processing.

here's the basic approach: https://github.com/wonathanjong/sls-img-cache

let me know what y'all think :)

wonathanjong commented 4 months ago

Just made edits to forked repo