aws-solutions / serverless-image-handler

A solution to dynamically handle images on the fly, utilizing SharpJS
Apache License 2.0

Intermittent 500 errors after moving AWS accounts and Serverless Image Handler version #375

Closed amcfarlane closed 2 years ago

amcfarlane commented 2 years ago

Describe the issue

We have transferred our site to a new AWS account, which meant moving all the images into a new bucket and setting up Serverless Image Handler through CloudFormation. I think we probably also upgraded from v5.2.0 to v6.0.0. When we load the website now, a varying number of images (within 'img' tags) return 500 errors. The number of failing images seems to drop over time; for example, if 10 images on the page are broken today, it could be 5 tomorrow.

If you copy the request for one of these failing images, you get the following:

ERRORED IMAGE

curl 'https://img2.picle.io/eyJlZGl0cyI6eyJyb3RhdGUiOm51bGwsInJlc2l6ZSI6eyJ3aWR0aCI6NjAwLCJoZWlnaHQiOjQ1MCwiZml0IjoiY292ZXIifX0sImJ1Y2tldCI6InByb2QuaW1nMi5waWNsZS5pbyIsImtleSI6ImltYWdlc1wvWW5QcDFLMkpcLzI3ZTAxOWVlLTMwY2ItNDU0ZS05ODAxLWZhMzJlOTdkYTE1OSJ9' \
  -H 'authority: img2.picle.io' \
  -H 'accept: image/avif,image/webp,image/apng,image/svg+xml,image/*,*/*;q=0.8' \
  -H 'accept-language: en-US,en;q=0.9,nb;q=0.8,it;q=0.7,la;q=0.6' \
  -H 'cache-control: no-cache' \
  -H 'pragma: no-cache' \
  -H 'referer: https://picle.io/' \
  -H 'sec-ch-ua: " Not A;Brand";v="99", "Chromium";v="102", "Google Chrome";v="102"' \
  -H 'sec-ch-ua-mobile: ?0' \
  -H 'sec-ch-ua-platform: "macOS"' \
  -H 'sec-fetch-dest: image' \
  -H 'sec-fetch-mode: no-cors' \
  -H 'sec-fetch-site: same-site' \
  -H 'user-agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/102.0.0.0 Safari/537.36' \
  --compressed

This returns:

{"message": "Internal server error"}

But if we load the same image URL directly within a browser we get a correct 200 image:

curl 'https://img2.picle.io/eyJlZGl0cyI6eyJyb3RhdGUiOm51bGwsInJlc2l6ZSI6eyJ3aWR0aCI6NjAwLCJoZWlnaHQiOjQ1MCwiZml0IjoiY292ZXIifX0sImJ1Y2tldCI6InByb2QuaW1nMi5waWNsZS5pbyIsImtleSI6ImltYWdlc1wvWW5QcDFLMkpcLzI3ZTAxOWVlLTMwY2ItNDU0ZS05ODAxLWZhMzJlOTdkYTE1OSJ9' \
  -H 'authority: img2.picle.io' \
  -H 'accept: text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9' \
  -H 'accept-language: en-US,en;q=0.9,nb;q=0.8,it;q=0.7,la;q=0.6' \
  -H 'cache-control: no-cache' \
  -H 'pragma: no-cache' \
  -H 'sec-ch-ua: " Not A;Brand";v="99", "Chromium";v="102", "Google Chrome";v="102"' \
  -H 'sec-ch-ua-mobile: ?0' \
  -H 'sec-ch-ua-platform: "macOS"' \
  -H 'sec-fetch-dest: document' \
  -H 'sec-fetch-mode: navigate' \
  -H 'sec-fetch-site: none' \
  -H 'sec-fetch-user: ?1' \
  -H 'upgrade-insecure-requests: 1' \
  -H 'user-agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/102.0.0.0 Safari/537.36' \
  --compressed

Note: after a while, both of the above will probably start working. I'm not sure if it's an import issue; it could take a while to re-process all the images again.
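For readers following along, the long path segment in both requests is a base64-encoded JSON object naming the bucket, the key, and the edits to apply. The following is a minimal Python sketch for decoding it; the function name `decode_image_request` is hypothetical and not part of the solution.

```python
import base64
import json


def decode_image_request(path_segment: str) -> dict:
    """Decode a base64-encoded JSON image request taken from the URL path."""
    # Restore any '=' padding that may have been stripped from the URL.
    padded = path_segment + "=" * (-len(path_segment) % 4)
    return json.loads(base64.b64decode(padded))


# Path segment copied from the failing request above.
segment = "eyJlZGl0cyI6eyJyb3RhdGUiOm51bGwsInJlc2l6ZSI6eyJ3aWR0aCI6NjAwLCJoZWlnaHQiOjQ1MCwiZml0IjoiY292ZXIifX0sImJ1Y2tldCI6InByb2QuaW1nMi5waWNsZS5pbyIsImtleSI6ImltYWdlc1wvWW5QcDFLMkpcLzI3ZTAxOWVlLTMwY2ItNDU0ZS05ODAxLWZhMzJlOTdkYTE1OSJ9"

request = decode_image_request(segment)
print(json.dumps(request, indent=2))
```

Decoding shows a 600x450 `cover` resize of an object in the `prod.img2.picle.io` bucket, and the key has no file extension.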

Version: v6.0.0
Region: us-east-1
Was the solution modified from the version published on this repository? No.

fisenkodv commented 2 years ago

@amcfarlane thank you for reporting the issue. Do you have any caching in place? It seems that after some period of time, once the cache is invalidated, it starts working.

amcfarlane commented 2 years ago

@fisenkodv thanks for responding. I’m not sure about the cache. The whole setup, apart from the existing S3 images, is standard. We didn’t add any extra caching on top of the default Serverless Image Handler setup.

fisenkodv commented 2 years ago

I suppose you use Route 53, since your domain is img2.picle.io. It might already point to the new deployment, and if the DNS TTL hasn't been reached yet you could still see the error. This is only my assumption.

amcfarlane commented 2 years ago

I was thinking something similar. But all images are loading from img2.picle.io and only 20% are failing. I wouldn't expect the browser to use different DNS routes within a single page load, but I might be wrong.

fisenkodv commented 2 years ago

@amcfarlane do you still see the issue? Have you tried opening the links in different browsers, and have you tried clearing the DNS cache?

amcfarlane commented 2 years ago

@fisenkodv thanks for your ongoing support with this. Unfortunately we have a (hopefully temporary) issue with our entire AWS account at the moment, which is making debugging impossible. Before that, we were experiencing the issue on other browsers and devices as well. I can debug more once we are back up and running.

amcfarlane commented 2 years ago

@fisenkodv the AWS issues have been resolved and the site is back up again. Still the same issue, I'm afraid. The DNS theory seems odd, as we changed the domain from img.picle.io to img2.picle.io for the transfer, so it's a new domain. You can see the intermittent missing images on this page:

https://picle.io/games/ghost-of-tsushima

It feels like we need to "warm the server up" by processing all the images slowly after the transfer. It feels like re-loading all the images at once overloads a server, and then the 500 error is cached for a while. But that's obviously wrong, as it's a serverless setup...

fisenkodv commented 2 years ago

@amcfarlane I've opened the link in Google Chrome and I can see the images; the same in Safari. In Firefox I don't see all the images, but if I open an image link directly in Firefox it works fine. I think it might be browser-specific. Also, do you have any CORS settings?

amcfarlane commented 2 years ago

@fisenkodv The CORS settings are below. I still see images failing intermittently in Chrome.

[Screenshot 2022-06-28 at 20 49 52]

fisenkodv commented 2 years ago

@amcfarlane what I've noticed is that when I open https://picle.io/games/ghost-of-tsushima in Firefox, some images don't load, e.g. https://img2.picle.io/eyJlZGl0cyI6eyJyb3RhdGUiOm51bGwsInJlc2l6ZSI6eyJ3aWR0aCI6ODAwLCJoZWlnaHQiOjgwMCwiZml0IjoiY292ZXIifX0sImJ1Y2tldCI6InByb2QuaW1nMi5waWNsZS5pbyIsImtleSI6ImVudHJpZXNcLzVMN3JwSkthXC9lODM0YWNkNS1lZjkxLTQ2YTQtYTdkNS01YTU5M2YwZjg2NzYifQ== and the browser's network tab shows a 500 error (if you open that link now, you will be able to see the image; please keep reading). For that image, Firefox sends:

curl --location --request GET 'https://img2.picle.io/eyJlZGl0cyI6eyJyb3RhdGUiOm51bGwsInJlc2l6ZSI6eyJ3aWR0aCI6ODAwLCJoZWlnaHQiOjgwMCwiZml0IjoiY292ZXIifX0sImJ1Y2tldCI6InByb2QuaW1nMi5waWNsZS5pbyIsImtleSI6ImVudHJpZXNcLzVMN3JwSkthXC9lODM0YWNkNS1lZjkxLTQ2YTQtYTdkNS01YTU5M2YwZjg2NzYifQ==' \
--header 'Host: img2.picle.io' \
--header 'User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10.15; rv:91.0) Gecko/20100101 Firefox/91.0' \
--header 'Accept: image/webp,*/*' \
--header 'Accept-Language: en-US,en;q=0.5' \
--header 'Accept-Encoding: gzip, deflate, br' \
--header 'Connection: keep-alive' \
--header 'Referer: https://picle.io/' \
--header 'Cookie: _rdt_uuid=1656342238487.792d85fd-b5ab-483a-8a17-8225ac9015de; _tt_enable_cookie=1; _ttp=94baabd5-cd11-45cb-b1d6-6e2e290e1744; _ga_SZGQSD8XXQ=GS1.1.1656342238.1.1.1656342476.0; _ga=GA1.1.2070033236.1656342239; _fbp=fb.1.1656342239846.1350367669; __stripe_mid=0379a393-9b1d-4924-9578-c9e9d120bceaec6ccd' \
--header 'Sec-Fetch-Dest: image' \
--header 'Sec-Fetch-Mode: no-cors' \
--header 'Sec-Fetch-Site: same-site' \
--header 'Pragma: no-cache' \
--header 'Cache-Control: no-cache' \
--header 'TE: trailers'

Then I opened the same page in Chrome and was able to see that image. When I compared the requests, I noticed that Chrome sends a slightly different Accept header, namely image/avif,image/webp,image/apng,image/svg+xml,image/*,*/*;q=0.8 instead of image/webp,*/*. The image that Chrome displayed is a JPEG, not WebP. When I refreshed the page in Firefox, I saw that image as well, since by then it had been cached by CloudFront. That explains why you see the images after some period of time.

Taking the above into account, could you please double-check why Firefox sends that value in the Accept header?

amcfarlane commented 2 years ago

@fisenkodv I'm also seeing the issue in Chrome. In my original post I pointed out a similar difference in the Accept headers.

500 Error -

-H 'accept: image/avif,image/webp,image/apng,image/svg+xml,image/*,*/*;q=0.8' \

200 Success -

-H 'accept: text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9' \

But my issue here is that these images are within a simple HTML img tag. Doesn't the browser decide which headers it sends for images? I'm not sure I can change them, or am I missing something?
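To make the comparison concrete, here is a small Python sketch that lists the media types appearing in only one of the two Accept headers quoted above. The helper name `media_types` is hypothetical. If the CloudFront cache key includes the Accept header (an assumption, not confirmed in this thread), the two requests could be cached as separate entries, which might explain different results for the same URL.

```python
def media_types(accept_header: str) -> set[str]:
    """Return the media types in an Accept header, ignoring parameters such as q."""
    return {
        part.split(";")[0].strip()
        for part in accept_header.split(",")
        if part.strip()
    }


# Header sent for the image inside the page (returned 500).
img_tag_accept = "image/avif,image/webp,image/apng,image/svg+xml,image/*,*/*;q=0.8"
# Header sent when opening the image URL directly (returned 200).
direct_accept = (
    "text/html,application/xhtml+xml,application/xml;q=0.9,"
    "image/avif,image/webp,image/apng,*/*;q=0.8,"
    "application/signed-exchange;v=b3;q=0.9"
)

only_img_tag = sorted(media_types(img_tag_accept) - media_types(direct_accept))
only_direct = sorted(media_types(direct_accept) - media_types(img_tag_accept))
print("only in img-tag request:", only_img_tag)
print("only in direct request:", only_direct)
```

Both headers advertise `image/avif` and `image/webp`, so the difference in image format support between the two requests is small.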

fisenkodv commented 2 years ago

@amcfarlane could you please check the following:

My concern is that even when I am unable to see an image on the page, I can see it if I open the image link (the image's src attribute) directly.

amcfarlane commented 2 years ago

@fisenkodv I thought that might be the issue as well, as we don't have file extensions. But as the image loads eventually it makes me think this isn't the issue. But I'll check for those errors now. Also, this wasn't present on our older v5.2.0 setup.

After a long search I found the following within the API-Gateway-Execution-Logs/image logs, which looks like it might fit the problem. There are roughly as many of these errors as there are images returning 500, and the returned 500 status is the same.

Lambda invocation failed with status: 429.

Execution failed due to configuration error: Rate Exceeded.

Method completed with status: 500

I'm pretty new to these services and CloudWatch, so it took me a while to find. My apologies.
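As a rough aid for spotting the same pattern, the sketch below counts the three messages quoted above in a list of exported log lines. It works on plain strings only; the function name, the pattern labels, and the final sample line (a 200 status) are hypothetical and added purely for illustration.

```python
import re
from collections import Counter

# Messages taken from the API Gateway execution log excerpt above.
THROTTLE_PATTERNS = {
    "lambda_429": re.compile(r"Lambda invocation failed with status: 429"),
    "rate_exceeded": re.compile(r"Rate Exceeded", re.IGNORECASE),
    "method_500": re.compile(r"Method completed with status: 500"),
}


def count_throttle_signals(log_lines):
    """Count how many lines match each throttling-related message."""
    counts = Counter()
    for line in log_lines:
        for name, pattern in THROTTLE_PATTERNS.items():
            if pattern.search(line):
                counts[name] += 1
    return counts


sample = [
    "Lambda invocation failed with status: 429.",
    "Execution failed due to configuration error: Rate Exceeded.",
    "Method completed with status: 500",
    "Method completed with status: 200",  # hypothetical healthy request
]
counts = count_throttle_signals(sample)
print(dict(counts))
```

A `lambda_429` count close to the `method_500` count would be consistent with the 500 responses coming from Lambda throttling rather than from image processing.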

amcfarlane commented 2 years ago

@fisenkodv Thanks again for your help with this. Finally a solution. 🎉

Our old AWS account had the default quota of 1000 concurrent Lambda executions. The new account had 10. 😅 So on pages with a large number of images, once the 10 concurrent executions were reached we got the 500 error, which was then cached for a period of time by CloudFront. So we were effectively rendering only 10 (probably more) images on each view of the page.

Asking AWS to increase our concurrent Lambda executions to 1000 has fixed the issue. Once all the new images are cached again this won't be an issue.

It might be helpful to add a note about this to the Serverless Image Handler Implementation Guide. Does anyone know who I can contact about that?
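For anyone wanting to check their own account, here is a hedged Python sketch that reads the account-level concurrency limit from the Lambda `GetAccountSettings` API via boto3. The helper names and the 1000 threshold are assumptions for illustration; the sample dictionary only mimics the part of the response used here, with the value 10 taken from the comment above.

```python
def fetch_account_settings(region: str = "us-east-1") -> dict:
    """Call Lambda GetAccountSettings (needs boto3 and AWS credentials)."""
    import boto3  # imported lazily so the helpers below work without it

    return boto3.client("lambda", region_name=region).get_account_settings()


def concurrency_limits(account_settings: dict) -> tuple[int, int]:
    """Return (total, unreserved) concurrent executions for the account."""
    limit = account_settings["AccountLimit"]
    return limit["ConcurrentExecutions"], limit["UnreservedConcurrentExecutions"]


def is_below_expected(total: int, expected: int = 1000) -> bool:
    """Flag a limit below the value the old account had (1000, per the thread)."""
    return total < expected


# Illustrative response fragment, not real output from any account.
sample_settings = {
    "AccountLimit": {
        "ConcurrentExecutions": 10,
        "UnreservedConcurrentExecutions": 10,
    }
}
total, unreserved = concurrency_limits(sample_settings)
print(f"total={total} unreserved={unreserved} low={is_below_expected(total)}")
```

With credentials configured, `concurrency_limits(fetch_account_settings())` would report the live values for the chosen region.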

fisenkodv commented 2 years ago

@amcfarlane thank you for the update! Also, wanted to share this link https://aws.amazon.com/premiumsupport/knowledge-center/lambda-troubleshoot-throttling/ which could be useful. We will add updating the Serverless Image Handler Implementation Guide to our backlog. Please feel free to close the issue.

iuliuvisovan commented 2 years ago

Hey, how did you request an increase of the concurrent Lambda executions quota? Is there an interface for this? Can't seem to find it. Thanks.

fisenkodv commented 2 years ago

@iuliuvisovan, there is an interface for that. Please refer to https://docs.aws.amazon.com/servicequotas/latest/userguide/request-quota-increase.html.
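Besides the console flow in the linked guide, the same request can be made through the Service Quotas API. The sketch below is one possible way to do it with boto3, not an official procedure: it looks up the Lambda quota by name and then requests an increase. The quota name "Concurrent executions" and the helper names are assumptions, so verify them against the output of `list_service_quotas` before relying on this. The quota codes in the sample list are placeholders.

```python
def find_quota(quotas: list[dict], name: str = "Concurrent executions") -> dict:
    """Return the quota entry whose QuotaName matches `name` (assumed name)."""
    for quota in quotas:
        if quota["QuotaName"] == name:
            return quota
    raise LookupError(f"no quota named {name!r}")


def request_concurrency_increase(desired: float = 1000, region: str = "us-east-1") -> dict:
    """Request a higher Lambda concurrency quota (needs boto3 and credentials)."""
    import boto3  # imported lazily; not needed for find_quota

    client = boto3.client("service-quotas", region_name=region)
    quotas = []
    paginator = client.get_paginator("list_service_quotas")
    for page in paginator.paginate(ServiceCode="lambda"):
        quotas.extend(page["Quotas"])
    quota = find_quota(quotas)
    return client.request_service_quota_increase(
        ServiceCode="lambda",
        QuotaCode=quota["QuotaCode"],
        DesiredValue=desired,
    )


# Placeholder entries shaped like list_service_quotas results.
sample_quotas = [
    {"QuotaName": "Function and layer storage", "QuotaCode": "L-EXAMPLE1", "Value": 75.0},
    {"QuotaName": "Concurrent executions", "QuotaCode": "L-EXAMPLE2", "Value": 10.0},
]
match = find_quota(sample_quotas)
print(match["QuotaCode"], match["Value"])
```

Quota increases are reviewed by AWS, so the new value may not take effect immediately after the request is submitted.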

dougtoppin commented 2 years ago

It appears that this is no longer a pending issue. Please re-open the issue if there is still a concern.