badges / shields

Concise, consistent, and legible badges in SVG and raster format
https://shields.io

Intermittent 502 responses from camo.githubusercontent.com #3874

Closed: adamjstone closed this issue 5 years ago

adamjstone commented 5 years ago

Are you experiencing an issue with...

:beetle: Description

Badges intermittently fail to load. Periodically, I am seeing 502 responses from camo.githubusercontent.com for random badges. The behavior usually resolves itself after one or two page refreshes, which leads me to think there are timeouts happening somewhere that eventually resolve themselves via caching.

:link: Link to the badge

No specific badges are implicated in this. I see it across numerous badges on numerous pages, including the GitHub README for the badges/shields repo itself. See the screen captures below.


(screenshots: shields-repo-readme, si-repo-readme, shields-response-info)

adamjstone commented 5 years ago

May or may not be related to:

https://github.com/badges/shields/issues/1245

paulmelnikow commented 5 years ago

I'm seeing these too. They happen frequently while deploys are happening, and off and on at other times. To fix the ones during deploys, we need to switch to a zero-downtime deployment. To fix them the rest of the time, I wonder if we need to add server capacity. We started adding a fourth server, but that got held up.

I wonder if migrating to Heroku and bumping up our concurrency would solve both problems.

paulmelnikow commented 5 years ago

I keep seeing these, even while deploys aren't running. I wonder if we should experiment with adding a single dyno on Heroku to see if more capacity solves the problem.

paulmelnikow commented 5 years ago

Using Cloudflare with Heroku requires a CNAME, so adding it as a fourth server to our existing round-robin DNS isn't an option.

We could run an experiment for several hours, though. We could set Heroku to four dynos, and point Cloudflare to that, and see how the reliability compares.

To use Heroku as a long-term option, I think there are three problems we'd need to solve:

  1. Decide who will have access to see the configuration and manage the deployments (needs to be more than one active maintainer, but probably should not be the entire core team) (#2577)
  2. Because the individual servers can't be reached externally, metrics will need to be generated on each server and sent to the metrics server. This SO post outlines two options: one using Pushgateway, and the other scraping more frequently and including the $DYNO variable in the metrics (see the sketch after this list).
  3. Decide whether we want to run servers in the EU region as well.
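
For reference, here's a minimal sketch of the Pushgateway option from item 2. This is not Shields' actual code; the Pushgateway URL, job name, and push interval are illustrative assumptions.

```js
// Hypothetical sketch of the "pushgateway" option: each dyno pushes its own
// metrics to a Pushgateway, which Prometheus then scrapes. The URL, job name,
// and interval below are made up for illustration.
const client = require('prom-client')

// Collect the default process metrics (event loop lag, memory, etc.).
client.collectDefaultMetrics()

const gateway = new client.Pushgateway('https://pushgateway.example.com')

// Tag each push with the dyno name so per-server series stay distinct.
setInterval(() => {
  gateway.pushAdd(
    { jobName: 'shields-server', groupings: { dyno: process.env.DYNO || 'unknown' } },
    // Newer prom-client versions return a promise here instead of taking a callback.
    err => {
      if (err) console.error('metrics push failed', err)
    }
  )
}, 15 * 1000)
```
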
paulmelnikow commented 5 years ago

Now that the Heroku deploy issues are resolved, should we plan an experiment for one day next week, to see if adding more capacity resolves this issue?

calebcartwright commented 5 years ago

> Now that the Heroku deploy issues are resolved, should we plan an experiment for one day next week, to see if adding more capacity resolves this issue?

I'm on board with that. Do we know what our measurement and/or success criteria would be for such an experiment? Obviously we could do some eyeball comparisons, but I think we'd ideally have some more quantifiable data to inform us.

I know we won't realistically be able to capture some of these for an experiment in the next couple of days, but I think some points of interest are:

McLive commented 5 years ago

I also see those broken shields with 502 responses a lot.

paulmelnikow commented 5 years ago

> • Avg. response times (some related discussion on this in Discord)
> • A geo-distributed view of response times (see if experience is notably worse in certain parts of the world)

These seem the most important – wonder if there is a tool we could use for this!

We've got Nodeping and Uptime Robot – maybe the data from Nodeping is sufficient for this?

chris48s commented 5 years ago

I don't have a login for uptimerobot, but do we get more detailed stats than what's on https://status.shields.io/ from uptimerobot if we log in? Do we have response times, etc? Could we set up some additional checks to give us more stats for the purpose of this experiment?

We do also have some stats on Grafana that could be useful to monitor:

https://metrics.shields.io/d/g_1B7zhik/prom-client-default-metrics?orgId=1&refresh=10s
https://metrics.shields.io/d/Bxu49QCmz/worldping-endpoint-summary?orgId=1&var-endpoint=img_shields_io&var-probe=All

Are the stats from nodeping exposed anywhere?

paulmelnikow commented 5 years ago

If you click one of the metrics you can get a response time graph: https://status.shields.io/779605524

Unfortunately we won't have Grafana stats for the trial, unless we tackle that first. Though maybe that would be a better way to get good comparisons.

Here are the public facing stats for NodePing:

https://nodeping.com/reports/status/PRPSX6LPW6
https://nodeping.com/reports/status/YBISBQB254

paulmelnikow commented 5 years ago

Ooph.

(screenshot: Screen Shot 2019-08-26 at 8:45:23 AM)

@RedSparr0w pointed out that the stats for NodePing are much different from Uptime Robot's. I think the main reason is that NodePing is configured with a 4-second timeout, which means responses that take longer than that count as failures.

matteocontrini commented 5 years ago

I saw this happening earlier today and opening the camo.githubusercontent.com URL in the browser returned a plain text page that said that the upstream returned 429.

EDIT: I'm actually unsure that the requests returned 502, but I'm certain about the 429 thing

paulmelnikow commented 5 years ago

Hmm, interesting.

I agree we should have some quantifiable statistics, but I don't want to hold this off too much longer. Can we try to get the stats teed up for Friday? If not, maybe Tuesday or Wednesday? (Monday is a U.S. holiday.)

calebcartwright commented 5 years ago

> Can we try to get the stats teed up for Friday?

I can set aside some time over the next couple days to help work towards that. I'm 👍 with trying something different, but I do think it's important to have clarity beforehand on how we will know whether the experiment was a success or failure.

PyvesB commented 5 years ago

Another thing we could temporarily do in the meantime would be to increase some of the default cacheLength values (the ones here). Better to have slightly less up-to-date badges than none at all. Any thoughts?
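
As a rough illustration of the trade-off (not Shields' actual implementation; the helper and header shape are assumptions), a longer cache length simply translates into a larger max-age on the badge response, so proxies such as GitHub's camo re-request a hot badge less often, at the cost of freshness:

```js
// Hypothetical helper: turn a cache length into response caching headers.
function cacheHeaders(cacheLengthSeconds) {
  return {
    'Cache-Control': `max-age=${cacheLengthSeconds}, s-maxage=${cacheLengthSeconds}`,
    Expires: new Date(Date.now() + cacheLengthSeconds * 1000).toUTCString(),
  }
}

// Bumping a default from 300s (5 min) to 900s (15 min) roughly cuts the number
// of upstream refreshes for a frequently viewed badge to a third.
console.log(cacheHeaders(900))
```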

paulmelnikow commented 5 years ago

This problem looks very similar to #1568. It was solved by adding caching, which reduced server load by 30%. I have suspected our capacity to be a bit on the low side for several months, but the process of adding a fourth server stalled several months ago. I suspect that continuing to tune the cache timeouts would help, at a cost to freshness.

We are experiencing a real problem which has a substantial negative impact on our users. On the one hand, it's useful to discover a metric that corresponds to the problem because it tells us what we need to monitor in the future in order to have a good result. It also helps us justify making the change permanently. On the other hand, I don't want to delay this process in order to design a perfect experiment. We know subjectively things are working poorly during peak times (weekday daytimes on the U.S. East Coast) and will have a subjective sense of what "better" is.

I think a good metric would be the onboard average response time: how long does it take from when we get a request until we start responding with a badge. Heroku provides this automatically, but we don't have a way to measure something similar in our current production environment.

We could add something similar with Prometheus, though we'd need to let it run for a day or so and track what it's doing when the badges are responding poorly. Ideally we'd also set up a way to send Prometheus metrics during the experiment so we can make a (mostly) apples-to-apples comparison.
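
A rough sketch of what such an onboard response-time metric could look like with prom-client, assuming an Express-style middleware hook (this is not Shields' actual server code; the metric name and buckets are made up):

```js
// Hypothetical middleware: record how long each request takes from arrival
// until the response has been sent, as a Prometheus histogram.
const client = require('prom-client')

const responseTime = new client.Histogram({
  name: 'badge_response_time_seconds',
  help: 'Time from receiving a badge request to finishing the response',
  buckets: [0.1, 0.25, 0.5, 1, 2, 3, 5],
})

function measureResponseTime(req, res, next) {
  const end = responseTime.startTimer()
  res.on('finish', () => end()) // observes elapsed seconds when the response is done
  next()
}
```
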

calebcartwright commented 5 years ago

> It also helps us justify making the change permanently

This was my main motivation.

> On the other hand, I don't want to delay this process in order to design a perfect experiment. We know subjectively things are working poorly during peak times (weekday daytimes on the U.S. East Coast) and will have a subjective sense of what "better" is

I'm 👍 on proceeding, and as mentioned previously can set aside some time to make myself available to assist. I'm not suggesting that we need to design a "perfect" experiment before starting either, just that we should state what the success criteria are as part of this process.

Though I'd prefer some machine-measurable/quantifiable success criteria, if we can do an eyeball comparison (and given the issues we've been having, I actually suspect we can, at least to a degree) then that works too; just as long as we know what we're looking for.

> We know subjectively things are working poorly during peak times (weekday daytimes on the U.S. East Coast) and will have a subjective sense of what "better" is.

SGTM!

> I think a good metric would be the onboard average response time: how long does it take from when we get a request until we start responding with a badge.

Just to clarify, we'd use this as part of the determination for using Heroku long-term, correct? Not as part of the initial experiment?

calebcartwright commented 5 years ago

Perhaps we can also outline the items that need to get done to launch the experiment, figure out who's tackling what, and then pick a window/timeframe based on that.

I assume at an oversimplified level we'll:

Presumably we'll also need someone who has access to flip the routing of traffic back to our current/existing env, in the event of any major issues on the Heroku side, to be available throughout the experiment window.

paulmelnikow commented 5 years ago

> This was my main motivation.

👌

I'd prefer to include a quantitative metric in the experiment, though in the interest of expediency perhaps we could do a qualitative experiment instead.

We have some examples, including the Shields repo, of GitHub readmes where the badges normally work, but at peak times, show multiple broken badges. The hypothesis is that running four dynos on Heroku will prevent this behavior from being observed. We could do that for 1–2 days and ask folks in this thread to monitor several repos which have exhibited the problem. Then we could revert back to the original hosting and verify that the problems return.

If that gives the expected result, then we could stay on Heroku permanently, and start monitoring our response-time metrics, looking for a correlation the next time problems with load start appearing.

Most of the above list is done; see https://shields-production.herokuapp.com/index.html, which has no metrics but is otherwise a compatible production env. I gave you access so you can see both the env vars and what can be monitored in the Heroku dashboard.

calebcartwright commented 5 years ago

> I gave you access so you can see both the env vars and what can be monitored in the Heroku dashboard

Awesome! I was wondering about the token/cred aspect of the env in Heroku, both for services like Wheelmap/SymphonyInsight/etc. as well as the GitHub token pool.

Excellent that we've already gotten that sorted 👍

calebcartwright commented 5 years ago

> We could do that for 1–2 days and ask folks in this thread to monitor several repos which have exhibited the problem

Would also be really great to get feedback from non-US based folks during this window since the Heroku env. is entirely US based, correct?

paulmelnikow commented 5 years ago

> Would also be really great to get feedback from non-US based folks during this window since the Heroku env. is entirely US based, correct?

Yes, the one that's set up now is entirely in the U.S.

paulmelnikow commented 5 years ago

I've added response-time metrics to this dashboard: https://metrics.shields.io/d/service-response-time/service-response-time?orgId=1&fullscreen&panelId=11

We're off peak time at this point, though the average response times are sometimes north of 1 second. It will be interesting to see what the chart looks like when badges are missing!

If you notice a readme containing multiple missing badges, please post a screenshot at the time it's happening so we can cross-reference with this metric. If we find a correlation between service response time and the downtime, that helps reinforce that this is the metric we need to watch.

calebcartwright commented 5 years ago

23:35 UTC (from central US) - 2019 September 3rd (screenshot)

PyvesB commented 5 years ago

9:17 UTC - 2019 September 4th

(screenshot: 2019-09-04 at 10:16:34)

PyvesB commented 5 years ago

13:26 UTC - 2019 September 4th

(screenshot: 2019-09-04 at 14:26:41)

paulmelnikow commented 5 years ago

Interesting and surprising patterns. At non-peak times, I'm seeing the average response time fluctuate wildly. It's frequently over 1 second, which I think would be a nice target to stay under, though we're not seeing massive numbers of requests over the 3-second cutoff, nor anything obvious in the graph during last night's outage @calebcartwright posted. On the other hand, this morning's outage that @PyvesB is seeing corresponds with a single server running way over 2 seconds.

The charts during low usage times suggest there is probably a capacity problem. But there are two more serious observations.

One is that requests are not being evenly distributed across the servers. There seem to be cycles. The server with all the high peaks, s0, is also showing the slowest responses; however, the cycle isn't correlated.

Also there's nothing in these two charts that really explains the drastic loss of badges on GitHub.

I have a new hypothesis about that. I suspect the 429 that was observed is Shields' onboard rate limiting getting triggered for the GitHub Camo IPs. That might explain the sharp loss and restoration of service that doesn't seem to be correlated with the response times in the way I expected. When one of the camo servers sends enough traffic to one server to trigger the rate limit, it will get 429s from that server and no badges until the rate limit resets. The fact that requests aren't being distributed evenly to the servers supports this – I think each camo instance, and every browser or proxy in general, likely resolves img.shields.io to a particular IP and then caches that for a little while.
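
To make the hypothesis concrete, here is a toy sketch of how a per-IP limiter like the one described could produce this behavior (this is not the actual logic in core/server/monitor.js; the window, threshold, and function names are made up):

```js
// Hypothetical per-IP rate limiter: because a camo instance caches a single
// resolved IP for img.shields.io, all of its traffic lands on one server,
// which can push that one IP over the threshold and return 429s (no badges)
// until the window resets.
const WINDOW_MS = 60 * 1000
const MAX_REQUESTS_PER_WINDOW = 1000 // made-up threshold
const counts = new Map()

setInterval(() => counts.clear(), WINDOW_MS) // reset the counting window

function shouldRateLimit(ip, excludeList) {
  if (excludeList.has(ip)) return false // excluded IPs (e.g. known camo ranges) are never limited
  const count = (counts.get(ip) || 0) + 1
  counts.set(ip, count)
  return count > MAX_REQUESTS_PER_WINDOW
}
```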

McLive commented 5 years ago

14:09 UTC - 2019 September 4th (screenshot)

paulmelnikow commented 5 years ago

We do have a block for what might be the IP addresses for camo, though I can't find a list anywhere.

I'm using this endpoint: https://github.com/badges/shields/blob/a052d485fa1fba9d97e4acfde30cbb6c91f0f1b5/core/server/monitor.js#L81-L87

to identify additional IPs owned by GitHub that are generating a lot of traffic, and I'll add them to the exclude list.

paulmelnikow commented 5 years ago

I've pushed this fix to production. 🤞 that this will fix the issue.

jcbcn commented 5 years ago

Shields are now loading for me without issues. Thanks

EDIT: Maybe that was a bit premature. I'm still experiencing 502s, but a lot less frequently.

adamjstone commented 5 years ago

Not seeing any issues on this end.

paulmelnikow commented 5 years ago

@jcbcn can you post screenshots when it happens?

jcbcn commented 5 years ago

@paulmelnikow Looks like I got unlucky and got a 504 on a Mozilla Observatory shield, which appears to be a one-off. I can't reproduce any 502s now. So all looks good 👍

paulmelnikow commented 5 years ago

Still seeing about 30% more traffic on s0 than s1 and s2, and peaks on the average response time graph for that server. So I think we need to add capacity, distribute our traffic more evenly, and perhaps consider a different solution for handling geographic distribution, like geographic load balancing.

paulmelnikow commented 5 years ago

Though there are other issues to address here, the main issue appears to be solved, so I'm going to close this for now. If this recurs after next week, please open a new issue. ❤️