May or may not be related to:
I'm seeing these too. They happen frequently while deploys are happening, and off and on at other times. To fix them during deployments, we need to switch to a zero-downtime deployment. To fix them the rest of the time, I wonder if we need to add server capacity. We started adding a fourth server, but that got held up.
I wonder if migrating to Heroku and bumping up our concurrency would solve both problems.
I keep seeing these, even while deploys aren't running. I wonder if we should experiment with adding a single dyno on Heroku to see if more capacity solves the problem.
Using Cloudflare with Heroku requires a CNAME, so adding Heroku as a fourth server in our existing round-robin DNS isn't an option.
We could run an experiment for several hours, though. We could set Heroku to four dynos, and point Cloudflare to that, and see how the reliability compares.
To use Heroku as a long-term option, I think there are three problems we'd need to solve:
- The `$DYNO` variable in the metrics.

Now that the Heroku deploy issues are resolved, should we plan an experiment for one day next week, to see if adding more capacity resolves this issue?
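For that metrics point, here's a minimal sketch of what tagging metrics with the dyno could look like, assuming the prom-client setup already behind metrics.shields.io (the label name is illustrative, not necessarily what we'd use):

```js
// Sketch: attach the Heroku dyno name ($DYNO is set by Heroku at runtime)
// as a default label so per-dyno behaviour shows up in Grafana.
const { register, collectDefaultMetrics } = require('prom-client')

register.setDefaultLabels({ dyno: process.env.DYNO || 'unknown' })
collectDefaultMetrics({ register })
```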
Now that the Heroku deploy issues are resolved, should we plan an experiment for one day next week, to see if adding more capacity resolves this issue?
I'm on board with that. Do we know what our measurement and/or success criteria would be for such an experiment? Obviously we could do some eyeball comparisons, but I think we'd ideally have some more quantifiable data to inform us.
I know we won't realistically be able to capture some of these for an experiment in the next couple of days, but here are some points of interest:
I also see those broken shields with 502 responses a lot.
- Avg. response times (some related discussion on this in Discord)
- A geo-distributed view of response times (see if experience is notably worse in certain parts of the world)
These seem the most important; I wonder if there is a tool we could use for this!
We've got NodePing and Uptime Robot; maybe the data from NodePing is sufficient for this?
I don't have a login for uptimerobot, but do we get more detailed stats than what's on https://status.shields.io/ from uptimerobot if we log in? Do we have response times, etc? Could we set up some additional checks to give us more stats for the purpose of this experiment?
We do also have some stats on grafana that could be useful to monitor: https://metrics.shields.io/d/g_1B7zhik/prom-client-default-metrics?orgId=1&refresh=10s https://metrics.shields.io/d/Bxu49QCmz/worldping-endpoint-summary?orgId=1&var-endpoint=img_shields_io&var-probe=All
Are the stats from nodeping exposed anywhere?
If you click one of the metrics you can get a response time graph: https://status.shields.io/779605524
Unfortunately we won't have Grafana stats for the trial, unless we tackle that first. Though maybe that would be a better way to get good comparisons.
Here are the public facing stats for NodePing:
https://nodeping.com/reports/status/PRPSX6LPW6 https://nodeping.com/reports/status/YBISBQB254
Ooph.
@RedSparr0w pointed out that the stats from NodePing are quite different from Uptime Robot's. I think the main reason is that NodePing is configured with a 4-second timeout, which means responses that take longer than that count as failures.
I saw this happening earlier today, and opening the camo.githubusercontent.com URL in the browser returned a plain-text page that said the upstream returned 429.
EDIT: I'm actually unsure that the requests returned 502, but I'm certain about the 429 thing
Hmm, interesting.
I agree we should have some quantifiable statistics, but I don't want to hold this off too much longer. Can we try to get the stats teed up for Friday? If not, maybe Tuesday or Wednesday? (Monday is a U.S. holiday.)
Can we try to get the stats teed up for Friday?
I can set aside some time over the next couple of days to help work towards that. I'm 👍 with trying something different, but I do think it's important to have clarity beforehand on how we will know whether the experiment was a success or failure.
Another thing we could do temporarily in the meantime would be to increase some of the default cacheLength values (the ones here). Better to have slightly less up-to-date badges than none at all. Any thoughts?
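For illustration only (this is not the actual Shields code, and the numbers are made up), bumping a default cacheLength essentially means serving badges with a longer max-age, so intermediaries like Camo can keep serving a cached badge instead of re-requesting it:

```js
// Hypothetical helper: translate a cacheLength (in seconds) into cache headers
// on a badge response. Raising the default trades freshness for fewer origin hits.
const DEFAULT_CACHE_SECONDS = 300 // e.g. raised from a smaller default

function setCacheHeaders(res, cacheLengthSeconds = DEFAULT_CACHE_SECONDS) {
  res.setHeader(
    'Cache-Control',
    `max-age=${cacheLengthSeconds}, s-maxage=${cacheLengthSeconds}`
  )
  res.setHeader(
    'Expires',
    new Date(Date.now() + cacheLengthSeconds * 1000).toUTCString()
  )
}

module.exports = { setCacheHeaders }
```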
This problem looks very similar to #1568. It was solved by adding caching, which reduced server load by 30%. I have suspected our capacity to be a bit on the low side for several months, but the process of adding a fourth server stalled several months ago. I suspect that continuing to tune the cache timeouts would help, at a cost to freshness.
We are experiencing a real problem which has a substantial negative impact on our users. On the one hand, it's useful to discover a metric that corresponds to the problem because it tells us what we need to monitor in the future in order to have a good result. It also helps us justify making the change permanently. On the other hand, I don't want to delay this process in order to design a perfect experiment. We know subjectively things are working poorly during peak times (weekday daytimes on the U.S. East Coast) and will have a subjective sense of what "better" is.
I think a good metric would be the onboard average response time: how long does it take from when we get a request until we start responding with a badge. Heroku provides this automatically, but we don't have a way to measure something similar in our current production environment.
We could add something similar to Prometheus, though we'd need to let it run for a day or so and track what it's doing when the badges are responding poorly. Ideally we'd also set up a way to send Prometheus metrics during the experiment so we can make a (mostly) apples-to-apples comparison.
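As a rough sketch of the kind of onboard measurement described above (the metric and label names here are illustrative, not the ones we'd necessarily use), a prom-client histogram could time each request from receipt until the badge response starts:

```js
// Measure time from receiving a request to starting the badge response.
const { Histogram } = require('prom-client')

const responseTime = new Histogram({
  name: 'badge_response_seconds', // illustrative name
  help: 'Time from request receipt to the start of the badge response',
  labelNames: ['category'],
  buckets: [0.1, 0.25, 0.5, 1, 2, 3, 5],
})

// Wrap a badge handler so every invocation is timed.
function timed(category, handler) {
  return async function (...args) {
    const end = responseTime.startTimer({ category })
    try {
      return await handler.apply(this, args)
    } finally {
      end()
    }
  }
}
```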
It also helps us justify making the change permanently
This was my main motivation.
On the other hand, I don't want to delay this process in order to design a perfect experiment. We know subjectively things are working poorly during peak times (weekday daytimes on the U.S. East Coast) and will have a subjective sense of what "better" is
I'm 👍 on proceeding, and as mentioned previously I can set aside some time to make myself available to assist. I'm not suggesting that we need to design a "perfect" experiment before starting either, just that we should state the success criteria as part of this process.
Though I'd prefer some machine-measurable/quantifiable success criteria, if we can do an eyeball comparison (and given the issues we've been having lately, I actually suspect we can, at least to a degree) then that works too; just as long as we know what we're looking for.
We know subjectively things are working poorly during peak times (weekday daytimes on the U.S. East Coast) and will have a subjective sense of what "better" is.
SGTM!
I think a good metric would be the onboard average response time: how long does it take from when we get a request until we start responding with a badge.
Just to clarify, we'd use this as part of the determination for using Heroku long-term, correct, not as part of the initial experiment?
Perhaps we can also outline the items that need to get done to launch the experiment, figure out who's tackling what, and then pick a window/timeframe based on that.
I assume at an oversimplified level we'll:
Presumably we'll also need someone with access to flip traffic routing back to our current/existing env to be available throughout the experiment window, in the event of any major issues on the Heroku side.
This was my main motivation.
👍
I'd prefer to include a quantitative metric in the experiment, though in the interest of expediency perhaps we could do a qualitative experiment instead.
We have some examples, including the Shields repo, of GitHub readmes where the badges normally work but, at peak times, show multiple broken badges. The hypothesis is that running four dynos on Heroku will prevent this behavior from being observed. We could do that for 1-2 days and ask folks in this thread to monitor several repos which have exhibited the problem. Then we could revert to the original hosting and verify that the problems return.
If that gives the expected result, then we could stay on Heroku permanently, and start monitoring our response-time metrics, looking for a correlation the next time problems with load start appearing.
Most of the above list is done; see https://shields-production.herokuapp.com/index.html, which has no metrics but is otherwise a compatible production env. I gave you access so you can see both the env vars and what can be monitored in the Heroku dashboard.
I gave you access so you can see both the env vars and what can be monitored in the Heroku dashboard
Awesome! I was wondering about the token/cred aspect of the env in Heroku, both for services like Wheelmap/SymphonyInsight/etc. as well as the GitHub token pool.
Excellent that we've already gotten that sorted 👍
We could do that for 1β2 days and ask folks in this thread to monitor several repos which have exhibited the problem
Would also be really great to get feedback from non-US based folks during this window since the Heroku env. is entirely US based, correct?
Would also be really great to get feedback from non-US based folks during this window since the Heroku env. is entirely US based, correct?
Yes, the one that's set up now is entirely in the U.S.
I've added response-time metrics to this dashboard: https://metrics.shields.io/d/service-response-time/service-response-time?orgId=1&fullscreen&panelId=11
We're off peak time at this point, though the average response times are sometimes north of 1 second. It will be interesting to see what the chart looks like when badges are missing!
If you notice a readme containing multiple missing badges, please post a screenshot at the time it's happening so we can cross-reference with this metric. If we find a correlation between service response time and the downtime that helps reinforce this is the metric we need to watch.
23:35 UTC (from central US) - 2019 September 3rd
9:17 UTC - 2019 September 4th
13:26 UTC - 2019 September 4th
Interesting and surprising patterns. At non-peak times, I'm seeing average response time fluctuating wildly. It's frequently over 1 second, which I think would be a nice target to stay under, though we're not seeing massive numbers of requests over the 3-second cutoff, nor anything obvious in the graph during last night's outage that @calebcartwright posted. On the other hand, this morning's outage that @PyvesB is seeing corresponds with a single server running way over 2 seconds.
The charts during low usage times suggest there is probably a capacity problem. But there are two more serious observations.
One is that requests are not being evenly distributed across the servers. There seem to be cycles. The server with all the high peaks (s0) is also showing the slowest responses; however, the cycle isn't correlated.
Also there's nothing in these two charts that really explains the drastic loss of badges on GitHub.
I have a new hypothesis about that. I suspect the 429 that was observed is Shields' onboard rate limiting getting triggered for the GitHub Camo IPs. That might explain the sharp loss and restoration of service that doesn't seem to be correlated with the response times in the way I expected. When one of the Camo servers sends enough traffic to one server to trigger the rate limit, it will get 429s from that server, and no badges, until the rate limit resets. That the requests aren't being distributed evenly to the servers supports this: I think each Camo instance, and every browser or proxy in general, likely resolves img.shields.io to a particular IP and then caches that for a little while.
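To make the hypothesis concrete, here's a simplified sketch of an IP-based limiter with an exclude list (this is not the actual monitor.js implementation; the window, threshold, and address are invented). A Camo proxy funnels many users' requests through one IP, so without an exclusion it trips the limit and receives 429s until the window resets:

```js
// Simplified per-IP rate limiter with an exclude list (illustrative only).
const WINDOW_MS = 60 * 1000 // hypothetical window
const MAX_REQUESTS_PER_IP = 1000 // hypothetical threshold
const excludedIps = new Set([
  '140.82.112.1', // hypothetical GitHub/Camo address added to the exclude list
])

const counters = new Map()
setInterval(() => counters.clear(), WINDOW_MS).unref()

function rateLimit(req, res, next) {
  const ip = req.socket.remoteAddress
  if (!excludedIps.has(ip)) {
    const count = (counters.get(ip) || 0) + 1
    counters.set(ip, count)
    if (count > MAX_REQUESTS_PER_IP) {
      res.statusCode = 429
      return res.end('Too many requests')
    }
  }
  next()
}
```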
14:09 UTC - 2019 September 4th
We do have a block for what might be the IP addresses for Camo, though I can't find a list anywhere.
I'm using this endpoint: https://github.com/badges/shields/blob/a052d485fa1fba9d97e4acfde30cbb6c91f0f1b5/core/server/monitor.js#L81-L87
to identify additional IPs owned by GitHub that are generating a lot of traffic, and I'll add them to the exclude list.
I've pushed this fix to production. 🤞 that this will have fixed the issue.
Shields are now loading for me without issues. Thanks
EDIT: Maybe that was a bit premature. I'm still experiencing 502s, but a lot less frequently.
Not seeing any issues on this end.
@jcbcn can you post screenshots when it happens?
@paulmelnikow Looks like I got unlucky and got a 504 on a Mozilla Observatory shield, which appears to be a one-off. I can't reproduce any 502s now. So all looks good 👍
Still seeing about 30% more traffic on s0 than on s1 and s2, and peaks on the average response time graph for that server. So I think we need to add capacity, distribute our traffic more evenly, and perhaps consider a different solution such as geographic load balancing.
Though there are other issues to address here, the main issue appears to be solved, so I'm going to close this for now. If this recurs after next week, please open a new issue. ❤️
Are you experiencing an issue with...
:beetle: Description
Badges intermittently fail to load. Periodically, I am seeing 502 responses from camo.githubusercontent.com for random badges. The behavior usually resolves itself after one or two page refreshes, which leads me to think there are timeouts happening somewhere that eventually resolve themselves via caching.
:link: Link to the badge
No specific badges are implicated in this. I see it across numerous badges on numerous pages, including the GitHub README for the badges/shields repo itself. See the screen captures below.