Closed AlexWayfer closed 6 years ago
Same for all my shields.io-driven badges - error 503 (timeout)
Just my luck, I switched to shields.io badges a few days ago...
Problem occurs on all servers and applies to dynamic and static badges. Response times increased. https://status.shields-server.com/
I'm getting 521 Web Server Is Down status codes for badges when I visit shields.io
Looks like the server is going down every few minutes
Is no-one there to give that server a kick?
As far as I know only @espadrine has access to servers.
Unfortunately not, I think @espadrine is the only person with access currently.
Here's a tweet I sent: https://twitter.com/Shields_io/status/942763063270412288
I tried mitigating the issue in a few ways, including passing CloudFlare in danger mode and rate-limiting per IP. I am gathering information to see how to best mitigate the issue.
I think things are better now. The volume of requests we receive is still very high, but the rate limiting protects legitimate users.
Also, I believe the DoS has stopped, about an hour after I started activating rate limiting. We are back to a 200 req/s average, from about 1000 that we had during the day.
(It is difficult to determine exactly who was the bad actor, because unsurprisingly most serious offenders are AWS servers, and some of those servers are legit. There are other smaller offenders, like the Moscow Youth Autonomous Non-Commercial Organisation Home Computer Network, but they aren't as big.)
I had to whitelist GitHub because otherwise it got all of its IPs banned one by one.
I relaxed the rate limit from 50 req/s to 100 req/s this morning, and the load shot from 200 req/s to 400 req/s. The servers can handle up to about 500 req/s, as the graph shows, so my guess is that whoever changed something on Monday did not manually turn it off; they simply respect the Retry-After header, which I indeed pushed to production at about the same time as the issues stopped.
A side-effect of the current rate limit is that the front page won't fully load (your IP will be banned after the first 100 badges show up on the page, for the remainder of the minute). I will experiment with whether we can afford to have an hourly limit instead, or whether we need a combination of both, at the end of my workday.
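For anyone curious how a limit like this behaves, here is a minimal sketch of a per-IP fixed-window limiter. It assumes 100 hits per IP per minute, as the front-page description above suggests, and it is written in Python for illustration only; it is not the code running on the Shields servers. The class name, the allowlist parameter, and the example IPs are hypothetical.

```python
import math


class FixedWindowLimiter:
    """Allow at most `limit` hits per key in each `window`-second window."""

    def __init__(self, limit=100, window=60, allowlist=()):
        self.limit = limit
        self.window = window
        # Keys that are never limited (for example, a trusted proxy).
        self.allowlist = set(allowlist)
        # key -> (window index, hits seen in that window)
        self.counts = {}

    def check(self, key, now):
        """Return (allowed, retry_after_seconds) for a hit by `key` at time `now`."""
        if key in self.allowlist:
            return True, 0
        index = int(now // self.window)
        seen_index, hits = self.counts.get(key, (index, 0))
        if seen_index != index:
            hits = 0  # a new window started, so the counter resets
        hits += 1
        self.counts[key] = (index, hits)
        if hits <= self.limit:
            return True, 0
        # Over the limit: rejected for the remainder of the window.
        # The remaining time is what a Retry-After header would carry.
        retry_after = math.ceil((index + 1) * self.window - now)
        return False, retry_after
```

With these defaults, a page that requests more than 100 badges from one IP within the same minute gets the rest rejected until the window rolls over, which matches the front-page side effect described above. An hourly limit would only change `window` to 3600.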
Again :scream:
Yep, I saw that. It is strange, because it is such a sudden increase.
After looking at the URLs associated with the AWS IPs I flag, I feel like it does not seem related to one given IP address, which probably means it is a large website that started automatically adding badges to its pages. The additional load definitely comes from the US, however.
So, instead of flagging IPs, I decided to flag badge types. I limited badge types similarly (with progressive tweaking); right now it is at a max of 300 hits every 600 seconds. Here is an example list of flagged badge types:
[
"npmv",
"npmdm",
"githubstars",
"wordpressplugin",
"githubrelease",
"githubforks",
"githubissues",
"wordpressv",
"pypiv",
"codecovc",
"nugetv",
"appveyorci",
"gitterroom",
"githublicense",
"npml",
"npmdt",
"badgehttp2",
"badgeipv6",
"twitterfollow",
"githubdownloads"
]
I don't know which one is suddenly more popular than it should be, but the badgehttp2 and the badgeipv6 are certainly surprising to me.
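As an illustration of limiting by badge type, here is a rough Python sketch of a sliding-window limiter keyed by strings like the ones in the list above. The 300 hits per 600 seconds figures come from the comment; the class itself, its names, and the choice of a sliding window (rather than whatever the production code does) are assumptions.

```python
from collections import defaultdict, deque


class SlidingWindowLimiter:
    """Allow at most `max_hits` accepted hits per key in any `period` seconds."""

    def __init__(self, max_hits=300, period=600):
        self.max_hits = max_hits
        self.period = period
        # key -> timestamps of the hits that were accepted, oldest first
        self.hits = defaultdict(deque)

    def allow(self, key, now):
        recent = self.hits[key]
        # Forget hits that have aged out of the window.
        while recent and recent[0] <= now - self.period:
            recent.popleft()
        if len(recent) >= self.max_hits:
            return False
        recent.append(now)
        return True


limiter = SlidingWindowLimiter()
# Each badge type has its own budget, so a flood of "npmv" requests
# does not affect "githubstars".
limiter.allow("npmv", now=0)
limiter.allow("githubstars", now=0)
```

Keying on the badge type instead of the IP means that one busy site embedding a single badge type gets throttled without banning the many IPs its visitors come from.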
Whatever is hitting our servers seems to have stopped again:
The start and stop are as sharp as they were yesterday, but at different hours. It is very puzzling to me.
We will probably need more investigation to survive tomorrow, when they start hitting us again.
@espadrine, thank you! Good luck!
Shields.io is a good service, and it's very sad that someone started to harm it.
Thanks for your work on this!
> After looking at the URLs associated with the AWS IPs I flag, I feel like it does not seem related to one given IP address, which probably means it is a large website that started automatically adding badges to its pages. The additional load definitely comes from the US, however.
Is there any useful info in the Referer header?
To everyone: Shields gets by with a tiny server and hosting budget. Most of the time this works okay but sometimes things like this happen! Your $10 goes a long way in helping us strengthen and toughen the service.
If you ❤️ Shields, please consider becoming a backer with a one-time $10 donation.
@paulmelnikow I'd love to contribute, and I will. But can anyone share a rough estimate of the cost involved to keep this running? Also, did I see correctly you are using a VPS? Would something more scalable in the cloud not be a more affordable solution?
@kbrandwijk The server costs are about $17/month with 3 servers; they'd be about $23/month if I added another server.
Cloud providers are a double-edged sword; they definitely can be cheap (although it's hard to beat a VPS given our requirements), but costs are hard to assess and can explode without us noticing.
There is another avenue that I am trying to explore: optimizing the code. Last time I optimized it, I switched the bottleneck to become text width computation, which typically takes 15-20ms. There is quite a bit of caching above it, but it is still the bottleneck, and intuitively there is no reason it cannot be cut down with a smarter algorithm.
What's your current bandwidth? I've had very good experiences with the global CDN + load balancing deployment that Zeit Now offers. Also, I'll have a look at the source code later on. Do you have any detailed metrics in the tests already?
> If you ❤️ Shields, please consider becoming a backer with a one-time $10 donation.
I would like to make $1–3 donations every month. Please, consider this possibility (via Patreon, for example).
@AlexWayfer It's possible via OpenCollective as well, shields just needs to define other options (Most projects have a $2/month option).
@kbrandwijk Thanks so much for your donation!
@AlexWayfer That would be great! You can choose monthly, and enter the amount you'd like on this page: https://opencollective.com/shields/donate
@paulmelnikow The donate page has a minimum of $10. That's why I mentioned adding some options yourselves.
I'm not able to set less than $10 using the stepper arrows, but I can type something less than $10. I don't know whether it's possible to make a donation of such an amount, though.
I don't think it allows less than $10 even if you type less, as it still says $10 at the bottom. Alternatively, you could do $12 yearly for essentially $1 monthly.
Ah, I gotcha. Sure! I added a $3/month option.
I started rate-limiting by referrer, as suggested by @paulmelnikow. Obviously, we are not currently being hit by whatever was pressuring us, and we are very clearly in the safe zone (until tomorrow morning?), but there is one notable (albeit small) referrer that keeps getting temporarily banned.
This Chrome extension seems to open a tab on a given URL, and that URL contains a handful of badges. I reached out to the author in an issue.
> Ah, I gotcha. Sure! I added a $3/month option.
@paulmelnikow, are you sure? The input has min="10" even for the monthly option.
@AlexWayfer it's here: https://opencollective.com/shields/
> @AlexWayfer it's here: https://opencollective.com/shields/
Oh, sorry. Thank you!
Small update: today we are seemingly not hit by the DDoS.
Here is the monthly look:
In terms of performance for text width computation, #1298 adds a cache layer that halves the number of width calculations on average for each generated badge. We should deploy it as soon as possible if we want to benefit from a small performance boost. I'll also look into #1379 in the coming days/weeks to see if things can be further improved. :wink:
Also, I noticed that only the text-width calculation for the left side is cached. Is that deliberate? Also, since the cache item size is so small, I would consider setting the max to a lot more than 1000.
@kbrandwijk : I'm not convinced a bigger cache size would change much. The left hand-side doesn't feature that many different values when you look at shields' homepage (probably around 100 different possibilities). That leaves a big margin for custom badges with unique new left hand-side keys, which probably represent a smaller number of users anyway. ^^
@PyvesB I have no idea why I proposed that. Your explanation makes complete sense...
> #1298 adds a cache layer
pdfkit added a word cache at the start of the year, which is in v0.8.2. We are currently on v0.8.3 according to the package-lock.json. I would expect those caches to overlap; did you notice a significant speedup on average across 10k requests?
@espadrine : I did not realise they had their own caching system; I was not expecting that from such a library. I did some quick testing at the time of the pull request and I did notice a speedup of close to 10% when repeatedly calling makeBadge (averaged over well above 10k iterations). Nevertheless, this was done in an environment very different from what shields is running on (different operating system, different Node version and an old laptop), so it would probably benefit from a closer look if you think the caches may overlap. 😉
I haven't actually looked into the pdfkit caching, but one issue with letting the library do the caching is that it won't discriminate between left hand-sides, which are very static and only have a small number of possibilities, and right hand-sides, which are very variable. In my small test, I did use random strings for the right hand-side texts, which may explain why adding this extra layer of cache helped, as it only caches what we know is likely to be requested again soon.
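To make the left-versus-right distinction concrete, here is a small Python sketch of the idea (Shields itself is Node, and this is not the code from #1298). `measure_text` is a hypothetical stand-in for the expensive font-based measurement; the 1000-entry bound is the cache size mentioned earlier in the thread.

```python
from functools import lru_cache


def measure_text(text):
    # Hypothetical stand-in for the real, expensive text measurement.
    # It only needs to be deterministic for the purpose of this sketch.
    return sum(7 if ch.isupper() else 6 for ch in text)


@lru_cache(maxsize=1000)
def left_width(text):
    # Left-hand labels such as "npm" or "build" repeat constantly,
    # so a small bounded cache catches most of them.
    return measure_text(text)


def badge_widths(left, right):
    # Right-hand values (versions, counts) vary too much to be worth caching,
    # so they are measured every time and never evict a useful label.
    return left_width(left), measure_text(right)


badge_widths("npm", "v1.0.0")
badge_widths("npm", "v1.0.1")
# Shows how many left-hand lookups were served from the cache.
print(left_width.cache_info())
```

Caching both sides in one shared cache would let highly variable right-hand strings push the small, stable set of labels out, which is the concern raised above about leaving all caching to the library.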
Let's continue the optimization discussion here! https://github.com/badges/shields/issues/1379#issuecomment-353475495
Badges periodically and randomly don't load :(
I saw this on January 11:
And today (January 16):
And everything is normal now. Strange things.
And again:
I have been noticing the same thing with a lot of badges, but no certain badges in particular. My best guess would be something to do with the rate limiting. I have also seen quite a few lately.
Looks like the server has been failing a lot more often the past couple days also (although s1 has had 100% uptime the past 5 days)
Example of repo: https://github.com/AlexWayfer/flame
Example of a cached badge by GitHub: https://camo.githubusercontent.com/a6493fc03433558a4434b0089c399ab02fb47c79/68747470733a2f2f696d672e736869656c64732e696f2f636f6465636c696d6174652f6d61696e7461696e6162696c6974792f416c65785761796665722f666c616d652e7376673f7374796c653d666c61742d737175617265
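In case it helps with debugging: the last path segment of a camo URL is the hex-encoded URL of the original image, so the badge behind a cached link can be recovered. A small Python sketch, where the function name is mine and the URL is the one from the comment above:

```python
from urllib.parse import urlparse


def decode_camo_url(camo_url):
    """Return the original image URL hex-encoded in the last path segment."""
    hex_part = urlparse(camo_url).path.rsplit("/", 1)[-1]
    return bytes.fromhex(hex_part).decode("utf-8")


camo = "https://camo.githubusercontent.com/a6493fc03433558a4434b0089c399ab02fb47c79/68747470733a2f2f696d672e736869656c64732e696f2f636f6465636c696d6174652f6d61696e7461696e6162696c6974792f416c65785761796665722f666c616d652e7376673f7374796c653d666c61742d737175617265"
print(decode_camo_url(camo))
```

Requesting the decoded URL directly, without going through camo, is one way to tell whether a failure comes from Shields itself or from GitHub's cached copy.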