badges / shields

Concise, consistent, and legible badges in SVG and raster format
https://shields.io
Creative Commons Zero v1.0 Universal

"Waiting for img.shields.io" #445

Closed weitjong closed 7 years ago

weitjong commented 9 years ago

After changing our project website to use shields.io, it frequently takes a long time to render the badges. Whenever we have a problem with badge rendering on our website, http://shields.io/ has a similar problem, so I guess the problem is not on our side.

Is there any way to improve this? I am considering switching back to the original badges from their respective service providers. It is better to have a reliable low-resolution badge than an unreliable high-resolution one.

espadrine commented 9 years ago

I see you are using Coverity and Travis badges. I'll try to monitor their response times. Could you let me know when they are not on par with what we should expect?

weitjong commented 9 years ago

Thanks for the prompt reply, much appreciated. I am sorry that I did not note down the exact time when it happened. What I can confirm is that when it did happen, the sample badges on http://shields.io/ were also not rendered properly. I may also add that the outage did not take long to recover. It is the outage frequency that worries me. I have caught it with its pants down three or four times already since we switched to shields.io.

weitjong commented 9 years ago

It is happening now.

weitjong commented 9 years ago

It seems to have recovered again.

espadrine commented 9 years ago

Thanks a lot for reporting. There seems to have been a set of sudden surges in request frequency, past 400 per second (from a normal 40), from servers at Amazon AWS. That caused a lot of request time-outs and made Redis fail, which in turn caused a small number of crashes the server recovered from (about 50).

There was also a Heroku slowdown at the same time, caused by our using slightly over 512MB of memory, which made things worse.

I plan on switching away from Heroku. With better infrastructure we can hopefully use a lot more memory and absorb surges like these.

[screenshot: heroku]

weitjong commented 9 years ago

Thanks for looking into this.

untitaker commented 9 years ago

I frequently experience this with the Gratipay shields.

ionelmc commented 9 years ago

This issue still happens

stephnr commented 9 years ago

It is currently happening with badge rendering.

weitjong commented 9 years ago

This may sound stupid (or arrogant, depending on how you read it), but I was wondering whether this issue existed before I raised it. More precisely, did it exist before our project website switched to shields.io badges? Or is it just that I am the first who bothered to report it?

I would rather not reveal the link to our project website here, to avoid the impression that I am promoting it, but I believe it contributes quite an amount of workload to your server because:

With all of these combined, it does not matter where our website visitors navigate to; each page being rendered generates a small workload on your server. The accumulated workload could become significant, depending on how well your server scales up on demand. Hence I am wondering whether it could be our own doing that caused this in the first place! I hope I am just worrying too much, though.

espadrine commented 9 years ago

@weitjong Based on my server logs, there is no single website generating a majority of the traffic, although GitHub is the most significant. Issues related to the responsiveness of the server have existed for as long as we have used the current configuration (see e.g. https://github.com/badges/shields/issues/226), and have been solved by caches and algorithms so far. The main issue is that, as far as I know, the Heroku system doesn't give the server software a way to detect when it enters slowdown mode. I will change the server setup, however.

weitjong commented 9 years ago

Thanks for your prompt reply. Again much appreciated.

That is quite a relief to hear. I just want to come clean. :smile: Our website is hosted on github.io too. But I guess that does not change anything, since, as you said, the issue existed long before.

ms-ati commented 9 years ago

Just to add: the badges are still slow.

mjackson commented 8 years ago

I'm seeing timeouts on my Travis and npm badges today at https://github.com/rackt/history.

weitjong commented 8 years ago

The problem seems to be getting worse. I have temporarily switched our website back to the lower-res badges from their original sources instead of shields.io. It probably does not make any difference to shields.io performance, but who knows.

espadrine commented 8 years ago

Indeed, it won't make a difference. Unfortunately, the server became unreachable at the worst possible time — just as I started going to sleep. I'm investigating the issue with my hosting provider.

espadrine commented 8 years ago

I have just rebooted the server, and things are running again.

ms-ati commented 8 years ago

@espadrine have you considered hosting the images on S3 behind CloudFront? Couldn't you cache them there, if not actually render them there as a static site as the primary representation?

espadrine commented 8 years ago

@ms-ati There is a difference between having slow badges, having downtime, and having incorrect badges.

I strive to produce badges that are as correct as possible, as fast as possible, and with as little downtime as possible.

I switched hosting providers a week ago to fix the speed issue; badges should now be exactly as slow as the service that produces the information they provide (plus network lag, which is generally negligible). More importantly, while before the server entered a severe slowdown for one hour every week at peak times, that should no longer happen. So far, that seems accurate.

Today, the VPS went down and did not restart, for which I am talking to the provider. Having a CloudFront cache would not help when the machine the server runs on is not even up to serve images.

Cache really is not the issue right now. I have a cache that I use when vendors (Travis CI, etc) do not respond, when data changes rarely, or when the server receives duplicate requests in rapid fire.

So: are the badges still slow for you?

gnzlbg commented 8 years ago

Is it down again? :)

ms-ati commented 8 years ago

@espadrine Shouldn't a CloudFront or Fastly cache, placed in front of the badge system as a whole, smooth over any temporary outages? In other words, isn't HTTP caching itself a well-suited mechanism for increasing the availability of the badge URLs?

espadrine commented 8 years ago

Yes, the server went dark again. I'm getting annoyed at OVH. I'd rather not spend another week of holidays setting things up yet again on a new server (say, Digital Ocean), but two downtimes a week is obviously unacceptable. I sent them an email; I will see what they say.

@ms-ati I use HTTP caching. Obviously, most people want badges to have accurate information, though. It is irrelevant anyway when you can't even access the IP that the DNS points your browser to.

tankerkiller125 commented 8 years ago

@espadrine Would it make more sense to use AWS with auto-scaled, load-balanced instances? That would let you handle large amounts of traffic quickly and easily according to your auto-scaling policy, so that when, say, 80% of the CPU is in use, AWS automatically spins up an exact copy of the instance and starts sending traffic to it alongside the original.

ms-ati commented 8 years ago

Would it also make sense to put a CDN in front, configured to successfully return the last value when the origin is unreachable? This seems like a good case for that.


espadrine commented 8 years ago

@tankerkiller125 Shields.io is not CPU-bound.

@ms-ati What is the difference between a CDN that returns the last value when the origin is unreachable, and a cache that does the same thing? We currently have the latter.

ms-ati commented 8 years ago

> What is the difference between a CDN that returns the last value when the origin is unreachable, and a cache that does the same thing? We currently have the latter.

@espadrine Good question! I believe the difference is in robustness, or downtime, which is what this ticket discusses.

Using a cache as we have today helps performance, and it may provide robustness against downtime of the data sources. However, as this ticket attests, it doesn't provide robustness against downtime of the shields.io service endpoint itself.

I think that using a CDN "in front" of shields.io, configured to return the last value it has cached for any URL when the origin is down, would provide robustness against the service itself being down.

That way, badges that have previously been requested will continue to appear, and just won't be updated until the service is back up.
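To make it concrete, the sort of configuration I have in mind would look roughly like this. It is only a sketch (the origin name is a placeholder, and shields.io's real setup may differ), but nginx, and CDNs built on similar ideas, can do this out of the box:

```nginx
# Keep a disk cache of badge responses and fall back to the last good copy
# whenever the origin errors out, times out, or is unreachable.
proxy_cache_path /var/cache/badges keys_zone=badges:10m max_size=1g;

server {
    listen 80;

    location / {
        proxy_pass http://badge-origin.example;   # placeholder origin server
        proxy_cache badges;
        proxy_cache_valid 200 5m;                 # treat badges as fresh for 5 minutes
        # Serve the stale (last known) badge instead of an error
        # when the origin is down or slow.
        proxy_cache_use_stale error timeout http_500 http_502 http_503 http_504;
    }
}
```

IIRC, Fastly also supports the standard stale-if-error caching directive, which achieves the same behaviour at the edge.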

tankerkiller125 commented 8 years ago

@ms-ati So, for example, like Cloudflare's "Always Online" feature, where any cached page is displayed to the user?

@espadrine The CPU was just an example; there are many different parameters that you can use to define the scaling groups.

espadrine commented 8 years ago

That is a fair distinction. I wonder if CloudFlare allows that kind of behaviour. I'll look into it.

I also wonder what HTTP status code I should send for invalid badges. 504 Gateway Timeout seems right?

tankerkiller125 commented 8 years ago

@espadrine It depends on the reason it's invalid. If it's invalid because, say, the project doesn't exist on the source's servers, then it should be a 404. But if it's because shields.io simply can't reach, for example, GitHub, then it should return a 504.
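Roughly, the mapping I mean would look like this in a Node handler; the vendor-lookup and rendering helpers are made up for illustration, not shields.io's actual code:

```typescript
import http from "node:http";

// Hypothetical helpers, for illustration only.
declare function fetchVendorData(req: http.IncomingMessage): Promise<object | null>;
declare function renderBadge(data: object): string;

http.createServer(async (req, res) => {
  try {
    const data = await fetchVendorData(req);
    if (data === null) {
      // The project itself does not exist on the source's servers.
      res.writeHead(404).end("not found");
      return;
    }
    res.writeHead(200, { "Content-Type": "image/svg+xml" });
    res.end(renderBadge(data));
  } catch {
    // The vendor (GitHub, Travis, ...) could not be reached in time.
    res.writeHead(504).end("vendor unreachable");
  }
}).listen(8080);
```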

ms-ati commented 8 years ago

@tankerkiller125 Yes, both CloudFlare and CloudFront have support for this (IIRC).

@tankerkiller125 Agreed, 404 for a non-existent resource, 504 for temporarily unavailable

tankerkiller125 commented 8 years ago

I might also note that putting a CDN in front of shields.io will help decrease the load as well (because cached requests are served from the CDN instead of the origin server).

weitjong commented 8 years ago

If the badge is for the build status, then serving a cached version might not be desirable, especially for an active project.

ms-ati commented 8 years ago

@weitjong: Do you prefer to have badges unavailable, rather than serving the last known version, when the service is down? That is what we are discussing.

Caching is used regardless, and the badges should play well with HTTP caching semantics. The only change I'm advocating is to use a CDN configuration to ensure that, when the endpoint is down, the cached badges still show.

The only change, in other words, is from (no badge) => (last badge) when service is down. Everything else, including timeliness of updates, would stay the same.

weitjong commented 8 years ago

I see. Thanks for your explanation. I did misunderstand it earlier. Having said that, I think I prefer "status unknown" or something like that instead of "last known".

espadrine commented 8 years ago

CloudFlare doesn't seem able to do what we want here. Always Online works through a weekly crawler, which would never see any of our users' badges. It only caches HTML, and up to 3 pages.

tankerkiller125 commented 8 years ago

Maybe some of us with spare computing power lying around (I have several older XP machines, for example) could donate that power by running the application and then giving you the IP, so that you can add it to some sort of load-balancing proxy? This would also help with GitHub's rate limiting, because we would all have different keys.

patrikhuber commented 8 years ago

I've had my badges repeatedly not loading over the past few days, and right now, img.shields.io seems to even have gone down. Is anybody else experiencing similar problems?

espadrine commented 8 years ago

It doesn't seem down, and the load average is pretty low (0.75, 0.79, 0.83). We might improve performance with node cluster; it's hard to tell before we do it.
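For reference, the node cluster idea amounts to something like this; it is a bare sketch to show the shape of it, not the actual shields server code:

```typescript
import cluster from "node:cluster";
import { cpus } from "node:os";
import http from "node:http";

if (cluster.isPrimary) {                    // `isMaster` on older Node versions
  // Fork one worker per CPU core; the workers share the listening socket,
  // so requests are spread across processes instead of queueing on one.
  for (let i = 0; i < cpus().length; i++) cluster.fork();
  cluster.on("exit", () => cluster.fork()); // replace a worker that crashes
} else {
  http.createServer((_req, res) => {
    res.end("badge rendering would happen here");
  }).listen(8080);
}
```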

@tankerkiller125 it's a nice idea! However, it would require all instances to be sysadmin'ed, as if m of n go down, m/n*100% of requests would fail with a DNS round robin. Also, they would all need very decent network IO speed.

tankerkiller125 commented 8 years ago

@espadrine If you ran an NGINX load balancer (not DNS round robin), it could verify that each server is online, and you could also set weights (so that people with less network IO speed get less traffic than someone with good network IO).

It's just a thought from a sysadmin running a couple of decently large websites.
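For what it's worth, the kind of configuration I mean looks roughly like this (the addresses and weights are placeholders):

```nginx
# Weighted load balancing with passive health checks: unlike DNS round robin,
# nginx stops sending traffic to a backend after repeated failures.
upstream badge_servers {
    server 203.0.113.10 weight=3;                               # good network IO
    server 198.51.100.20 weight=1 max_fails=3 fail_timeout=30s; # slower volunteer box
}

server {
    listen 80;
    location / {
        proxy_pass http://badge_servers;
    }
}
```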

patrikhuber commented 8 years ago

I don't know; I get this very often, more than 50% of the time (the screenshot is from right now): [screenshot]

Interestingly, the Travis badge loaded (it is also a shields.io badge, but it took half a second, while the AppVeyor badge, for example, loads instantly); the other ones did not.

It can't be my own ISP, can it?

Edit: OK, the Chrome developer console shows these 404s:

GET https://camo.githubusercontent.com/f1b196b69bd0c027056a9b8dfe9329f74c0525fe…72696b68756265722f3464666163652e7376673f7374796c653d666c61742d737175617265 404 (Not Found) README.md:609
GET https://camo.githubusercontent.com/1f8b6bd33a7dfd3ed65e72cc1f23b87ca46ef325…72696b68756265722f3464666163652e7376673f7374796c653d666c61742d737175617265 404 (Not Found) README.md:612 

That's something with GitHub's cache, isn't it? It seems to replace the shields.io URLs with these? And then they're not found...?

espadrine commented 8 years ago

@patrikhuber Hmm, the camo thing is weird. In principle, camo simply remembers the link between each id and the corresponding URL, and calls that URL. Can you give me the link to the readme, or the URLs?

@tankerkiller125 That would work, but it would cost more in time and money than having a simple DNS round robin between just two servers, which I haven't even scaled to yet.

patrikhuber commented 8 years ago

@espadrine Sure! All my repos are affected in fact. See for example the readme of https://github.com/patrikhuber/4dface.

Ah, and I don't think it has anything to do with my ISP, since I have the same problem at Uni and via 4G. Maybe I'm just too dumb to set the links correctly or made another mistake somewhere? (but that seems a bit odd since there's not actually much one can do wrong ;-) )

espadrine commented 8 years ago

@patrikhuber Ah, never mind, the camo URLs didn't work for me because there is an ellipsis in the middle. In the console, they 404 because they went past camo's timeout limit. Badges that call GitHub's API currently respond slowly, probably because I reached their rate limit. I'll try to reach out to them.

patrikhuber commented 8 years ago

@espadrine Right, the Chrome developer console inserted the ... because the link is too long. However, if you copy the correct URL yourself, for example this one, GitHub returns a "Not Found". Maybe that helps in your debugging.

tankerkiller125 commented 8 years ago

@espadrine I've done some more research on DNS round robin. Because of the way clients work, they will keep trying new IPs returned by the DNS until they get a valid response from a server. In other words, if there are 50 servers and 25 of them are down, users are still going to get a response 100% of the time because of the way the client software works.

> if one of the servers becomes unreachable, the client's web browser will try the next IP address record returned by the DNS server and repeat the process to get the website.

I'm not sure how GitHub's Camo service will handle it, but I do know that a browser will be just fine (I checked it out myself).

AshleyMedway commented 8 years ago

This is happening now: CloudFlare HTTP error 522, connection timed out.

patrikhuber commented 8 years ago

I think my issue is a different one. I'm still having this issue with camo.githubusercontent.com. It doesn't load my badges at all, see my repo https://github.com/patrikhuber/4dface. Any update on this, or anything I can do to help debug it?

hotrush commented 8 years ago

[two screenshots] It seems it just works when it wants to.

tankerkiller125 commented 8 years ago

@hotrush I think the load that the servers are under is causing the requests to the APIs to complete too slowly for the software, causing the "vendor unresponsive" errors.

espadrine commented 8 years ago

Note: I changed to having two servers with DNS round robin. We'll see if there is some improvement.