hobailey closed 8 years ago
@hobailey thank you for the report here. We hit this problem yesterday, but please guys... improve the status page so that it reflects this kind of thing in real time. That way we wouldn't need to bother you by email when we find such problems, and we wouldn't need to wait for your confirmation to know whether the problem is on your side or in our code.
IMHO it doesn't make sense that the API is returning 500 errors on sandbox right now for that endpoint while the status page says nothing about it.
@javiercr I totally agree with you, and we should have reacted much quicker in this case - I'm sorry. We thought the supplier would have resolved it almost instantly (but it turns out their attitude towards sandbox isn't the same as ours!). FYI, we only use the status page for major/critical outages (since it notifies absolutely everyone, and for this sort of problem that seems slightly over the top). Other than that, things will be posted in this repo :-)
I actually reported this at 3:30 BST on the 19th October. I've just tested it though and it appears to be now resolved.
Thanks @dannytip - the problem is indeed now resolved :-)
@hobailey I understand your point, but we've seen major outages (production env) not being reflected in the status page, or only reflected after some time. I don't think that invalidates the point of having a real-time status page that shows the actual current status. By real time I mean something like this:
https://status.github.com/ https://status.intercom.com/ https://status.twilio.com/
Ideally with the number/percentage of 5xx errors alongside response times. Right now there is no big difference for us between the status page and manual outage reports here or on Twitter.
As developers, when we find anomalies while talking to an external service, the first thing we do is check the status page. If we need to wait for a manual update of that page, chances are we will already have sent emails or tweets to your support team complaining, which means unnecessary extra work for your team.
Thanks!
Hey @javiercr, I'm not sure it's fair to say:
"we've seen major outages (production env) not being reflected in the status page"
We take production statuses very seriously, so if you feel there are particular cases where this hasn't been the case, please don't hesitate to highlight them. I do agree though that there should be the same transparency for the sandbox, and this is a work in progress :-)
Just to be clear though, in this case (where this is a problem with a specific endpoint), I'm not sure how you could reliably have realtime updates - the examples you gave were indeed realtime, but generic for whole services, not individual components.
As for improved stats on the status page - this is something we're well aware of, and is planned for the near future!
@hobailey thanks for the response, and I'm glad to hear you guys are planning to improve the status page :)
"Just to be clear though, in this case (where this is a problem with a specific endpoint), I'm not sure how you could reliably have realtime updates"
You could plot the percentage of 5xx error responses out of the total. And if that value is higher than X, then trigger an alarm. In the past you have created alerts such as this "unusual HTTP 500 error rate", that looks like something that could be automated and plotted into a chart in the status page.
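To make that suggestion concrete, here is a minimal sketch of the idea: keep the most recent status codes, compute the share of 5xx responses, and raise an alarm once it exceeds a threshold. Everything in it is an assumption made for illustration (the `ErrorRateMonitor` name, the 5% threshold standing in for "X", the window size and the minimum sample count); it is not how the API provider's internal alerting actually works.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class ErrorRateMonitor:
    """Hypothetical monitor: share of 5xx responses over the last `window` requests."""

    threshold: float = 0.05  # assumed value for "X": alarm above 5% of responses
    window: int = 100        # assumed number of recent responses to keep
    min_samples: int = 20    # avoid alarming on a handful of requests
    _statuses: deque = field(default_factory=deque)

    def record(self, status_code: int) -> None:
        """Store one response status, dropping the oldest beyond the window."""
        self._statuses.append(status_code)
        if len(self._statuses) > self.window:
            self._statuses.popleft()

    def error_rate(self) -> float:
        """Fraction of stored responses whose status is in the 5xx range."""
        if not self._statuses:
            return 0.0
        errors = sum(1 for status in self._statuses if 500 <= status <= 599)
        return errors / len(self._statuses)

    def is_alarming(self) -> bool:
        """True when enough samples exist and the 5xx share exceeds the threshold."""
        enough = len(self._statuses) >= self.min_samples
        return enough and self.error_rate() > self.threshold


# Usage: 90 successful responses followed by 10 server errors.
monitor = ErrorRateMonitor(threshold=0.05, window=100, min_samples=20)
for code in [200] * 90 + [500] * 10:
    monitor.record(code)
print(f"5xx rate: {monitor.error_rate():.0%}, alarm: {monitor.is_alarming()}")
```

Keeping one monitor per endpoint (for example in a dict keyed by path) would give the per-component view discussed above, and the value returned by `error_rate()` is the number that could be plotted on a status page.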
OK, makes sense. We have this sort of alarm internally, but I see your point about being more transparent and linking that to the status page (especially in the case of sandbox, since we're already very reactive in production, where a status would be posted 24/7, even if it is manual at this stage). All good feedback, thanks :-) And I assure you we're working on improving these points!
Background
UserId
/bankaccounts/gb/ you will systematically receive an error 500
Environments
Status