Website down - Githubissues

endeffects commented 1 year ago

@Glench Your website is down with an nginx error.

Glench commented 1 year ago

The site is back up now. I'm actively looking into causes of the issue and how to mitigate it. I'll update here with more information.

Glench commented 1 year ago

ExtensionPay has been experiencing stability issues every night at 1am. Sometimes these lasted only a few minutes, but over the past few days the instability has been constant. The nominal cause of the issue is the server running out of memory. I've increased the server memory, which seems to have made the site stable. I'll continue to investigate to mitigate any further issues.

Glench commented 1 year ago

Turns out this has helped but not fixed the issue. We are continuing to investigate.

amonam commented 1 year ago

Just went down a few minutes ago.

Glench commented 1 year ago

We believe we have found the underlying issue. We have deployed a temporary fix that should help. More details to come.

Glench commented 1 year ago

The site has been stable for many hours now. During the instability users would sometimes receive slow responses or HTTP 499 errors. We'll have a breakdown of the issue, causes, and fixes soon.

Glench commented 1 year ago

The site has continued to be stable with no appreciable downtime for a couple days now.

Details

Starting May 16, some short instabilities were detected in ExtensionPay's web services, mostly around 12:58am US/Eastern time. We upgraded our database backup software, which seemed to be contributing to the instability and causing spikes in CPU and memory usage. We also optimized some database parameters to increase performance.

Even so, each night the length of these instabilities seemed to increase slightly, within the 1am-1:30am range, until May 22, when they impacted service with extended instability through 9:30am. During this period, many clients received HTTP 499 or 500 errors or slow response times due to high CPU/memory usage on the server. Our response began around 6am. To mitigate the instability and buy us time to investigate, we upgraded the server, which took only a few minutes of downtime and helped to stabilize the site.

We discovered a bug in our caching code that caused a memory leak that slowly maxed out server resources, which seemed to be the main cause of the instability. A short-term fix was deployed at 9:30am on May 22 and a permanent fix was deployed yesterday around 10am. Additionally, we deployed more database performance optimizations which have significantly reduced server resource usage.

There is still a short period of instability (<1 minute) at 12:58am every night that we'll continue to investigate. Going forward, we now have more robust monitoring for instability issues as well as more automated testing for our caching layer.
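The breakdown above does not say what the caching bug actually was. As a general illustration only, one common way a cache leaks memory is by growing without any size limit. The sketch below shows a cache with a hard cap on entries, plus a small helper of the kind an automated test for a caching layer might use. The names `BoundedCache`, `max_entries`, and `simulate_requests` are hypothetical and are not taken from ExtensionPay's code, which may work quite differently.

```python
from collections import OrderedDict


class BoundedCache:
    """Hypothetical LRU cache with a hard cap on the number of entries."""

    def __init__(self, max_entries: int) -> None:
        if max_entries <= 0:
            raise ValueError("max_entries must be positive")
        self.max_entries = max_entries
        self._data: OrderedDict = OrderedDict()

    def get(self, key, default=None):
        if key not in self._data:
            return default
        self._data.move_to_end(key)  # mark as most recently used
        return self._data[key]

    def set(self, key, value) -> None:
        self._data[key] = value
        self._data.move_to_end(key)  # newest entries live at the end
        while len(self._data) > self.max_entries:
            self._data.popitem(last=False)  # evict the least recently used

    def __len__(self) -> int:
        return len(self._data)


def simulate_requests(cache: BoundedCache, n_requests: int) -> int:
    """Insert one distinct key per simulated request; return the final size."""
    for i in range(n_requests):
        cache.set(f"user:{i}", {"paid": False})
    return len(cache)
```

With a cap in place, a check such as `simulate_requests(BoundedCache(100), 10_000) == 100` confirms that the cache stops growing no matter how many distinct keys pass through it. An unbounded dictionary in the same loop would keep all 10,000 entries.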

endeffects commented 1 year ago

Thanks for the great work and the quick investigation.

Glench / ExtPay

Website down #137