mnaydan opened 8 months ago
When I checked the logs on the two VMs running the PPA web application, I couldn't find any errors in either the nginx log or the django logs. Francis reminded me that I can use datadog to look at traffic going to the load balancer, and once we figured out the correct way to filter to just the requests going to PPA, we could see a large number of warnings and errors with the 502 gateway error (an upstream timeout, meaning the PPA application somehow isn't responding in time). All of the errors I saw in the datadog logs were triggered by bots crawling the site with the malformed URLs that include multiple clusters. This is what triggered the decision to update production with the fixes for the cluster search URLs.
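For context, the cluster search fix boils down to rejecting those malformed requests up front instead of letting them tie up a worker until nginx times out. Here's a minimal sketch, assuming a Django function-based view with a hypothetical `cluster` query parameter and numeric cluster ids (not the actual PPA code):

```python
# Hypothetical sketch - the view name, "cluster" parameter, and numeric ids
# are assumptions, not the actual PPA implementation.
from django.http import HttpResponseBadRequest, JsonResponse


def cluster_search(request):
    """Search view guarded against malformed multi-cluster URLs."""
    clusters = request.GET.getlist("cluster")
    # A well-formed request carries at most one cluster id; the bots were
    # requesting URLs with several clusters mashed together, which the app
    # churned on until nginx gave up and returned a 502.
    if len(clusters) > 1 or any(not c.isdigit() for c in clusters):
        return HttpResponseBadRequest("invalid cluster parameter")
    # ... normal search handling would go here ...
    return JsonResponse({"cluster": clusters[0] if clusters else None})
```

Returning a 400 immediately keeps these requests out of the slow path, so the only trace they leave is a warning in the access log rather than an upstream timeout.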
I've been keeping an eye on datadog today, and those errors are trailing off significantly compared to the steady stream of them we were seeing last week. Hopefully, this means the problem is fixed - but even if it isn't, it should now be much easier to find any actual errors in the logs because they won't be buried by all the cluster URL problems.
Francis said he would do the work to get our application logs included in datadog so that I can look at them there, rather than having to log in and look at two different VMs. I think there may be a related step / possible blocker: getting CDH ansible scripts running in Ansible Tower, which we want for other reasons anyway.
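For reference, on the application side this mostly comes down to writing our logs somewhere the Datadog agent can tail them. A minimal sketch of a Django `LOGGING` setting along those lines, with placeholder paths, logger names, and levels (not our actual configuration):

```python
# Hypothetical settings.py snippet - the file path, logger names, and levels
# are placeholders; the Datadog agent would be configured separately to tail
# this file.
LOGGING = {
    "version": 1,
    "disable_existing_loggers": False,
    "formatters": {
        "verbose": {
            "format": "%(asctime)s %(levelname)s %(name)s %(message)s",
        },
    },
    "handlers": {
        "app_file": {
            "class": "logging.handlers.RotatingFileHandler",
            "filename": "/var/log/ppa/application.log",
            "maxBytes": 10 * 1024 * 1024,
            "backupCount": 5,
            "formatter": "verbose",
        },
    },
    "loggers": {
        "django": {"handlers": ["app_file"], "level": "WARNING"},
        "ppa": {"handlers": ["app_file"], "level": "INFO"},
    },
}
```

With everything going to one file per VM, the agent can ship both django and application logs without us having to log in to either machine.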
The number of errors I'm seeing in the datadog logs has gone down significantly, but they haven't gone away entirely.
Here's the error/warn incident graph for the past 15 days:
And here's the error/warn incident graph for the past 7 days:
When I inspect manually, nearly all the errors I look at are from those bogus multi-cluster URLs.
@rlskoeser do the errors at the current level seem to be causing problems for users? Anecdotally, I haven't encountered the server error page when I've been on the PPA recently, but I haven't been doing data work for long stretches like I was before. Maybe it's something to monitor when I do my next round of data work.
@mnaydan my hope is that the significantly lower level of errors is enough for us to stop seeing these. It should continue to trail off, since the bots crawling the site will no longer be queuing up these bogus cluster URLs to crawl. All the errors I see in datadog now are triggered by bots, so hopefully no users are seeing the error at this point.
I agree that we should continue to keep an eye on it - you as you use the site, and I'll keep glancing at datadog occasionally.
I wish I had been more careful about the scale of errors before - we're still getting errors but the scale is much lower. The two-week graph I posted on March 7 had a max of 2k on the y-axis; the one I generated today has a max of 500.
This is a generic error page from the library that appears at random (no pattern detected) on both the frontend and backend while navigating the site. I encountered it a lot during the Brogan data work, and we discovered that at least some of the errors are 502 bad gateway errors. We made a patch release in which we restarted both servers, but that didn't solve the problem. Francis and Alicia then suggested another patch release to stop the bots from crawling the aggregated cluster URLs, since those requests were causing lots of errors and crowding the logs. We did that, but we may need to do another patch release to redirect the aggregated cluster URLs if traffic to those URLs doesn't decrease. Anecdotally, I haven't encountered the error page today as I've been navigating, but I'm not spending as much time on the site as I was during the data work. @rlskoeser please add or amend anything I missed.
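If we do end up doing the redirect patch, something like the following middleware would probably cover it - a hedged sketch with a placeholder parameter name and redirect target, not a worked-out implementation:

```python
# Hypothetical middleware sketch - the "cluster" parameter name and the
# /archive/ redirect target are placeholders, not the actual PPA URLs.
from django.http import HttpResponsePermanentRedirect


class MultiClusterRedirectMiddleware:
    """Redirect leftover aggregated-cluster URLs instead of serving errors."""

    def __init__(self, get_response):
        self.get_response = get_response

    def __call__(self, request):
        # Requests still carrying multiple cluster parameters get a permanent
        # redirect to the plain archive search, so well-behaved crawlers drop
        # the bogus URLs from their queues instead of retrying them.
        if len(request.GET.getlist("cluster")) > 1:
            return HttpResponsePermanentRedirect("/archive/")
        return self.get_response(request)
```

It would sit near the top of `MIDDLEWARE` so the redirect happens before any search handling runs.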