Closed svonworl closed 3 months ago
A couple of thoughts before my 3 day weekend. :)
- How will this be invoked? Via Uptime Robot as had been discussed in Slack?
- Any concerns about false positives on the ES check? Lags in indexing, multiple containers, it seems possible that the DB counts and ES counts could temporarily be out of sync, but they would sync eventually. I know we typically don't have enough publishing activity where this is an issue, but maybe it could be some day (or there's a .dockstore.yml that publishes/unpublishes 32 workflows). Maybe that's why you have the
4
threshold in the other PR?
See description of https://github.com/dockstore/dockstore-deploy/pull/762
Attention: Patch coverage is 91.11111%
with 4 lines
in your changes are missing coverage. Please review.
Project coverage is 74.52%. Comparing base (
aaaf076
) to head (f3f570c
). Report is 2 commits behind head on develop.
:umbrella: View full report in Codecov by Sentry.
:loudspeaker: Have feedback on the report? Share it here.
Issues
4 New issues
0 Accepted issues
Measures
0 Security Hotspots
91.2% Coverage on New Code
0.0% Duplication on New Code
Description This PR adds health checks to the webservice that detect when:
Kathy convinced me that these are indeed health checks, so they're run and reported via the existing
/metadata/health
endpoint and associated machinery. They do differ from some of the existing health checks: although they signal a condition that's not entirely healthy, their failure indicates a non-fatal condition, and the webservice should continue to run, it need not be stopped/replaced/etc. That's ok, because currently, our monitoring software only replaces the webservice task when theconnectionPool
health check fails.We calculate how long the Liquibase lock has been held by comparing the current time against when it was last granted, per the database table. If the lock has been held more than 10 minutes, we declare it held too long.
Initially, I tried to manage the required
Sessions
"manually" viaSessionFactory.openSession
andManagedSessionContext.bind
. However, for unknown reasons, this screwed up other Sessions in subsequent unrelated requests, causing them to malfunction withIllegalStateExceptions
etc. So, instead, I usedUnitOfWorkAwareProxyFactory
to wrap thecheck()
methods, which is cleaner and worked as advertised. I cribbed the subsequently-rejected manual session management code from https://github.com/dockstore/dockstore/blob/develop/dockstore-webservice/src/main/java/io/dockstore/webservice/DockstoreWebserviceApplication.java#L526, so its continued presence worries me a little.When a health check fails, the resource method logs an
ERROR
level message containing the health check name. We use this log entry to create a Cloudwatch alarm in companion PR https://github.com/dockstore/dockstore-deploy/pull/762Review Instructions Trigger the exceptional conditions on qa and confirm that the alarms happen.
Issue https://ucsc-cgl.atlassian.net/browse/SEAB-6225 https://ucsc-cgl.atlassian.net/browse/SEAB-4825
Security and Privacy
No unusual concerns.
Please make sure that you've checked the following before submitting your pull request. Thanks!
mvn clean install
@RolesAllowed
annotation