dockstore / dockstore

Our VM/Docker sharing infrastructure and management component
https://dockstore.org/
Apache License 2.0

SEAB-6225/4825: Add Liquibase lock and Elasticsearch consistency checks #5843

Closed svonworl closed 3 months ago

svonworl commented 3 months ago

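Description

This PR adds health checks to the webservice that detect when:

  • the Liquibase lock has been held too long
  • the Elasticsearch index is inconsistent with the database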

Kathy convinced me that these are indeed health checks, so they're run and reported via the existing /metadata/health endpoint and associated machinery. They do differ from some of the existing health checks: a failure signals a condition that's not entirely healthy but is non-fatal, so the webservice should continue to run and need not be stopped, replaced, etc. That's ok, because our monitoring software currently replaces the webservice task only when the connectionPool health check fails.

We calculate how long the Liquibase lock has been held by comparing the current time against when it was last granted, per the database table. If the lock has been held more than 10 minutes, we declare it held too long.
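The lock-age calculation above can be sketched as follows. This is a minimal illustration, not the PR's LiquibaseLockHealthCheck: the class and method names are hypothetical, and it assumes the caller has already read the LOCKED and LOCKGRANTED columns from Liquibase's DATABASECHANGELOGLOCK table. Only the 10-minute threshold comes from the description.

```java
import java.time.Duration;
import java.time.Instant;

// Hypothetical helper for illustration; not the class added by this PR.
final class LiquibaseLockAge {
    // Threshold stated in the PR description.
    static final Duration MAX_HELD = Duration.ofMinutes(10);

    private LiquibaseLockAge() { }

    /**
     * Decides whether the Liquibase lock has been held too long.
     *
     * @param locked      value of the LOCKED column (assumed already read by the caller)
     * @param lockGranted value of the LOCKGRANTED column, or null if the lock is not held
     * @param now         the current time
     * @return true only if the lock is held and was granted more than MAX_HELD ago
     */
    static boolean heldTooLong(boolean locked, Instant lockGranted, Instant now) {
        if (!locked || lockGranted == null) {
            return false;
        }
        return Duration.between(lockGranted, now).compareTo(MAX_HELD) > 0;
    }
}
```

Passing the current time in as a parameter, rather than calling Instant.now() inside the method, keeps the comparison easy to test with fixed timestamps.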

Initially, I tried to manage the required Sessions "manually" via SessionFactory.openSession and ManagedSessionContext.bind. However, for unknown reasons, this screwed up other Sessions in subsequent unrelated requests, causing them to malfunction with IllegalStateExceptions etc. So, instead, I used UnitOfWorkAwareProxyFactory to wrap the check() methods, which is cleaner and worked as advertised. I cribbed the subsequently-rejected manual session management code from https://github.com/dockstore/dockstore/blob/develop/dockstore-webservice/src/main/java/io/dockstore/webservice/DockstoreWebserviceApplication.java#L526, so its continued presence worries me a little.

When a health check fails, the resource method logs an ERROR level message containing the health check name. We use this log entry to create a CloudWatch alarm in companion PR https://github.com/dockstore/dockstore-deploy/pull/762.
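A minimal sketch of that logging step, assuming the check results are available as a map from check name to a healthy/unhealthy flag. The class name, method names, and message text are hypothetical, and java.util.logging (whose SEVERE level stands in for ERROR) is used only to keep the example dependency-free; the webservice's actual logger and message format may differ.

```java
import java.util.List;
import java.util.Map;
import java.util.logging.Logger;
import java.util.stream.Collectors;

// Hypothetical illustration; not the resource method from this PR.
final class HealthCheckFailureLogger {
    private static final Logger LOG =
        Logger.getLogger(HealthCheckFailureLogger.class.getName());

    private HealthCheckFailureLogger() { }

    // Returns the names of the checks that reported unhealthy, sorted for stable output.
    static List<String> failedCheckNames(Map<String, Boolean> healthyByName) {
        return healthyByName.entrySet().stream()
            .filter(entry -> !entry.getValue())
            .map(Map.Entry::getKey)
            .sorted()
            .collect(Collectors.toList());
    }

    // Logs one SEVERE line per failed check, each containing the check name,
    // so that a log-based alarm can match on it. Returns the failed names.
    static List<String> logFailures(Map<String, Boolean> healthyByName) {
        List<String> failed = failedCheckNames(healthyByName);
        for (String name : failed) {
            LOG.severe("Health check failed: " + name); // message text is an assumption
        }
        return failed;
    }
}
```

Emitting one log line per failed check, rather than a single combined line, would let an alarm filter match on an individual check name.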

Review Instructions

Trigger the exceptional conditions on qa and confirm that the alarms happen.

Issue

https://ucsc-cgl.atlassian.net/browse/SEAB-6225
https://ucsc-cgl.atlassian.net/browse/SEAB-4825

Security and Privacy

No unusual concerns.


svonworl commented 3 months ago

A couple of thoughts before my 3-day weekend. :)

  • How will this be invoked? Via Uptime Robot as had been discussed in Slack?
  • Any concerns about false positives on the ES check? With lags in indexing and multiple containers, it seems possible that the DB counts and ES counts could temporarily be out of sync, though they would sync eventually. I know we typically don't have enough publishing activity for this to be an issue, but maybe it could be some day (or there's a .dockstore.yml that publishes/unpublishes 32 workflows). Maybe that's why you have the 4 threshold in the other PR?

See description of https://github.com/dockstore/dockstore-deploy/pull/762

codecov[bot] commented 3 months ago

Codecov Report

Attention: Patch coverage is 91.11111%, with 4 lines in your changes missing coverage. Please review.

Project coverage is 74.52%. Comparing base (aaaf076) to head (f3f570c). Report is 2 commits behind head on develop.

| Files | Patch % | Lines |
|---|---|---|
| ...resources/ElasticsearchConsistencyHealthCheck.java | 88.46% | 1 Missing and 2 partials :warning: |
| ...webservice/resources/LiquibaseLockHealthCheck.java | 90.90% | 0 Missing and 1 partial :warning: |
Additional details and impacted files

```diff
@@             Coverage Diff              @@
##            develop    #5843      +/-   ##
=============================================
+ Coverage     74.46%   74.52%   +0.06%
- Complexity     5248     5260      +12
=============================================
  Files           366      368       +2
  Lines         18975    19018      +43
  Branches       2021     2025       +4
=============================================
+ Hits          14130    14174      +44
+ Misses         3888     3883       -5
- Partials        957      961       +4
```

| [Flag](https://app.codecov.io/gh/dockstore/dockstore/pull/5843/flags?src=pr&el=flags&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=dockstore) | Coverage Δ | |
|---|---|---|
| [bitbuckettests](https://app.codecov.io/gh/dockstore/dockstore/pull/5843/flags?src=pr&el=flag&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=dockstore) | `27.10% <35.55%> (+0.01%)` | :arrow_up: |
| [integrationtests](https://app.codecov.io/gh/dockstore/dockstore/pull/5843/flags?src=pr&el=flag&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=dockstore) | `58.49% <91.11%> (+0.09%)` | :arrow_up: |
| [languageparsingtests](https://app.codecov.io/gh/dockstore/dockstore/pull/5843/flags?src=pr&el=flag&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=dockstore) | `11.00% <35.55%> (+0.05%)` | :arrow_up: |
| [localstacktests](https://app.codecov.io/gh/dockstore/dockstore/pull/5843/flags?src=pr&el=flag&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=dockstore) | `21.55% <35.55%> (+0.03%)` | :arrow_up: |
| [toolintegrationtests](https://app.codecov.io/gh/dockstore/dockstore/pull/5843/flags?src=pr&el=flag&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=dockstore) | `30.46% <35.55%> (+0.01%)` | :arrow_up: |
| [unit-tests_and_non-confidential-tests](https://app.codecov.io/gh/dockstore/dockstore/pull/5843/flags?src=pr&el=flag&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=dockstore) | `28.90% <35.55%> (+0.01%)` | :arrow_up: |
| [workflowintegrationtests](https://app.codecov.io/gh/dockstore/dockstore/pull/5843/flags?src=pr&el=flag&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=dockstore) | `38.70% <35.55%> (-0.01%)` | :arrow_down: |

Flags with carried forward coverage won't be shown. [Click here](https://docs.codecov.io/docs/carryforward-flags?utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=dockstore#carryforward-flags-in-the-pull-request-comment) to find out more.


sonarcloud[bot] commented 3 months ago

Quality Gate passed

Issues
4 New issues
0 Accepted issues

Measures
0 Security Hotspots
91.2% Coverage on New Code
0.0% Duplication on New Code

See analysis details on SonarCloud