dotCMS / core

Headless/Hybrid Content Management System for Enterprises
http://dotcms.com

'alive' and 'startup' healthcheck APIs return 503 on seemingly healthy app #28163

Closed yolabingo closed 2 months ago

yolabingo commented 6 months ago

Note: this is rather critical, as the current healthcheck config for k8s spams the dotcms log.

Problem Statement

We switched most cloud servers to use /api/v1/probes/startup and /api/v1/probes/alive as healthcheck endpoints. We observed that on some servers, these endpoints continually returned HTTP 503, even after the application was healthy and responsive. We also saw the following in the logs:

WARN  monitor.MonitorHelper - Cache is failing: null

This impacted only a handful of environments, but on those the problem persisted. Anecdotally, we saw it on 23.01 only. We reverted the healthchecks to other API endpoints to work around the issue.

Steps to Reproduce

unknown

Acceptance Criteria

The healthcheck APIs accurately report the application's health.

dotCMS Version

We saw it only on 23.01; not sure whether it impacts other versions.

Proposed Objective

Reliability

Proposed Priority

Priority 2 - Important

External Links... Slack Conversations, Support Tickets, Figma Designs, etc.

https://dotcms.slack.com/archives/G5VQBQ4H0/p1712369356322479

Assumptions & Initiation Needs

No response

Quality Assurance Notes & Workarounds

No response

Sub-Tasks & Estimates

No response

wezell commented 4 months ago

Remove multithreading from system probes

The system probe code is prematurely optimized: it was trying to use circuit breakers and thread pools when building the dotCMS "am I alive" responses.

We just need to return whether the system is up and working, without any of the multithreaded craziness.
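
To illustrate the idea only, here is a rough Python sketch of a probe that runs its checks one after another on the calling thread, with no thread pool or circuit breaker. This is not the dotCMS implementation (which is Java); `run_probe` and the check names are hypothetical, and the 200/503 mapping is assumed from the status codes mentioned in this issue.

```python
from typing import Callable, Dict, Tuple


def run_probe(checks: Dict[str, Callable[[], bool]]) -> Tuple[int, Dict[str, bool]]:
    """Run each check in order on the calling thread; return 200 only if all pass.

    `checks` maps a hypothetical subsystem name (e.g. "cache") to a
    zero-argument callable that returns True when that subsystem is healthy.
    """
    results: Dict[str, bool] = {}
    for name, check in checks.items():
        try:
            results[name] = bool(check())
        except Exception:
            # A check that raises counts as a failure instead of crashing the probe.
            results[name] = False
    status = 200 if all(results.values()) else 503
    return status, results
```

For example, `run_probe({"cache": cache_ok, "db": db_ok})` (both callables hypothetical) would yield 503 as soon as either check returns False or raises.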

dsilvam commented 2 months ago

PR: https://github.com/dotCMS/core/pull/29026

dsilvam commented 2 months ago

@bryanboza @josemejias11 the feedback/QA for this one will come from cloud-eng. cc: @yolabingo

yolabingo commented 2 months ago

We discussed adding a Postman check of the happy path here.

These endpoints are used to confirm the health of dotCMS containers using curl or k8s httpGet. The health checks send a plain GET request to these endpoints. The health check confirms the HTTP response code is 200.
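
For reference, the check described above amounts to something like the following Python sketch, using only the standard library. The function name `probe_is_healthy` and the 5-second default timeout are assumptions for illustration; the actual checks use curl or the k8s httpGet probe.

```python
import urllib.error
import urllib.request


def probe_is_healthy(url: str, timeout: float = 5.0) -> bool:
    """Send a plain GET and report whether the response status is exactly 200."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status == 200
    except urllib.error.HTTPError:
        # urllib raises HTTPError for error statuses such as 503.
        return False
    except (urllib.error.URLError, OSError):
        # Connection refused, DNS failure, or timeout: treat as unhealthy.
        return False
```

Calling `probe_is_healthy("http://localhost:8080/api/v1/probes/alive")` against a local instance would return True only when the endpoint answers with HTTP 200.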

yolabingo commented 2 months ago

Filing a new issue to add the Postman tests

jcastro-dotcms commented 2 months ago

INTERNAL QA: PASSED

The status for the http://localhost:8080/api/v1/probes/alive and http://localhost:8080/api/v1/probes/startup endpoints is returned as expected.

bryanboza commented 2 months ago

Fixed, tested on trunk // Postman

Postman test added on card: #29267