OpenLiberty / ci.docker

Eclipse Public License 1.0
44 stars 59 forks source link

docker "defaultServer" passes health checks, when application is still starting up #567

Open lancechant opened 3 months ago

lancechant commented 3 months ago

Hi

We have a server on an instance that has limited resources so it takes a bit to start up, and we have noticed we get healthchecks that "pass" and say the server is up, when in fact it's not our server, and it comes up later.

The docker image passes health checks here: [AUDIT ] CWWKF0012I: The server installed the following features: [appSecurity-5.0, beanValidation-3.0, cdi-4.0, connectors-2.1, distributedMap-1.0, enterpriseBeansLite-4.0, enterpriseBeansPersistentTimer-4.0, expressionLanguage-5.0, jdbc-4.3, jndi-1.0, json-1.0, jsonb-3.0, jsonp-2.1, jwt-1.0, localConnector-1.0, mdb-4.0, messaging-3.1, microProfile-6.1, monitor-1.0, mpConfig-3.1, mpFaultTolerance-4.0, mpHealth-4.0, mpJwt-2.1, mpMetrics-5.1, mpOpenAPI-3.1, mpRestClient-3.0, mpTelemetry-1.1, restfulWS-3.1, restfulWSClient-3.1, servlet-6.0, ssl-1.0, transportSecurity-1.0]. [AUDIT ] CWWKF0011I: The defaultServer server is ready to run a smarter planet. The defaultServer server started in 43.494 seconds

But we have a log line a few lines below that showing the application is not started yet [AUDIT ] CWWKZ0022W: Application test-app has not started in 30.000 seconds.

At this point the docker image passes health checks while the server is still starting up, we also have custom health checks in the server, but those are not triggered at this point, as the only response we get is status UP, then it will go to down status and back to showing us our custom health check responses.

Is this intended behaviour? and what can we do to get around this?

leochr commented 3 months ago

Hi @lancechant,

Thank you for reaching out.

Please provide the below information to help with our investigation:

  1. Are you encountering this while building your application container image? or while running the built image?
  2. How are the containers deployed? If it's stand-alone/manually then are you using Docker Healthcheck? or if they are deployed on Kubernetes/OpenShift then are you defining any probes (readiness, liveness or startup)? Please provide the details of healthcheck or probes
  3. Are you using/referring to the status returned by the MicroProfile Health endpoint (/health)?
  4. Is this intermittent? or do you constantly observe the behaviour?
lancechant commented 3 months ago

Hi @leochr here are the responses:

  1. Running the built image
  2. They are standalone, each are run as separate containers on the system, also I've noticed it happens during our builds as well during intergration tests, but the docker container is built and running as you would normally with a docker run.
  3. Yes the status of the microProfile health check, all the endpoints seem to be giving it, health, started, ready.
  4. It seems pretty consistent, manually running the container locally with simulared limited resources gave the same result everytime
leochr commented 3 months ago

@lancechant Thank you for the additional details.

Could you please provide the server log? We are interested to see the events leading up to the startup of the Liberty server and the application including the various endpoints. You can trim out the rest.

Are you defining custom health checks by implementing the interface org.eclipse.microprofile.health.HealthCheck from MicroProfile Health? If so, could you please provide details of those (Readiness, Liveness, Startup)? Are they part of the same application?

lancechant commented 2 months ago

docker-logs.txt

Hi @leochr Attached are the logs from the docker container, tried to leave as much there as possible, and just removed some information, I hope it helps.

Yes we are, and it's a dependency from another lib we have that is brought in with Maven, the health check is annotated with the Liveness and readiness annotations. And they are just checking for example connections to the database and MQ are live and working well

lancechant commented 2 months ago

Hey @leochr just checking if you've made any headway with this?

leochr commented 1 month ago

@lancechant Sorry for the delayed response. I missed the notification for your inquiry.

I investigated the logs but I couldn't find anything that could explain the.

Please share more details about the health check. Thank you.

  1. Is the docker run command used with --health-cmd ? Could you please provide the full command (in particular the command options for starting container and health check)?
  2. Is any startup delay defined to account for the long time it takes for the application to start? For example, --health-start-period allows to delay the initial health check, so that health checks are not performed pre-maturely.
  3. What's the output of the health check when it reports up/ready vs down/not-ready?
  4. Noticed the following in the logs. Are they part of the health check? Are they part of the MicroProfile Health definition? or something else?
    06:22:50.845   [Default Executor-thread-17] DEBUG z.c.s.t.a.h.b.JmsBrokerHealthCheck - Scanning for supported message brokers...
    06:22:50.847   [Default Executor-thread-17] DEBUG z.c.s.t.a.h.b.JmsBrokerHealthCheck - Found the following JMS broker endpoints: [active-mq-broker]
    06:22:50.854   [Default Executor-thread-37] DEBUG z.c.s.t.a.h.b.DefaultHealthCheckSwitch - Using the default enabler
    06:22:50.854   [Default Executor-thread-2] DEBUG z.c.s.t.a.h.b.DefaultHealthCheckSwitch - Using the default enabler
    06:22:50.854   [Default Executor-thread-17] DEBUG z.c.s.t.a.h.b.DefaultHealthCheckSwitch - Using the default enabler
    06:22:50.854   [Default Executor-thread-39] DEBUG z.c.s.t.a.h.b.DefaultHealthCheckSwitch - Using the default enabler
    06:22:50.945   [Default Executor-thread-39] DEBUG z.c.s.t.p.p.j.c.DBController - Executing a JPortal function within za.co.sbg.App.message.obc.entity.persistence.ObcToggles
    06:22:50.945   [Default Executor-thread-17] DEBUG z.c.s.t.p.p.j.c.DBController - Executing a JPortal function within za.co.sbg.App.message.obc.entity.persistence.ObcToggles
    06:22:50.949   [Default Executor-thread-17] DEBUG z.c.s.t.a.h.b.JmsBrokerHealthCheck - All the health check switches are in the on position
    06:22:50.949   [Default Executor-thread-39] DEBUG z.c.s.t.a.h.b.JmsBrokerHealthCheck - All the health check switches are in the on position
lancechant commented 1 month ago

Hi @leochr thanks for getting back, no worries I'm sure you have many things to work on.

  1. No it isn't, it's called separately via the server or AWS loadbalancer or via the integration tests on start up.
  2. There is a timeout period that will fail if it doesn't get a 200 within the time limit, but the thing is because it does get the 200 and then a 504 later causes the issue, especially during testing because it's marked as healthy and then tests start to fail because it is not actually up
  3. If I recall when the initial 200 is shown it just shows that it's the basic health checks and not the custom ones we had (I'll confirm again if needed, but I need to artificially slow down the docker image to ensure I get the correct outputs)
  4. Yes they are part of our custom health checks, they do implement the MicroProfile Health definitions. Below is one of the health checks we have
@Slf4j
@Liveness
@Readiness
@ApplicationScoped
public class JmsBrokerHealthCheck implements HealthCheck {

    @Inject
    private Configuration platformConfiguration;

    @Inject
    private DynamicResourceProvider dynamicResourceProvider;

    @Inject
    @JmsBrokerSwitches
    private Instance<HealthCheckSwitch> jmsBrokerHealthCheckSwitchInstances;//There should always be at least one of these.

    private Set<String> brokerJndiSet;

    @PostConstruct
    public void initialize() {

        log.debug("Scanning for supported message brokers...");

        brokerJndiSet = platformConfiguration.getAllMessageRoutes().stream()
                .filter(r -> r.getMessageTransport() == null || JMS.equalsIgnoreCase(r.getMessageTransport()))
                .map(Endpoint.Route::getEndpointName)
                .collect(Collectors.toSet());

        log.debug("Found the following JMS broker endpoints: {}", brokerJndiSet);
    }

    @Override
    public HealthCheckResponse call() {

        if (shouldCheck()) {
            return checkBrokers();
        } else {
            return HealthCheckResponse.named(getClass().getSimpleName())
                    .withData("Broker Check Off", "JMS health checks are disabled, Status forced to UP")
                    .up().build();
        }
    }

    private boolean shouldCheck() {
        Optional<HealthCheckSwitch> firstOffSwitch = jmsBrokerHealthCheckSwitchInstances.stream()
                .filter(healthCheckSwitch -> !healthCheckSwitch.shouldCheck())//Essentially ANDing them all together
                .findFirst();
        if (firstOffSwitch.isEmpty()) {
            log.debug("All the health check switches are in the on position");
            return true;
        } else {
            log.debug("At least one health check switch is in the off position");
            return false;
        }
    }

    private HealthCheckResponse checkBrokers() {
        HealthCheckResponseBuilder healthCheckResponseBuilder = HealthCheckResponse.named(getClass().getSimpleName());

        boolean allBrokersAreValid = true;

        for (String jmsJndi : brokerJndiSet) {
            try {

                ConnectionFactory connectionFactory =
                        dynamicResourceProvider.lookup(
                                JmsMessageProducer.CONNECTION_FACTORY_PREFIX + jmsJndi);
                try (var conn = connectionFactory.createConnection()) {

                    healthCheckResponseBuilder = healthCheckResponseBuilder
                            .withData(jmsJndi, "Successfully connected to broker");
                }

            } catch (Exception e) {
                allBrokersAreValid = false;
                healthCheckResponseBuilder = healthCheckResponseBuilder
                        .withData(jmsJndi,
                                "Connection to broker was unsuccessful: " + e.getMessage());
                log.error("Error connecting to broker "+jmsJndi,e);
            }
        }

        if (allBrokersAreValid) {
            return healthCheckResponseBuilder.up().build();
        }

        return healthCheckResponseBuilder.down().build();
    }

}
lancechant commented 2 weeks ago

Hey @leochr a small observation I noticed, and not to sure how the OL docker image works.. but I say it showed the defaultServer started, but then a few lines later says the actual application we copied over hasn't started yet, this is when the health check of just up shows, so is there something we missing in the docker container creation? below is the example docker file we have

FROM open-liberty:24.0.0.2-kernel-slim-java17-openj9
ENV KEYSTORE_REQUIRED=false VERBOSE=true
EXPOSE 8080 8543
COPY --chown=1001 dependencies /opt/ol/wlp/usr/
COPY --chown=1001 maven /config/
COPY --chown=1001 keystore /var/keystore/
RUN features.sh
RUN configure.sh