CardanoSolutions / ogmios

❇️ A WebSocket JSON/RPC bridge for Cardano
https://ogmios.dev
Mozilla Public License 2.0

[IDEA] - Add a health metrics if the node-socket is reachable at all #154

Closed gitmachtl closed 2 years ago

gitmachtl commented 2 years ago

Describe your idea, in simple words.

Running for example node 1.33.0 in P2P mode with

"DiffusionMode": "InitiatorOnly",

in the config will not create a local listening port anymore. So we can't use cardanoPing/cncli to check if the node is alive.

If such a node stops working or is shut down, there is currently no flag for that in the Ogmios health check:

curl -s http://127.0.0.1:1337/health | jq
{
  "startTime": "2021-12-25T10:16:39.579348019Z",
  "lastKnownTip": {
    "slot": 48861271,
    "hash": "333b265fc2a34f230f0f7a579e76fb0c841be11549832f290e77822cbbe0fec2",
    "blockNo": 6672097
  },
  "lastTipUpdate": "2021-12-25T10:19:22.281947735Z",
  "networkSynchronization": 1,
  "currentEra": "Alonzo",
  "metrics": {
    "activeConnections": 0,
    "totalConnections": 0,
    "totalUnrouted": 0,
    "sessionDurations": {
      "mean": 0,
      "min": 0,
      "max": 0
    },
    "runtimeStats": {
      "currentHeapSize": 209,
      "gcCpuTime": 1240003707,
      "cpuTime": 1722094732,
      "maxHeapSize": 325
    },
    "totalMessages": 0
  }
}

That's a sample output after the node was shut down.

So, using the health metrics, there is currently only one way to see whether the node is really OK: compare lastKnownTip with the theoretical tip calculated from the genesis files, and apply a threshold to detect when it falls too far behind.
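The comparison described above can be sketched in plain shell arithmetic. This is only an illustration: the two constants are assumed mainnet values (first Shelley slot and its POSIX time, one slot per second afterwards) and would have to be replaced for any other network; the function names `expected_slot` and `tip_is_stale` are made up for this example.

```shell
#!/bin/sh
# Assumed mainnet constants: first Shelley slot and the POSIX time at which
# it started. From there on, one slot lasts one second.
SHELLEY_START_SLOT=4492800
SHELLEY_START_TIME=1596059091

# expected_slot NOW_UNIX
# Prints the slot the chain should have reached at the given POSIX time.
expected_slot() {
  echo $(( SHELLEY_START_SLOT + $1 - SHELLEY_START_TIME ))
}

# tip_is_stale TIP_SLOT NOW_UNIX MAX_LAG
# Succeeds (exit 0) when the tip lags the expected slot by more than MAX_LAG.
tip_is_stale() {
  [ $(( $(expected_slot "$2") - $1 )) -gt "$3" ]
}
```

With these constants, the sample output above is consistent: slot 48861271 corresponds to 2021-12-25T10:19:22Z, the reported lastTipUpdate. A check could then look like `tip_is_stale "$(curl -s http://127.0.0.1:1337/health | jq .lastKnownTip.slot)" "$(date +%s)" 300`, where the 300-slot threshold is an arbitrary choice.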

The error log shows a warning like:

{"severity":"Warning","timestamp":"2021-12-25T10:37:23.904043804Z","thread":"7","message":{"Health":{"tag":"HealthFailedToConnect","socket":"/home/.../db/node.socket","retryingIn":5}},"version":"v5.0.0"}

"networkSynchronization" also stays at 1 (= 100%).

Why is it a good idea?

It would be nice to have a flag that can show if the current connection to the node via the node socket is ok or not. We get error outputs in the logs, but not on the health check here.

KtorZ commented 2 years ago

Good point. Note that the last known tip also comes with a UTC timestamp so, in principle, this is "enough" to know if it's starting to drift, albeit not practical.

It's also unfortunate that the network synchronization is only updated on every new tip. While simple, it means that the value is only refreshed when the connection is up. Perhaps having a background thread to create artificial ticks would be better here.

gitmachtl commented 2 years ago

Would it be possible to set "networkSynchronization": null if there is no socket connection to the node? This would also handle the start-up condition where Ogmios is started before the node: reporting a networkSynchronization of 0% in that case is not 100% correct. Reporting null would cover it, because "we don't know" the value in that state.

redoracle commented 2 years ago

what about implementing it in the docker images of ogmios as healthcheck.sh script?

Currently neither curl nor jq is installed on the docker image.

KtorZ commented 2 years ago

@redoracle -> implementing what exactly in the docker image :thinking: ?

redoracle commented 2 years ago

I meant implementing a healthcheck.sh script, as docker images usually do, in order to verify that the container is running properly; otherwise the healthcheck script will trigger a container restart.

by using this command : curl -s http://127.0.0.1:1337/health | jq I guess it is possible to verify some of the metrics to understand if the ogmios container is running properly.

Alternatively I can create one and map it inside the container, but at least I need preinstalled: curl and jq, in order to make it work.

attached here an example of a container with health-check and one without.

[Screenshot: a container with a health check and one without]

KtorZ commented 2 years ago

Seems like this can work nicely with just wget as follows:

HEALTHCHECK --interval=10s --timeout=5s --retries=1 CMD \
  [ connected == $(wget http://localhost:1337 | sed 's/.*"connectionStatus":"\([a-z]\+\)".*/\1/') ]

KtorZ commented 2 years ago

Note: I've started re-working the docker images recently to avoid having to maintain two build systems. The new images are based on the Nix build and make heavy use of the caching:

#  This Source Code Form is subject to the terms of the Mozilla Public
#  License, v. 2.0. If a copy of the MPL was not distributed with this
#  file, You can obtain one at http://mozilla.org/MPL/2.0/.

#                                                                              #
# ------------------------------- SETUP  ------------------------------------- #
#                                                                              #

FROM nixos/nix:2.3.11 as build

RUN echo "substituters = https://cache.nixos.org https://hydra.iohk.io" >> /etc/nix/nix.conf &&\
    echo "trusted-public-keys = cache.nixos.org-1:6NCHdD59X431o0gWypbMrAURkbJ16ZPMQFGspcDShjY= hydra.iohk.io:f/Ea+s+dFdN+3Y/G+FDgSq+a5NEWhJGzdjvKNGv0/EQ=" >> /etc/nix/nix.conf

WORKDIR /app
RUN nix-shell -p git --command "git clone --depth 1 https://github.com/input-output-hk/cardano-configurations.git"

WORKDIR /app/ogmios
RUN nix-env -iA cachix -f https://cachix.org/api/v1/install && cachix use cardano-ogmios
COPY . .
RUN nix-build -A ogmios.components.exes.ogmios -o dist
RUN cp -r dist/* . && chmod +w dist/bin && chmod +x dist/bin/ogmios

#                                                                              #
# --------------------------- BUILD (ogmios) --------------------------------- #
#                                                                              #

FROM busybox as ogmios

ARG NETWORK=mainnet

LABEL name=ogmios
LABEL description="A JSON WebSocket bridge for cardano-node."

COPY --from=build /app/ogmios/bin/ogmios /bin/ogmios
COPY --from=build /app/cardano-configurations/network/${NETWORK} /config

EXPOSE 1337/tcp
STOPSIGNAL SIGINT
HEALTHCHECK --interval=10s --timeout=5s --retries=1 CMD \
  [ connected == $(wget http://localhost:1337 | sed 's/.*"connectionStatus":"\([a-z]\+\)".*/\1/') ]
ENTRYPOINT ["/bin/ogmios"]

#                                                                              #
# --------------------- RUN (cardano-node & ogmios) -------------------------- #
#                                                                              #

FROM inputoutput/cardano-node:1.31.0 as cardano-node-ogmios

ARG NETWORK=mainnet

SHELL ["/bin/bash", "-o", "pipefail", "-c"]

LABEL name=cardano-node-ogmios
LABEL description="A JSON WebSocket bridge for cardano-node w/ a cardano-node."

COPY --from=build /app/ogmios/bin/ogmios /bin/ogmios
COPY --from=build /app/cardano-configurations/network/${NETWORK} /config

RUN mkdir -p /ipc

WORKDIR /root
COPY scripts/cardano-node-ogmios.sh cardano-node-ogmios.sh
# Ogmios, cardano-node, ekg, prometheus
EXPOSE 1337/tcp 3000/tcp 12788/tcp 12798/tcp
STOPSIGNAL SIGINT
HEALTHCHECK --interval=10s --timeout=5s --retries=1 CMD \
  [ connected == $(wget http://localhost:1337 | sed 's/.*"connectionStatus":"\([a-z]\+\)".*/\1/') ]
CMD ["bash", "cardano-node-ogmios.sh" ]

Still work-in-progress however, as the cardano-node-ogmios image isn't working properly (I need to overwrite the entrypoint of the image with the script doing the basic process monitoring).

redoracle commented 2 years ago

Seems like this can work nicely with just wget as follows:

HEALTHCHECK --interval=10s --timeout=5s --retries=1 CMD \
  [ connected == $(wget http://localhost:1337 | sed 's/.*"connectionStatus":"\([a-z]\+\)".*/\1/') ]

that's nice too, but still wget is missing as preinstalled package. while sed is there.

redoracle commented 2 years ago

# Ogmios, cardano-node, ekg, prometheus
EXPOSE 1337/tcp 3000/tcp 12788/tcp 12798/tcp

Do you really need to expose all those ports if only used internally? normally the internal process will open those ports internally anyway, and if needed those can be mapped with "-p" to the public host interface.

BTW very good point migrating to nix, I like it very much.

redoracle commented 2 years ago

wget http://localhost:1337 | sed 's/.*"connectionStatus":"\([a-z]\+\)".*/\1/'

root@973ea926352e:/# wget http://localhost:1337 | sed 's/.*"connectionStatus":"\([a-z]\+\)".*/\1/'
--2022-01-02 12:55:28--  http://localhost:1337/
Resolving localhost (localhost)... 127.0.0.1, ::1
Connecting to localhost (localhost)|127.0.0.1|:1337... connected.
HTTP request sent, awaiting response... 200 OK
Length: unspecified [text/html]
Saving to: 'index.html.20'

index.html.20 [ <=> ] 7.63K --.-KB/s in 0s

2022-01-02 12:55:28 (1.01 GB/s) - 'index.html.20' saved [7811]

not sure wget does the same of curl... or am I missing some other option?

The following returns the value that tells us that Ogmios is in sync, right?

root@973ea926352e:/# curl -s http://127.0.0.1:1337/health | jq .networkSynchronization
1

which I presume implies that it is connected.

root@973ea926352e:/# curl -s http://127.0.0.1:1337/health | jq .connectionStatus
"connected"

KtorZ commented 2 years ago

that's nice too, but still wget is missing as preinstalled package. while sed is there.

Even on the new images with Nix, that is, on top of BusyBox? I thought wget was available in BusyBox ... :thinking:

Do you really need to expose all those ports if only used internally?

Those aren't internal though, except maybe 3000/tcp. EKG and Prometheus are used for metrics, and Ogmios is used for local clients.

not sure wget does the same of curl... or am I missing some other option?

Ah! My mistake... We need to hit the health endpoint here! So http://localhost:1337/health !!

redoracle commented 2 years ago

So http://localhost:1337/health !!

ok, but wget keeps saving the file instead of printing it, so I need an additional step to retrieve from the saved file the particular metric that says the node is connected and in sync, right?

what about this? wget -qO- http://localhost:1337/health | sed 's/.*\"connectionStatus\":\"//g' | sed 's/connected\"}/1/g'

redoracle commented 2 years ago

for now I got it working with a healthcheck.sh mapped inside the container as follows:

if ! command -v wget; then apt update && apt -y install wget; fi

result=$(wget -qO- http://localhost:1337/health | sed 's/.*\"connectionStatus\"\:\"//g' | sed 's/connected\"}/0/g')

if [ "$result" != 0 ]; then exit 1; fi

I guess with the NIX version it wouldn't work though :)
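The two chained sed calls above only produce 0 when connectionStatus happens to be the last field of the response. A variant that does not depend on field order might look like the sketch below. It needs only sed, which is already on the image; the helper names `connection_status` and `is_connected` are invented for this example, and the check assumes the status string is reported as "connected", as in the curl output earlier in the thread.

```shell
#!/bin/sh
# Read a JSON document on stdin and print the value of "connectionStatus".
# Prints nothing when the field is absent or the input is empty.
connection_status() {
  sed -n 's/.*"connectionStatus"[[:space:]]*:[[:space:]]*"\([a-z]*\)".*/\1/p'
}

# Succeeds (exit 0) only when the reported status is exactly "connected".
# An empty response (for example, when the server is down) counts as a failure.
is_connected() {
  [ "$(connection_status)" = "connected" ]
}
```

The health check then reduces to `wget -qO- http://localhost:1337/health | is_connected || exit 1`.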

KtorZ commented 2 years ago

I figured that a nicer way to do all this would be to have a proper health-check command in Ogmios to begin with, so I implemented:

$ ogmios health-check --help
Handy command to check whether an Ogmios server is up-and-running, and correctly connected to a Network / cardano-node.

This can, for example, be wired to Docker's HEALTHCHECK feature easily.

Usage: ogmios health-check [--port TCP/PORT]
  Performs a health check against a running server.

Available options:
  -h,--help                Show this help text
  --port TCP/PORT          Port to listen on. (default: 1337)

(see 62691fbbbd65fa9b0b5949819515674c9a8c3575)

It exits with 0 or 1, depending on whether it could perform a health check on a running server. Dead-simple to configure the HEALTHCHECK in the Dockerfile with that:

HEALTHCHECK --interval=10s --timeout=5s --retries=1 CMD /bin/ogmios health-check

redoracle commented 2 years ago

That's very thoughtful and very nice!!

Well done! Tnx