ergoplatform / oracle-core

Core off-chain component of Oracle Pools
Apache License 2.0

Health Endpoint #269

Closed. reqlez closed this issue 1 year ago

reqlez commented 1 year ago

The oracle needs an endpoint where somebody can check its health. I propose /isHealthy for the endpoint name. Please suggest others if there is a better standard to use.

The health of the oracle is good to know for different applications, for example:

Currently, Ergo Node does not have a Health endpoint either. I also made a comment about a similar endpoint for the node here: https://github.com/ergoplatform/ergo/issues/1755#issuecomment-1436074163

Some things to discuss:

  1. Should the oracle check the node it connects to? If several nodes are present (I saw a new issue opened for multi-node support), should it check all of them and return healthy if at least one is okay? Fanta mentioned some Kung Fu regex to check this:

"GET /info and if /lH.(\d+),[\s\S]P.*\1,/gm matches, fullHeight is equal to maxPeerHeight, which would mean the node is synced"

However, I don't think requiring the heights to be EQUAL is the way to go here. I would allow the node to be at least 1 block behind and still be considered synced. In addition, another comparison would have to be implemented, for example "lastIncomingMessage": 1656365850283 vs. "currentSystemTime": 1656365854007, to make sure the node is actually communicating with the network.
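To make the two comparisons concrete, here is a minimal Rust sketch of what such a check could look like on the oracle side. It is not oracle-core code: the `NodeInfo` struct, the function names, and both thresholds are hypothetical, and it assumes the `/info` response has already been parsed and that the height fields may be missing while the node is still syncing. Only the JSON key names come from this thread.

```rust
/// Hypothetical, already-parsed subset of the node's `GET /info` response.
/// Fields mirror the JSON keys quoted above (fullHeight, maxPeerHeight,
/// lastIncomingMessage, currentSystemTime); the struct itself is an assumption.
#[derive(Debug, Clone, Copy)]
pub struct NodeInfo {
    /// Assumed to be absent while the node has no full blocks yet.
    pub full_height: Option<u64>,
    /// Assumed to be absent while the node has no peers.
    pub max_peer_height: Option<u64>,
    /// Milliseconds since the Unix epoch.
    pub last_incoming_message_ms: u64,
    /// Milliseconds since the Unix epoch.
    pub current_system_time_ms: u64,
}

/// Returns true if the node is at most `max_blocks_behind` blocks behind its
/// peers AND has heard from the network within the last `max_silence_ms`.
/// Both thresholds are illustrative parameters, not values agreed in this issue.
pub fn node_is_synced(info: &NodeInfo, max_blocks_behind: u64, max_silence_ms: u64) -> bool {
    let (full, peer) = match (info.full_height, info.max_peer_height) {
        (Some(full), Some(peer)) => (full, peer),
        // Without both heights we cannot tell, so report "not synced".
        _ => return false,
    };
    // saturating_sub avoids underflow if the node is ahead of its peers.
    let blocks_behind = peer.saturating_sub(full);
    let silence_ms = info
        .current_system_time_ms
        .saturating_sub(info.last_incoming_message_ms);
    blocks_behind <= max_blocks_behind && silence_ms <= max_silence_ms
}

/// With several nodes configured, report healthy if at least one is synced.
pub fn any_node_synced(nodes: &[NodeInfo], max_blocks_behind: u64, max_silence_ms: u64) -> bool {
    nodes
        .iter()
        .any(|info| node_is_synced(info, max_blocks_behind, max_silence_ms))
}
```

With the timestamps quoted above, the node last heard from the network 3724 ms ago, so it would pass with a 60-second silence limit and fail with a 1-second one.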

IMO, the best way to solve the above is to add a Health endpoint to Ergo Node (addressing the issue linked above), so that the Oracle can just check that endpoint instead of adding a bunch of extra code to the Oracle.

  2. Otherwise, assuming the Ergo node is healthy, what would trigger an unhealthy state? Some examples:
    • Has the oracle posted a datapoint in the last X amount of time? What should X be?
    • Is there another indicator that would catch the oracle being down more quickly than the above? Maybe look at the last error message thrown?

Any other ideas?

greenhat commented 1 year ago

I agree that implementing complex logic to determine a node's health on the oracle side is not ideal. Instead, the node should provide a simple endpoint indicating its status as either green or red.

The ultimate metric for determining oracle's health is whether it has posted a data point within the last X blocks. However, we must first define what "posted" means. While the Oracle sends transactions containing data points, it does not know if these transactions are included in the chain.

To ensure complete accuracy, we need to verify that transactions with data point boxes are indeed included in the chain. Internally, we have our last posted data point box from the node wallet scan.

So the remaining question is how to determine X. Ideally, we want the last data point to be no older than the last epoch, but realistically we should probably leave some leeway (2 epochs old?) to weather mempool hiccups, etc.
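As a rough illustration of the "no older than X epochs" rule, the check could be sketched in Rust as below. This is only a sketch under assumptions: the `OracleHealth` enum and `oracle_health` function are hypothetical names, not existing oracle-core API, and the heights are assumed to be available already (the current chain height from the node, and the creation height of the last data point box from the wallet scan).

```rust
/// Hypothetical health verdict; not a type from oracle-core.
#[derive(Debug, PartialEq, Eq)]
pub enum OracleHealth {
    Healthy,
    /// The last on-chain data point is older than the allowed window.
    Unhealthy { blocks_since_last_datapoint: u64 },
    /// The wallet scan has no data point box for this oracle.
    NoDatapointFound,
}

/// Decides health from how many blocks ago the last data point box was
/// included in the chain. `max_epochs_old` is the leeway X discussed above
/// (e.g. 2); `epoch_length` is the pool's epoch length in blocks.
pub fn oracle_health(
    current_height: u64,
    last_datapoint_height: Option<u64>,
    epoch_length: u64,
    max_epochs_old: u64,
) -> OracleHealth {
    let posted_at = match last_datapoint_height {
        Some(height) => height,
        None => return OracleHealth::NoDatapointFound,
    };
    // saturating arithmetic guards against underflow and overflow.
    let age = current_height.saturating_sub(posted_at);
    let max_age = epoch_length.saturating_mul(max_epochs_old);
    if age <= max_age {
        OracleHealth::Healthy
    } else {
        OracleHealth::Unhealthy {
            blocks_since_last_datapoint: age,
        }
    }
}
```

For example, with a hypothetical epoch length of 30 blocks and a leeway of 2 epochs, a data point included 50 blocks ago counts as healthy, while one included 100 blocks ago does not. Because the input is the height of a box found on chain, this only counts data points that were actually included, not transactions that were merely sent.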

greenhat commented 1 year ago

ref #53