CLI - Node running-status for pool operators

erikd commented 4 years ago

Pool operators on the ITN use something that connects to the Jormungandr node and pulls out the following data:

lastBlockHeight
lastBlockHash
lastPoolID (leader for the last block)
lastParent (parent block for the last block)
lastSlot
lastEpoch
version (jormungandr full version info)

The new Haskell node should make this at least as easy. It should also add the block arrival time (with as good resolution as possible) to the above list.

One possibility would be a CLI command like:

cardano-cli shelley node running-status

that connects to the node and dumps the required output as JSON to stdout. Users would then be able to pipe that to a program to submit it to https://pooltool.io/ .
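As a rough sketch of the receiving end of that pipe: the command above is only a proposal, so the JSON shape here is an assumption. It takes the field names from the list above and adds a hypothetical `blockArrivalTime` key for the requested arrival time.

```python
import json
import sys

# Field names come from the list above. "blockArrivalTime" is a hypothetical
# key for the proposed block arrival time; no real schema exists yet.
REQUIRED_FIELDS = (
    "lastBlockHeight", "lastBlockHash", "lastPoolID", "lastParent",
    "lastSlot", "lastEpoch", "version", "blockArrivalTime",
)

def parse_status(text):
    """Parse one JSON status document and check that every field is present."""
    status = json.loads(text)
    missing = [name for name in REQUIRED_FIELDS if name not in status]
    if missing:
        raise ValueError("status is missing fields: " + ", ".join(missing))
    return status

def main():
    """Read the status JSON from stdin, as it would arrive through a pipe."""
    status = parse_status(sys.stdin.read())
    print(status["lastBlockHeight"], status["lastBlockHash"])
```

Calling `main()` at the bottom of a script would let it sit after the proposed command in a shell pipeline; submission to a third-party service is left out because its API is not described here.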

Another possibility is to just use a web socket in the node to serve this JSON.

Most importantly, this feature should be implemented as push from the node rather than require polling the node.

shawnim commented 4 years ago

Would be nice if the status information also contained whether the node was running as a leader or passive node.

papacarp commented 4 years ago

Thank you for putting this up Erik. Last block timestamp would be great as well so we can analyze block propagation worldwide. As in the receive timestamp (with as much resolution as possible).

erikd commented 4 years ago

Yes, I added that above :D. I have now made it bold.

gufmar commented 4 years ago

Here's a list of what we had in previous testnets

node resources

node uptime
used memory
nodesEstablished (number of remote TCP connections for block sync)
nodesEstablishedUnique (unique remote IP addresses)

blockchain

lastBlockDate
lastBlockHeight
lastBlockTx
nextBlockScheduled
lastBlockCreated
TxRecvCnt
currentTip

block production and propagation

productionInEpochCnt (node's count of produced and propagated blocks)

What would be important as an improvement in future testnets

All values (like memory and CPU usage) should come from exactly one node instance, even and especially when multiple instances run on the same machine.

Record certain values at 2-3 different frequencies. lastBlockDate and currentTip change every couple of seconds, and it's important to capture them as accurately as possible. In contrast, used-memory or currentStake can be recorded at 1- or 5-minute intervals.
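A minimal sketch of such tiered sampling, assuming a collector that polls the node. The interval values, the `usedMemory` key and the `read_metric` / `record` callbacks are all hypothetical; only the idea of a fast tier and a slow tier comes from the comment above.

```python
import time

# Hypothetical tiers: metric name -> sampling interval in seconds.
INTERVALS = {
    "lastBlockDate": 1,
    "currentTip": 1,
    "usedMemory": 60,
    "currentStake": 300,
}

def due_metrics(now, last_sampled, intervals=INTERVALS):
    """Return the metrics whose interval has elapsed since their last sample."""
    return [
        name for name, every in intervals.items()
        if now - last_sampled.get(name, float("-inf")) >= every
    ]

def run(read_metric, record, clock=time.monotonic, sleep=time.sleep):
    """Loop forever, sampling each metric at its own frequency."""
    last_sampled = {}
    while True:
        now = clock()
        for name in due_metrics(now, last_sampled):
            record(name, now, read_metric(name))
            last_sampled[name] = now
        sleep(1)
```

Keeping the schedule in one table makes it easy to move a metric between tiers without touching the loop.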

erikd commented 4 years ago

From @cardanians in #980 (closed):

It would be great if there could be something like cardano-cli node status with

node uptime

no. peers

leader(s) (array with leader node ids); if no leader loaded, then empty array

node IP

lastBlockHeight

lastBlockHash

lastParent

lastSlot

lastEpoch

version HCN

md5 checksum

json output available

This is probably useful as a one-shot CLI command as well as a running status available from the node via a websocket or something similar. I wonder, @cardanians, whether having the running status would subsume the need for one-shot CLI output?

@cardanians What is HCN and which md5 checksum do you mean?

rdlrt commented 4 years ago

I suppose HCN is his abbreviation for Haskell Cardano Node, as we have the jormungandr server version available in the output of node status (version number - commit ID). Not sure if the md5 he refers to is meant to verify subversion/code modification, or is the same as the commit ID.

cardanians commented 4 years ago

I'm planning a similar feature on https://adapools.org;

you're right about HCN (we had ITN, now in FF we have HTN, the Haskell testnet)

md5 of all other output values: md5([node uptime].[no. peers].implode("-", [leaders]).lastBlockHeight.lastBlockHash...)

It could protect against simple output modifications (in some cases: from operators who don't know the structure of this hash).
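To make the proposed checksum concrete, here is a small sketch that concatenates the values in a fixed order and hashes them, mirroring the expression above. The dictionary keys (`uptime`, `peers`, `leaders`) are hypothetical names for illustration, and only the first few fields are included.

```python
import hashlib

def status_checksum(status):
    """MD5 over the status values, concatenated in a fixed field order."""
    parts = [
        str(status["uptime"]),
        str(status["peers"]),
        "-".join(status["leaders"]),      # leaders joined with "-", as in implode
        str(status["lastBlockHeight"]),
        status["lastBlockHash"],
    ]
    return hashlib.md5("".join(parts).encode("utf-8")).hexdigest()
```

Note that the function needs no secret: anyone who knows the field order can recompute the digest after changing a value.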

erikd commented 4 years ago

Any changes an operator could want to make to these fields would be trivially easy to implement in the software itself before the md5sum is calculated. Since the node runs on other people's hardware under their direct control, there is no way to prevent this.

Since there is no feasible way to protect this data from an operator who wants to change it badly enough, adding an md5sum only prevents the most naive operators from modifying it. A bigger problem is that, if we implement this, people who do figure out how to get around it may be at an advantage in comparison to the honest operators.

I simply don't think it's worth it. A better option would be a social one. In the adapools web interface, flag any values you think have been tampered with.

cardanians commented 4 years ago

If it made even half of the fraud or misunderstanding harder, it might be worth it. But I admit it's not necessary, and a checksum absolutely isn't a "need to have" feature.

dcoutts commented 4 years ago

The block receive time is included in the logs, but it will not be retained in the chain info. The logs are where to get timing measurements from; that's how we do it for our benchmarking. They are structured logs, so it's not that hard.

papacarp commented 4 years ago

Regarding logs, I experimented with adding a socket interface to try to collect the data. This would be a more ideal method of collecting the info because it's pushed on each block, rather than polled. However, while I can get a Python script to listen to the socket using TraceForwarderBK, it has problems:

On initialization from a clean slate the socket works fine. If either the Python script or the node is stopped, it is unable to restart. Python, if restarted, will complain about the socket already being in use. The node, if restarted, simply won't reconnect to the socket and continue transmitting.

My workaround was to delete the socket and restart both whenever either needs restarting, which is not viable. I admit my time spent debugging it has been limited, to say the least.

Since the timing data will need to come from the logs anyway one idea would be to spend time creating a sample app to connect to the TraceForwarderBK (that fixes the issues I'm stuck on) rather than a new CLI command. CLI would be great in general, but the right solution is likely a socket push strategy longer term given that we need to monitor logs anyway.

Example:

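import socket as s

sock = s.socket(s.AF_UNIX)
# bind() raises "Address already in use" if this socket file is left over from an earlier run
sock.bind('/tmp/mikepipe2')
sock.listen(1)

try:
  while True:
    # wait for the node to connect
    sd, address = sock.accept()

    # read until the node closes the connection (recv returns b'')
    line = sd.recv(1024)
    while line:
      decodedline = line.decode('utf-8')
      #do stuff with data
      line = sd.recv(1024)

    sd.close()
finally:
  sock.close()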

Credit to iiLap for helping with this script
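On the Python side, the "already in use" error is what `bind()` raises when the socket file from a previous run still exists, so a listener can remove the stale file before binding. The sketch below shows that, plus buffering so that a line split across two `recv()` calls is reassembled. It assumes the forwarder sends newline-terminated text, which is not confirmed in this thread, and it does nothing about the node not reconnecting after its own restart. `make_listener`, `iter_lines` and `serve_forever` are illustrative names.

```python
import os
import socket

def make_listener(path):
    """Bind a Unix socket at `path`, removing a stale socket file first."""
    try:
        os.unlink(path)              # leftover from a previous run
    except FileNotFoundError:
        pass
    srv = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
    srv.bind(path)
    srv.listen(1)
    return srv

def iter_lines(conn, bufsize=1024):
    """Yield complete newline-terminated lines received on `conn` as str."""
    buf = b""
    while True:
        chunk = conn.recv(bufsize)
        if not chunk:                # peer closed the connection
            break
        buf += chunk
        while b"\n" in buf:
            raw, buf = buf.split(b"\n", 1)
            yield raw.decode("utf-8")
    if buf:                          # trailing data without a final newline
        yield buf.decode("utf-8")

def serve_forever(path, handle):
    """Accept connections one at a time and pass each line to `handle`."""
    srv = make_listener(path)
    try:
        while True:
            conn, _ = srv.accept()
            with conn:
                for line in iter_lines(conn):
                    handle(line)
    finally:
        srv.close()
        os.unlink(path)
```

For example, `serve_forever('/tmp/mikepipe2', print)` would echo every received line and go back to `accept()` whenever the sender disconnects.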

erikd commented 4 years ago

Extracting the needed info from the logs is a pain in the neck. Providing a web socket (which pushes data when it arrives) is pretty much the ultimate solution for this issue.

CodiePP commented 4 years ago

Some of this information is already traced or computed from traced data and shown in the TUI or forwarded to EKG/Prometheus. Maybe this effort can also take it from there.

erikd commented 4 years ago

The TUI is not useful for the intended use case. Pool operators need this accessible in a way that is automatable and machine readable. The TUI does not fit. EKG is great, but it still is not what pool operators actually need. Specifically, EKG would require polling, whereas pool operators would prefer this information to be pushed from the node.

mnaboka commented 4 years ago

+1 to this. I am looking for a way to get the number of currently connected peers :/ Ideally I'd love to call an endpoint to get this info.

dcoutts commented 4 years ago

"Extracting the needed info from the logs is a pain in the neck. Providing a web socket (which pushes data when it arrives) is pretty much the ultimate solution for this issue."

We have structured JSON logs. We have the TraceForwarderBK, which also pushes. Apparently as noted above this would need some more attention to work reliably, but that should cover exactly this case.

dcoutts commented 4 years ago

So we have a polling style CLI command to get the tip info, but if people want a more continuous interface, then perhaps they'd like to go lower level. There's a websockets proxy for the node's local chain sync client. https://github.com/KtorZ/cardano-ogmios

laplasz commented 4 years ago

I am also missing the status info of the node.

In jormungandr there is a CLI command to get stats, which also provides the status of the node.

jcli rest v0 node stats get -h <api_address>
...
peerTotalCnt: 10240
peerUnreachableCnt: 0
state: Running
...

It would be great to have info on when a node is bootstrapping or syncing.
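For scripting against output like the excerpt above, a small sketch that reads flat `key: value` lines and checks the `state` field. It only handles the flat lines shown here, and `Running` is the only state value that appears in this thread.

```python
def parse_stats(text):
    """Parse flat `key: value` lines into a dict, skipping anything else."""
    stats = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip():
            stats[key.strip()] = value.strip()
    return stats

def is_running(stats):
    """True when the reported state is exactly "Running"."""
    return stats.get("state") == "Running"
```

Values stay as strings, so a caller converts counters such as `peerTotalCnt` with `int()` where needed.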

savicsava commented 4 years ago

@erikd @kevinhammond @Jimbo4350 @dcoutts what is the status here for this one?

cardanians commented 4 years ago

still nothing?

erikd commented 4 years ago

Somehow this escaped our attention. We are on it now.

dcoutts commented 4 years ago

If we do this, I think it should be via the monitoring metrics. I think most of these things are already available via the metrics, so it should be just a matter of reviewing them and seeing if we need a few more.
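If the metrics are scraped in the Prometheus text exposition format (Prometheus forwarding is mentioned earlier in the thread), a consumer could parse them roughly as follows. The metric names in the usage note are placeholders rather than real node metrics, and the sketch only handles label-free samples.

```python
def parse_metrics(text):
    """Parse label-free Prometheus text exposition lines into a dict of floats."""
    metrics = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):   # blank line, HELP or TYPE comment
            continue
        parts = line.split()
        if len(parts) < 2:
            continue
        try:
            metrics[parts[0]] = float(parts[1])
        except ValueError:                     # not a numeric sample, skip it
            continue
    return metrics
```

Given a scrape body containing the lines `example_block_height 4512` and `example_connected_peers 7`, the result maps each name to its value as a float. This is still a polling approach, so it covers the data but not the push requirement discussed above.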

erikd commented 4 years ago

We are working on this, but it's tricky and non-trivial.

savicsava commented 3 years ago

What is the status here?

papacarp commented 3 years ago

From my perspective, this is no longer a priority. We ended up going down the route mentioned above and interfaced directly to the chain sync client through @AndrewWestberg's CNCLI so we could pull out everything we need.

IntersectMBO / cardano-node

CLI - Node running-status for pool operators #801