Open erikd opened 4 years ago
It would be nice if the status information also indicated whether the node is running as a leader or a passive node.
Thank you for putting this up Erik. Last block timestamp would be great as well, so we can analyze block propagation worldwide, i.e. the receive timestamp (with as much resolution as possible).
Yes, I added that above :D . I have now made it bold .
Here's a list of what we had in previous testnets:
- node resources
- blockchain
- block production and propagation
What would be important as improvements in future testnets:
- All values (like memory and CPU usage) should come from exactly one node instance, even and especially when multiple instances run on the same machine.
- Record certain values at 2-3 different frequencies. lastBlockDate and currentTip change every couple of seconds, and it's important to capture them as accurately as possible. In contrast, used-memory or currentStake can be recorded at 1- or 5-minute intervals.
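The two-frequency idea above could be sketched like this. The collector functions and interval values are illustrative assumptions, not part of any existing node API:

```python
import time

# Hypothetical collectors -- a real version would query the node's
# metrics endpoint (e.g. EKG/Prometheus). Names are illustrative.
def read_fast_metrics():
    return {"lastBlockDate": "...", "currentTip": "..."}

def read_slow_metrics():
    return {"usedMemory": 0, "currentStake": 0}

FAST_INTERVAL = 2    # seconds: tip-related values change constantly
SLOW_INTERVAL = 60   # seconds: resource/stake values drift slowly

def record(rounds):
    """Sample fast-changing values every round, and slow-changing
    values only when SLOW_INTERVAL has elapsed since the last sample."""
    samples = []
    last_slow = None
    for _ in range(rounds):
        now = time.monotonic()
        samples.append(("fast", read_fast_metrics()))
        if last_slow is None or now - last_slow >= SLOW_INTERVAL:
            samples.append(("slow", read_slow_metrics()))
            last_slow = now
        # a real loop would sleep here: time.sleep(FAST_INTERVAL)
    return samples
```

Keeping the two sample streams separate also makes it easy to retain the fast stream at full resolution while downsampling the slow one.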
From @cardanians in #980 (closed):
It would be great if there could be something like
cardano-cli node status
with
- node uptime
- no. peers
- leader(s) (array with leader node ids); if no leader is loaded, then an empty array
- node IP
- lastBlockHeight
- lastBlockHash
- lastParent
- lastSlot
- lastEpoch
- version HCN
- md5 checksum
- json output available
This is probably useful as a one-shot CLI command, as well as a running status available from the node via a websocket or something. I wonder, @cardanians, if having the running status would subsume the need for a one-shot CLI output?
@cardanians What is HCN and which md5 checksum do you mean?
I suppose HCN is his abbreviation for Haskell Cardano Node, since we have the jormungandr server version available in the output of node status (version number - commit ID). Not sure if the md5 referred to is actually to verify subversion/code modification, or the same as the commit ID.
I'm planning a similar feature to https://adapools.org;
you're right about HCN (we had the ITN, now in FF we have the HTN, the Haskell testnet).
The md5 would be over all the other output values: md5([node uptime].[no. peers].implode("-".[leaders]).lastBlockHeight.lastBlockHash........)
It could protect (in some cases, from operators who don't know the structure of this hash) against simple output modifications.
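The checksum scheme above could look something like this sketch. The exact field set, ordering, and separator are my assumptions based on the suggestion, not a fixed spec:

```python
import hashlib

def status_checksum(status):
    """md5 over the other output fields, joined with '.', following the
    layout sketched above; field names and order are hypothetical."""
    parts = [
        str(status["uptime"]),
        str(status["peers"]),
        "-".join(status["leaders"]),   # implode("-", [leaders])
        str(status["lastBlockHeight"]),
        status["lastBlockHash"],
    ]
    return hashlib.md5(".".join(parts).encode("utf-8")).hexdigest()
```

Any consumer that knows the field order can recompute the digest and detect a naive edit of a single field.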
Any changes an operator could want to make to these fields would be trivially easy to implement in the software itself before the md5sum is calculated. Since the node runs on other peoples hardware under their direct control there is no way to prevent this.
Since there is no feasible way to protect this data from an operator who wants to change it badly enough, adding an md5sum only prevents the most naive operators from modifying it. A bigger problem is that if we implement this, people who do figure out how to get around it may be at an advantage compared to honest operators.
I simply don't think it's worth it. A better option would be a social one: in the adapools web interface, flag any values you think have been tampered with.
If it prevented even half of the fraud or misunderstanding, it might be worth it. But I admit it's not necessary, and a checksum absolutely isn't a "need to have" feature.
The block receive time is included in the logs, but it will not be retained in the chain info. The logs are where to get timing measurements from; that's how we do it for our benchmarking. They are structured logs, so it's not that hard.
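Extracting receive times from structured logs could look roughly like this sketch. The "at", "data.kind", and "data.hash" field names are assumptions for illustration, not the node's actual log schema:

```python
import json

def block_receive_times(log_lines):
    """Map block hash -> receive timestamp from structured JSON log
    lines, skipping anything that isn't JSON or isn't a block-receipt
    event. Field names here are assumed, not the real schema."""
    times = {}
    for raw in log_lines:
        try:
            entry = json.loads(raw)
        except ValueError:
            continue  # skip non-JSON lines
        data = entry.get("data", {})
        if data.get("kind") == "BlockReceived":
            times[data.get("hash")] = entry.get("at")
    return times
```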
Regarding logs, I experimented with adding a socket interface to try to collect the data. This would be a more ideal method of collecting the info because it's pushed on each block, rather than polled. However, while I can get a python script to listen to the socket using TraceForwarderBK, it has problems:
On initialization from a clean slate the socket works fine. But if either the python script or the node is stopped, it's unable to restart: Python, if restarted, will complain about the socket already being in use, and the node, if restarted, simply won't reconnect to the socket and continue transmitting.
My workaround was to delete the socket and restart both whenever either needs restarting... which is not viable. I admit my time spent debugging it has been limited, to say the least.
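For the Python-side "socket already in use" error, a common fix is to unlink a stale socket file before binding, so the listener can always restart without manual cleanup. A minimal sketch (this does not address the node-side reconnect issue):

```python
import errno
import os
import socket

def bind_listener(path):
    """Bind a Unix listening socket, removing any stale socket file
    first. This avoids 'address already in use' after an unclean
    shutdown of the listener."""
    try:
        os.unlink(path)            # clean up a leftover socket file
    except OSError as e:
        if e.errno != errno.ENOENT:
            raise                  # only ignore "file not found"
    sock = socket.socket(socket.AF_UNIX)
    sock.bind(path)
    sock.listen(1)
    return sock
```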
Since the timing data will need to come from the logs anyway, one idea would be to spend time creating a sample app that connects to the TraceForwarderBK (and fixes the issues I'm stuck on), rather than a new CLI command. A CLI would be great in general, but the right solution is likely a socket push strategy longer term, given that we need to monitor logs anyway.
Example:
import socket as s
import sys

sock = s.socket(s.AF_UNIX)
sock.bind('/tmp/mikepipe2')
sock.listen(1)
while True:
    sd, address = sock.accept()
    line = sd.recv(1024)
    while line:
        decodedline = line.decode('utf-8')
        # do stuff with the data, e.g. write it to stdout
        sys.stdout.write(decodedline)
        line = sd.recv(1024)
    sd.close()
sock.close()
Credit to iiLap for helping with this script
Extracting the needed info from the logs is a pain in the neck. Providing a web socket (which pushes data when it arrives) is pretty much the ultimate solution for this issue.
Some of this information is already traced, or computed from traced data, and shown in the TUI or forwarded to EKG/Prometheus. Maybe this effort can also take it from there.
The TUI is not useful for the intended use case. Pool operators need this accessible in a way that is automatable and machine readable, and the TUI does not fit. EKG is great, but it is still not what pool operators actually need: specifically, EKG would require polling, whereas pool operators would prefer that this information was pushed from the node.
+1 to this. I am looking for a way to get the number of currently connected peers :/ Ideally I'd love to call an endpoint to get this info.
Extracting the needed info from the logs is a pain in the neck. Providing a web socket (which pushes data when it arrives) is pretty much the ultimate solution for this issue.
We have structured JSON logs. We have the TraceForwarderBK, which also pushes. Apparently, as noted above, this would need some more attention to work reliably, but that should cover exactly this case.
So we have a polling-style CLI command to get the tip info, but if people want a more continuous interface, then perhaps they'd like to go lower level. There's a websockets proxy for the node's local chain sync client: https://github.com/KtorZ/cardano-ogmios
I am also missing the status info of the node.
In jormungandr there is a CLI command to get stats, which also provides the status of the node:
jcli rest v0 node stats get -h <api_address>
...
peerTotalCnt: 10240
peerUnreachableCnt: 0
state: Running
...
It would be great to have info on when a node is bootstrapping or syncing.
@erikd @kevinhammond @Jimbo4350 @dcoutts what is the status of this one?
still nothing?
Somehow this escaped our attention. We are on it now.
If we do this, I think it should be via the monitoring metrics. I think most of these things are already available via the metrics, so it should be just a matter of reviewing them and seeing if we need a few more.
We are working on this, but it's tricky and non-trivial.
What is the status here?
From my perspective, this is no longer a priority. We ended up going down the route mentioned above and interfaced directly with the chain sync client through @AndrewWestberg's CNCLI, so we could pull out everything we need.
Pool operators on the ITN use something that connects to the Jormungandr node, pulls out the required data, and submits it to https://pooltool.io/
The new Haskell node should make this at least as easy. It should also add the block arrival time (with as good resolution as possible) to the above list.
One possibility would be a CLI command that connects to the node and dumps the required output as JSON to stdout. Users would then be able to pipe that to a program that submits it to https://pooltool.io/ . Another possibility is to just use a web socket in the node to serve this JSON.
Most importantly, this feature should be implemented as a push from the node, rather than requiring polling of the node.