Closed by bneiluj 4 years ago
Hi @bneiluj, thank you for opening the issue and for your interest in P.A.N.I.C.
From a Telegram status point of view, the `/status` command is meant to show that the monitors are still running. It does not guarantee that the validator itself is still running; this is handled by the alerting part of P.A.N.I.C.
From an alerting standpoint, the node monitor does depend on the RPC calls providing reliable information. The node monitor does compare fetched validator state with previous state; this is how P.A.N.I.C. is able to let you know that the number of peers has decreased, for example. However, I do like the idea of comparing UTC time < previous UTC time as an extra alert, and the idea of signing the information; these could be great additions to P.A.N.I.C.
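As a rough sketch of that state comparison (the field names here are illustrative, not P.A.N.I.C.'s actual internals), including the UTC-time check suggested above:

```python
from dataclasses import dataclass
from typing import List

@dataclass
class NodeState:
    """Snapshot of validator state fetched by the node monitor (illustrative fields)."""
    no_of_peers: int
    utc_time: float  # seconds since epoch, as reported by the node

def compare_states(previous: NodeState, current: NodeState) -> List[str]:
    """Return alert messages for suspicious transitions between two snapshots."""
    alerts = []
    if current.no_of_peers < previous.no_of_peers:
        alerts.append(
            f"Peer count decreased: {previous.no_of_peers} -> {current.no_of_peers}")
    # Extra check proposed in this thread: reported time must never move backwards.
    if current.utc_time < previous.utc_time:
        alerts.append("Reported UTC time went backwards; data may be stale or faked")
    return alerts
```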
I would appreciate it if you could elaborate on the three examples of incorrect data (cached data, wrong RPC calls, stuck in a loop). As for cached data, from a Python point of view the `get` requests being made do not use caching by default. I'm not exactly sure what is meant by wrong RPC calls. As for getting stuck in loops, the `get` requests have timeouts set up; if a timeout is exceeded, an alert is triggered that delays are being experienced, and subsequently that the node is offline.
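The timeout behaviour can be sketched roughly like this (using the standard library's `urllib` rather than P.A.N.I.C.'s actual HTTP stack; the endpoint and alert wording are illustrative):

```python
import json
import socket
from urllib import error, request

def fetch_node_data(url: str, timeout_seconds: float = 10) -> dict:
    """Fetch node status, translating timeouts and failures into alerts.

    A minimal sketch: in this thread's terms, exceeding the timeout first
    signals that delays are being experienced, and repeated failures would
    subsequently escalate to a "node offline" alert.
    """
    try:
        with request.urlopen(url, timeout=timeout_seconds) as resp:
            return {"ok": True, "data": json.loads(resp.read())}
    except socket.timeout:
        # The request never completed within the timeout.
        return {"ok": False, "alert": "experiencing delays"}
    except (error.URLError, OSError) as exc:
        # The node could not be reached at all.
        return {"ok": False, "alert": f"node unreachable: {exc}"}
```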
In all of this, it is important to note the role of the network monitor, which uses an auxiliary node or set of nodes (i.e. excluding the validator itself) to make sure that, at the very least, the validator is not missing blocks. It does this by checking every height for the validator's precommit.
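The precommit check can be illustrated along these lines (the commit structure loosely mimics a Tendermint `/commit` response's signatures list; field names are simplified assumptions):

```python
def validator_precommitted(block_commit: dict, validator_address: str) -> bool:
    """Check whether the validator's precommit appears in a block's commit."""
    return any(
        sig.get("validator_address") == validator_address
        for sig in block_commit.get("signatures", [])
    )

def count_missed_blocks(commits: list, validator_address: str) -> int:
    """Count commits (one per height) missing the validator's precommit."""
    return sum(
        1 for commit in commits
        if not validator_precommitted(commit, validator_address)
    )
```

Each height's commit is checked against the validator's address; a growing missed-block count is what triggers the "validator is missing blocks" alert.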
Thanks for your quick reply.
I understand how P.A.N.I.C. is built and really like it. Let's look at it from an architecture point of view, without digging into the code or the internal modules. Consider P.A.N.I.C. as a black box: it communicates with the validator to fetch state information, and with a Telegram box to send and receive user information. The current mode is: log an alert if there is an issue. To verify that the entire Telegram <-> P.A.N.I.C. <-> Validator communication chain is running without any problem, the information received by the P.A.N.I.C. node and at the end (Telegram) needs to be made tamper-proof by a signed value coming from the validator (a value that can only have been produced by the validator). Doing so guarantees that the validator hasn't been hacked, that it's still running, and that the data isn't fake data being pushed by the P.A.N.I.C. node.
That's a great idea that we will definitely consider implementing as we continue to extend the capabilities of P.A.N.I.C. In reality, the tool was not built with detection of cases of hacks in mind and to a certain extent assumes a reasonably secure environment.
Yes, no worries Miguel. But my previous points were not so much about "hacks in mind" but rather about a broken flow in the current architecture implementation.
For example, a basic scenario: the Monitor instance goes down first, and then the Validator goes down.

=> ! BROKEN !

In this scenario, the Telegram bot receives no error messages even though the Validator is down, because the Monitor responsible for the logging is down!
So, one of the major flaws in the current architecture is the assumption "if there are no messages on the Telegram bot, then the Validator is fine". That assumption fails as soon as the Monitor instance goes down (or has a communication issue) first, after which anything can happen unnoticed. This issue can be fixed by implementing a regular status update (every ~2 hours or so) from the monitor to the Telegram bot.
If you can't guarantee that the Monitor is running to catch errors, then you are never 100% sure that the Validator is running fine, which defeats the purpose of the entire P.A.N.I.C. monitoring architecture.
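A minimal sketch of such a periodic heartbeat (the interval, state shape, and function names are hypothetical, not P.A.N.I.C.'s actual implementation):

```python
HEARTBEAT_INTERVAL = 2 * 60 * 60  # roughly every 2 hours, as suggested above

def heartbeat_due(last_heartbeat: float, now: float,
                  interval: float = HEARTBEAT_INTERVAL) -> bool:
    """True when the monitor should push an 'I am still running' message."""
    return now - last_heartbeat >= interval

def monitor_tick(state: dict, now: float, send) -> None:
    """One iteration of a hypothetical monitor loop: alongside the normal
    alert checks, emit a periodic heartbeat so that silence on Telegram
    actually means something."""
    if heartbeat_due(state["last_heartbeat"], now):
        send("Monitor alive: all checks still running")
        state["last_heartbeat"] = now
```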
Your ideas are greatly appreciated Julien. What you are saying is accurate and correct. We can definitely put in an optional status update as a sign that the monitoring is still running.
With the current version of P.A.N.I.C, one can also greatly reduce the chance of the described scenario happening by running multiple instances in different geographical locations. Alternatively, the latest updates from each monitor can be queried from Redis (programmatically) and checked as a form of monitoring of P.A.N.I.C. We also recommend that P.A.N.I.C. is set up as a Linux service, to further reduce the possibility of P.A.N.I.C. being offline.
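The Redis-based check mentioned above could be sketched as a staleness test on a monitor's last-update timestamp (the threshold is hypothetical, and actually reading the value from Redis, e.g. via redis-py's `GET`, is elided here):

```python
from datetime import datetime, timedelta, timezone

STALE_AFTER = timedelta(minutes=15)  # hypothetical threshold

def monitor_is_stale(last_update_iso: str, now: datetime,
                     stale_after: timedelta = STALE_AFTER) -> bool:
    """Given a monitor's last-update timestamp (as would be fetched from
    Redis in a real setup), decide whether P.A.N.I.C. itself looks offline."""
    last_update = datetime.fromisoformat(last_update_iso)
    return now - last_update > stale_after
```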
As far as I can see on https://github.com/SimplyVC/panic/blob/master/doc/DESIGN_AND_FEATURES.md, and after testing the implementation in our current stack, nothing guarantees that the validator doesn't have an issue or is still running while receiving node monitor status updates.
The issue is that the current design doesn't check whether the information received is actually being sent by the Cosmos validator. It could be cached data, wrong RPC calls, a process stuck in a loop, etc.
The node monitor should at least compare fetched validator state information with the previous state: for example, if block number < previous block number, or UTC time < previous UTC time, then trigger an error. Ideally, the information received by the node monitor should be signed by the validator and then verified by the node monitor, so that the node monitor can always be sure that the validator is running fine.
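The block-number part of that comparison could look like this minimal sketch (names are illustrative):

```python
from typing import Optional

def check_block_progress(prev_height: int, new_height: int) -> Optional[str]:
    """If the fetched block number regresses, the data cannot be fresh
    information from a live validator, so return an error alert."""
    if new_height < prev_height:
        return (f"block height regressed from {prev_height} to {new_height}: "
                "data may be cached, wrong, or replayed")
    return None  # height advanced or held steady: no alert
```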