SimplyStaking / panic_cosmos

🚨 PANIC for Cosmos
GNU General Public License v3.0

Issue in the current architecture #9

Closed bneiluj closed 4 years ago

bneiluj commented 4 years ago

As far as I can see from https://github.com/SimplyVC/panic/blob/master/doc/DESIGN_AND_FEATURES.md, and after testing the implementation in our current stack, nothing guarantees that the Validator is still running and issue-free at the moment the Node Monitor status is received.

The node monitor deals with exactly one node, such that multiple node monitors are started up if you set up the alerter with multiple nodes. In a typical monitoring round, the node monitor:

1. Checks if the node is reachable from [RPC_URL]
2. Gets the node's status from [RPC_URL]/status
   - Gets and stores the voting power
   - Gets and stores the catching-up status
3. Gets the node's net info from [RPC_URL]/net_info
   - Gets and stores the number of peers
4. Saves its state and the node's state
5. Sleeps until the next monitoring round

At each step, the network monitor goes through the data sources and picks the first full node that responds ([RPC_URL]/health). Having additional full nodes increases data source redundancy.
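For context, a monitoring round along these lines boils down to something like the following sketch, combining the round above with the /health-based data-source selection (this is not P.A.N.I.C's actual code; the JSON field names follow the standard Tendermint RPC responses, everything else is illustrative):

```python
import time
import requests

RPC_URLS = ["http://node1:26657", "http://node2:26657"]  # data sources, in order of preference
MONITOR_PERIOD = 10  # seconds between monitoring rounds (illustrative value)

def first_healthy_node(urls):
    """Pick the first data source that responds to /health."""
    for url in urls:
        try:
            if requests.get(url + "/health", timeout=5).ok:
                return url
        except requests.RequestException:
            continue  # this data source is down; try the next one
    raise RuntimeError("no healthy data source available")

def monitoring_round(rpc_url):
    # 2. Get the node's status: voting power and catching-up flag
    status = requests.get(rpc_url + "/status", timeout=5).json()["result"]
    voting_power = int(status["validator_info"]["voting_power"])
    catching_up = status["sync_info"]["catching_up"]

    # 3. Get the node's net info: number of peers
    net_info = requests.get(rpc_url + "/net_info", timeout=5).json()["result"]
    n_peers = int(net_info["n_peers"])

    return voting_power, catching_up, n_peers

while True:
    print(monitoring_round(first_healthy_node(RPC_URLS)))  # 4. state would be saved here
    time.sleep(MONITOR_PERIOD)  # 5. sleep until the next monitoring round
```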

The issue is that the current design doesn't check whether the information received is actually coming from the Cosmos validator. It could be cached data, incorrect RPC calls, a process stuck in a loop, etc.

The node monitor should at least compare the fetched Validator state with the previous state: for example, if the block number is lower than the previous block number, or the UTC time is earlier than the previous UTC time, it should trigger an error. Ideally, the information received by the node monitor should be signed by the validator and then verified by the node monitor, so that the node monitor can always be sure that the validator is running fine.
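To make this concrete, a minimal sketch of such checks could look as follows (illustrative only; the state fields are assumptions, not from the P.A.N.I.C codebase):

```python
from datetime import datetime

def sanity_check(prev, curr):
    """Compare freshly fetched validator state against the previously saved state.

    Returns a list of anomalies; an empty list means the data looks plausible.
    `prev`/`curr` are assumed to be dicts with 'height' (int) and 'time'
    (UTC datetime), e.g. parsed from [RPC_URL]/status.
    """
    anomalies = []
    if curr["height"] < prev["height"]:
        anomalies.append("block height went backwards (stale or cached data?)")
    if curr["time"] < prev["time"]:
        anomalies.append("UTC time went backwards (stale or cached data?)")
    if curr["height"] == prev["height"] and curr["time"] == prev["time"]:
        anomalies.append("state identical to previous round (node stuck?)")
    return anomalies

# Example: the new round returned an older block than the previous round
prev = {"height": 1000, "time": datetime(2019, 9, 28, 10, 26, 41)}
curr = {"height": 998,  "time": datetime(2019, 9, 28, 10, 25, 0)}
for anomaly in sanity_check(prev, curr):
    print("ALERT:", anomaly)  # would be routed to the alerter in practice
```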

migueldingli1997 commented 4 years ago

Hi @bneiluj, thank you for opening the issue and for your interest in P.A.N.I.C.

From a Telegram status point of view, the /status command is meant to show that the monitors are still running. The /status command does not guarantee that the validator itself is still running; this is handled by the alerting part of P.A.N.I.C.

From an alerting standpoint, the node monitor does depend on the RPC calls providing reliable information. The node monitor does compare fetched validator state with previous state; this is how P.A.N.I.C. is able to let you know that the number of peers has decreased, for example. However, I do like the idea of comparing UTC time < previous UTC time as an extra alert, and the idea of signing the information; these could be great additions to P.A.N.I.C.

I would appreciate it if you could elaborate on the three examples of incorrect data (cached data, wrong RPC calls, stuck in a loop). As for cached data, from a Python point of view, the GET requests being made do not use caching by default. I am not exactly sure what is meant by wrong RPC calls. And as for getting stuck in loops, the GET requests have timeouts set up which, if exceeded, trigger an alert that delays are being experienced and, subsequently, that the node is offline.
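For reference, the timeout behaviour boils down to something like this (a simplified sketch, not the actual P.A.N.I.C code; the print statements are placeholders for the real alerter):

```python
import requests

def get_with_timeout(url, timeout_s=10):
    """GET with a timeout; note that requests performs no caching by default."""
    try:
        return requests.get(url, timeout=timeout_s).json()
    except requests.Timeout:
        # timeout exceeded: alert that delays are being experienced
        print("ALERT: experiencing delays from", url)
        raise
    except requests.ConnectionError:
        # cannot connect at all: alert that the node is offline
        print("ALERT: node appears to be offline:", url)
        raise
```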

In all of this, it is important to note the role of the network monitor, which uses an auxiliary node or set of nodes (i.e. excluding the validator itself) to make sure that, at the very least, the validator is not missing blocks. It does this by checking every height for the validator's precommit.
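Conceptually, that precommit check is along these lines, assuming a Tendermint-style /block endpoint (field names such as precommits vs. signatures vary between Tendermint versions, so treat this purely as a sketch):

```python
import requests

def validator_precommitted(rpc_url, height, validator_address):
    """Check whether the validator's precommit appears in the block at `height`.

    Note: the block at height H carries the commit for height H-1 in
    `last_commit`; the exact field name depends on the Tendermint version.
    """
    resp = requests.get(rpc_url + "/block", params={"height": height}, timeout=5)
    last_commit = resp.json()["result"]["block"]["last_commit"]
    precommits = last_commit.get("precommits") or last_commit.get("signatures", [])
    # entries can be null for absent votes, hence the `p and ...` guard
    return any(p and p.get("validator_address") == validator_address for p in precommits)

# Usage (the URL, height, and address are illustrative):
if not validator_precommitted("http://fullnode:26657", 123456, "A1B2C3..."):
    print("ALERT: validator missed a block")
```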

bneiluj commented 4 years ago

Thanks for your quick reply.

I understand how P.A.N.I.C is built and really like it. Let's look at it from an architectural point of view, without digging into the code or the different internal modules. Consider P.A.N.I.C as a black box: the black box communicates with the Validator to fetch state information, and it also communicates with a Telegram bot to send and receive user information. The current mode of operation is: log an alert if there is an issue. To verify that the entire Telegram <-> P.A.N.I.C <-> Validator communication chain is working correctly, the information received by the P.A.N.I.C node, and ultimately by Telegram, needs to be made tamper-proof by a "signed value" coming from the Validator (a value that can only be produced by the Validator). This guarantees that the Validator hasn't been hacked, that it is still running, and that the data isn't fake data pushed by the P.A.N.I.C node itself.

migueldingli1997 commented 4 years ago

That's a great idea that we will definitely consider implementing as we continue to extend the capabilities of P.A.N.I.C. In truth, the tool was not built with the detection of hacks in mind, and to a certain extent it assumes a reasonably secure environment.

bneiluj commented 4 years ago

Yes, no worries, Miguel. But my previous points were not so much about hacks as about a broken flow in the current architecture implementation.

For example, basic scenario:

  1. For some reason, the Monitor goes offline / down
  2. Right after that, the Validator goes offline / down

=> ! BROKEN !

In the above scenario, the Telegram bot will receive no error messages even though the Validator is down, since the Monitor responsible for the logging is down itself!

So, one of the major flaws in the current architecture is the implicit rule "if the Validator is fine, then there are no messages on the Telegram bot", which users inevitably read in reverse: no messages means the Validator is fine. That reading stays plausible even if the Monitor instance goes down (or has a communication issue) first, after which anything can happen. This can be fixed by implementing a regular status update (every ~2h or something like that) from the monitor to the Telegram bot, as sketched below.
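Such a heartbeat could be as simple as the following sketch, using the standard Telegram Bot API sendMessage method (the token, chat id, and period are placeholders):

```python
import time
import requests

BOT_TOKEN = "<telegram-bot-token>"  # placeholder
CHAT_ID = "<chat-id>"               # placeholder
HEARTBEAT_PERIOD = 2 * 60 * 60      # ~2 hours, as suggested above

def send_heartbeat():
    requests.get(
        f"https://api.telegram.org/bot{BOT_TOKEN}/sendMessage",
        params={"chat_id": CHAT_ID, "text": "PANIC monitor is alive"},
        timeout=10,
    )

# If these messages stop arriving, the monitor itself is down, so silence
# on Telegram can no longer be misread as "the Validator is fine".
while True:
    send_heartbeat()
    time.sleep(HEARTBEAT_PERIOD)
```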

If you can't guarantee that the Monitor is running to catch errors, then you are never 100% sure that the Validator is running fine, which defeats the purpose of the entire PANIC Monitor architecture.

migueldingli1997 commented 4 years ago

Your ideas are greatly appreciated, Julien. What you are saying is accurate. We can definitely put in an optional status update as a sign that the monitoring is still running.

With the current version of P.A.N.I.C, one can also greatly reduce the chance of the described scenario by running multiple instances in different geographical locations. Alternatively, the latest updates from each monitor can be queried from Redis (programmatically) and checked, as a way of monitoring P.A.N.I.C itself. We also recommend setting up P.A.N.I.C as a Linux service, to further reduce the possibility of it being offline.
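For example, such a Redis check could look roughly like this (the key naming convention and staleness threshold here are assumptions; the actual Redis schema used by P.A.N.I.C may differ):

```python
import time
import redis

r = redis.Redis(host="localhost", port=6379)
STALE_AFTER = 15 * 60  # seconds without an update before the monitor is considered down

# Assumed convention: each monitor periodically writes a UNIX timestamp under its own key
last_update = r.get("monitor:node1:last_update")
if last_update is None or time.time() - float(last_update) > STALE_AFTER:
    print("ALERT: node monitor 'node1' has not updated Redis recently; it may be down")
```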