arhag opened 2 years ago
Possible simplified proposal:

Do not store anything in blockchain state. Use an `addalert` action to `eosio.alert` to trigger behavior in `chain_plugin` when the action is irreversible. The alert would shut down the node with an `error` level log indicating the alert message and the time when the node can be restarted. On startup, scan the block log head block for an `addalert` to `eosio.alert` to determine if the timeout has passed. Could still honor additional alert issuers via a configurable option.
Also add an option to log an `error` level message every x seconds instead of shutting down. Under this approach, the log message would not be re-initiated on a restart unless a new `addalert` is sent to the chain. This `addalert` could be sent via an oracle periodically, if desired, to give better coverage of recently started nodes.
@heifner: In practice, the `addalert` action would be called as an inline action (e.g. via multisig), so it wouldn't show up in the block log.
@arhag I guess I actually meant `alertadd` then.
Motivation
Node operators may be unaware of critical upgrades to Leap (or other future Antelope implementations) that are necessary either for security reasons (to fix security bugs in the old version of the software) or to prepare for an upcoming protocol upgrade.
If a node operator has not applied a critical security patch to their node, they may be better off having their software shut down (or at least stop synchronizing with the blockchain or communicating with peers) than continuing to run a vulnerable node.
If a node operator has not updated their node for an upcoming protocol upgrade, then their node will automatically stop synchronizing with the blockchain anyway. But this means their node (and the businesses dependent on it) remains unavailable until they upgrade their tools and processes to transition to the new version of the node that supports the protocol upgrade. Depending on the upgrade, the dependent services they maintain, and their state of readiness, this could take a long time. And all the while, their business could be down.
It is obviously desirable for the node operators to have sufficient time well before the protocol activation to carry out this upgrade and have all their updated nodes ready to go for protocol activation. The issue is that many node operators aren't paying close attention to planned activation despite efforts to spread the message and do outreach. Many procrastinate even determining what the scope of the upgrade work is until it becomes obvious that they have no other choice but to take action to avoid downtime of their services.
So a feature to do a soft trial run of the impact of a protocol activation could be valuable in getting more node operators to upgrade in a timely manner and minimize downtime. The idea is to take down the nodes that aren't ready for the upgrade for a temporary and adjustable timeout period, as a way to make it clear to operators that some of their nodes haven't been properly updated for the upcoming upgrade. But since this would not actually activate the protocol features yet, it would be possible to allow those old nodes to resume syncing and operating after the timeout period. That way the disruption to their business is minimal, and it gives them a significant window after that event to actually upgrade their nodes so that they are ready for the real upgrade.
Feature
Changes can be added to the `chain_plugin` to enable this new alert feature. The signaling mechanism would be through a new contract to be deployed on a standard account: `eosio.alert`. If the account does not exist, no contract is deployed to the account, or the contract does not appear to be following the appropriate alert standard, this feature will automatically be disabled (at least until those conditions change) and will not cause disruption to the operation of the node.

Alert standard
The alert standard involves tracking the current state of alerts on the chain and providing a mechanism to access it. It also involves actions `alertadd` and `alertremove`, which are not meant to be called directly but only to be emitted as inline actions by the `eosio.alert` contract; they act as events signaling that a new unexpired alert has been added to (in the case of `alertadd`) or removed from (in the case of `alertremove`) the state of active alerts tracked within the `alerts` table.

The standard defines the signatures of the `alertadd` and `alertremove` actions. The standard also provides an additional action `getalerts` to access the currently active alerts for any alert issuer.

The `getalerts` action allows looking up one or more alerts that are issued by a specified issuer (`std::get<get_alerts_by_id_output_v0>(input).issuer`) using a pagination mechanism. The pagination mechanism allows the caller of the action to get the matching alerts in order (either in ascending or descending `alert_id` order, which, due to the way `alert_id` works, also corresponds to the order in which alerts were created).

In the returned output, assuming the
`results` vector is not empty, a `pagination_metadata` is included which will have `first_cursor` and `last_cursor` values. In the case of the `getalerts` action, these cursor values will actually be `alert_id`s. Regardless of which order the pagination was requested in, the `results` vector from the `getalerts` output will always be sorted in ascending order of `alert_id` and, assuming the `results` vector is not empty, `first_cursor` will be the `alert_id` of the first item in the `results` vector and `last_cursor` will be the `alert_id` of the last item.

The comments of the fields for the
`alert_type` (and related) data structures explain what they are and what purpose they serve for the alert. However, it is worth mentioning the following. First, `timeout_sec` can be 0, which means the alert does not cause any nodes to become unavailable for any duration of time. Second, the `expiration` time supersedes the duration of a timeout indicated by `timeout_sec`, meaning that once the alert expires, any timeouts caused by that alert on the node should end unless some other unexpired alert is causing a timeout on the node. Finally, `getalerts` should return all tracked alerts even if they have already expired (even though `chain_plugin` will ultimately ignore those alerts). This is because clients will need some way to know which expired alerts are still tracked in state and should ideally be garbage collected.

The standard also requires that the contract provides a mechanism to add new alerts, but it does not require a particular interface for how to do that. It simply requires that, within the action that creates the new alert, the contract ensures that it:
- requires the authorization of the `issuer` of that new action;
- assigns an `alert_id` that is greater than any other `alert_id` previously assigned to an alert tracked by the contract;
- ensures the `start_time` of the alert is the current block time at the time the alert is being added;
- ensures the `expiration` of the alert is strictly greater than the `start_time` plus 5 minutes;
- emits an `alertadd` inline action with the `alert_id` chosen for the new alert and the same `issuer`.

The standard also requires that the contract provides a mechanism to remove an existing alert (whether expired or not). The standard does not specify the interface for this action to remove an existing alert, but it requires that:
- the `issuer` of the alert to be removed is provided as an authority for the action;
- the `start_time` of the alert plus 5 minutes is strictly less than the current time;
- the contract emits an `alertremove` inline action with the `alert_id` and `issuer` set to the same values as those of the removed alert.

The standard recommends the contract provide some action that allows anyone to garbage collect expired alerts, and also recommends that the action to add a new alert does a little bit of work in garbage collecting expired alerts (garbage collecting two expired alerts, if possible, within the action to add a new alert is recommended).
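The add and remove requirements above can be modeled in a small, self-contained C++ sketch. Everything here is illustrative: the struct layout, the helper names `validate_add`/`validate_remove`, and the use of plain integers for timestamps and `std::string` for account names are assumptions, not part of the standard.

```cpp
#include <cassert>
#include <cstdint>
#include <string>

// Hypothetical stand-in for the alert contents tracked in the `alerts` table.
// Field names follow the prose above; types are simplified (UNIX seconds
// instead of block time points, std::string instead of an account name type).
struct alert_type {
    uint64_t    alert_id;      // strictly increasing across all alerts
    std::string issuer;        // account that issued the alert
    uint32_t    start_time;    // block time when the alert was added
    uint32_t    expiration;    // must exceed start_time by more than 5 minutes
    uint32_t    timeout_sec;   // 0 => informational alert, no downtime
    std::string alert_message; // message surfaced in node logs
};

constexpr uint32_t five_minutes = 5 * 60;

// Checks the standard's requirements when adding a new alert.
bool validate_add(const alert_type& a, uint64_t max_existing_alert_id,
                  uint32_t current_block_time) {
    return a.alert_id > max_existing_alert_id          // monotonically increasing id
        && a.start_time == current_block_time          // start_time is current block time
        && a.expiration > a.start_time + five_minutes; // strictly greater than start + 5 min
}

// Checks the standard's requirement when removing an existing alert.
bool validate_remove(const alert_type& a, uint32_t current_block_time) {
    return a.start_time + five_minutes < current_block_time; // strictly less than now
}
```

In the actual contract, a successful add or remove would additionally emit the corresponding `alertadd` or `alertremove` inline action carrying the chosen `alert_id` and `issuer`.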
The standard does not prevent other actions and capabilities from being added to the contract, but it requires that the contents of the alerts tracked in state remain immutable. If there is a desire to update the contents of an unexpired alert, the old `alert_id` should no longer be used, an `alertremove` action should be emitted for the old `alert_id`, the new contents should be associated with a brand new `alert_id`, an `alertadd` action should be emitted for the new `alert_id`, and the `start_time` in the new contents of the alert must be updated to the current block time.

chain_plugin changes
With this feature enabled, the chain plugin will watch for `alertadd` and `alertremove` actions of the contract on the `eosio.alert` account within traces of executed transactions. It will attempt to decode each action's input payload according to `alert_event_type`. If decoding is not possible, it will silently ignore the event. Otherwise, it will check the value of the `issuer` field to see if it matches the configured list of issuers (an issuer name of `eosio.alert` always matches); see the later discussion for how the node can be configured to match against other names. If `issuer` does not match, it will skip the event. Otherwise, it uses the value of `alert_id` (and possibly state lookups) to keep its own internally tracked list of alerts (which can be ephemeral and not durably stored on disk) in sync with the contract state.

If the event is from
`alertadd`, the chain plugin will look up the alert details using the `getalerts` action. If the alert doesn't exist, it can skip the event. The standard intentionally does not require the contract to implement its table state in a particular way; instead, it specifies how that state can be accessed via the `getalerts` action. This means that the chain_plugin will need to create and execute a speculative transaction, which it will discard rather than commit to state, after the transaction from a signed block that generated the `alertadd` action (either immediately after the transaction or later at the end of the block, but the queries must occur prior to the start of the next block). Because the standard requires that the `expiration` is strictly greater than the `start_time` and disallows removing alerts unless the `start_time` is strictly less than the current block time, it is guaranteed that the alert referenced by the event will still be in the state by the end of that block (though no guarantees exist beyond that point).

Once chain_plugin gets the alert details, it can use the filters in the `filters` field to determine whether the alert applies to this node. See the section further below to learn more about how the alert filters can be used for particular use cases. If it doesn't match, the event can be skipped. Otherwise, the alert will need to be tracked by chain_plugin in its ephemeral in-memory state. The alert with its associated contents is not immediately added to the ephemerally tracked active alerts within the memory of chain_plugin but is instead added into a staging queue. The alerts are moved from this queue into the ephemerally tracked active alerts only when the block they originated from becomes irreversible.

If the event is from
`alertremove` and the `issuer` matches, the chain plugin will add into another queue a pair consisting of the block number of the block where the `alertremove` originated and the `alert_id` from the `alertremove` event. Then, as blocks become irreversible, it will consume items from that queue and handle them by removing from its ephemerally tracked active alerts in memory any alert that matches the `alert_id`.

The two queues mentioned above should be appropriately trimmed from the back on fork switches so that information from other branches that is not part of the current canonical chain does not unintentionally carry over and ultimately impact the ephemerally tracked active alerts.
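The queueing behavior described above can be sketched with a simplified, self-contained model. The type and method names (`alert_tracker`, `on_irreversible`, `on_fork_switch`) are hypothetical; real chain_plugin code would hook the controller's block and fork signals instead.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <deque>
#include <vector>

// Simplified model of the two queues: alert events from still-reversible
// blocks wait in a queue and only affect the active set once their
// originating block becomes irreversible.
struct alert_tracker {
    struct pending { uint32_t block_num; uint64_t alert_id; };
    std::deque<pending>   pending_adds;     // alertadd events awaiting irreversibility
    std::deque<pending>   pending_removes;  // alertremove events awaiting irreversibility
    std::vector<uint64_t> active_alerts;    // ephemeral, in-memory only

    // Called as LIB advances: consume queue entries up to the new LIB.
    void on_irreversible(uint32_t lib_num) {
        while (!pending_adds.empty() && pending_adds.front().block_num <= lib_num) {
            active_alerts.push_back(pending_adds.front().alert_id);
            pending_adds.pop_front();
        }
        while (!pending_removes.empty() && pending_removes.front().block_num <= lib_num) {
            uint64_t id = pending_removes.front().alert_id;
            pending_removes.pop_front();
            active_alerts.erase(std::remove(active_alerts.begin(), active_alerts.end(), id),
                                active_alerts.end());
        }
    }

    // Called on a fork switch: trim from the back any entries that came from
    // blocks beyond the common ancestor, so events from the abandoned branch
    // cannot affect the active set.
    void on_fork_switch(uint32_t common_ancestor_num) {
        while (!pending_adds.empty() && pending_adds.back().block_num > common_ancestor_num)
            pending_adds.pop_back();
        while (!pending_removes.empty() && pending_removes.back().block_num > common_ancestor_num)
            pending_removes.pop_back();
    }
};
```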
On startup, the chain_plugin should use `getalerts` to look up all unexpired alerts from the contract state for the `eosio.alert` issuer as well as any other matching issuers configured for the chain_plugin. It should partition these alerts into two sets based on their `start_time`: those with a `start_time` less than or equal to the timestamp of the last irreversible block, and those with a `start_time` greater than that timestamp. The first set of alerts forms the initial state for the ephemerally tracked alerts in chain_plugin memory, and the second set is added into the queue which may be further pushed into from `alertadd` events.

Note that this may not accurately reconstruct the ephemerally tracked alerts of another node with a similar configuration that had been running for a while, because the newly started node would initialize state as of the (reversible) head block, which has already applied removals initiated in recent (still reversible) blocks. This means that the side effect of removing the alert would be applied earlier for the restarted node than for the node that had been running for a while. Even worse, if a fork switch happens soon after startup, it is possible for some alerts that would remain active from the perspective of a node correctly following the canonical blockchain from a much earlier point in time to not be present at all in the recently restarted node (at least until it restarts again). These edge cases are mitigated, though, because the standard enforces a 5 minute delay between when an alert is created and when it can first be removed (or garbage collected). So if the delta in block timestamp between the head block and the last irreversible block remains less than 5 minutes (as it usually does on the EOS Network, for example), then these edge cases would not be encountered at all. Given the mitigations, and the fact that encountering these edge cases seems fairly harmless given the nature of what these alerts are trying to accomplish, it may be worth accepting the edge cases as part of this design rather than constructing a more sophisticated design which makes those edge cases impossible but makes other compromises (e.g. putting the ephemerally tracked alert state into a database like Chainbase, or requiring the state to be unwound back to irreversible on startup and then replayed forward to accurately initialize the chain_plugin state dealing with tracking alerts).
Side-effects of active alerts
While an alert is active, we want to provide some way for the node operator to see the state of currently active alerts through some API (perhaps this will need to be through the producer API, or else we provide endpoints for the chain API that are meant to be kept for private operator use rather than exposed to the public?).
If the alert has a timeout, the timeout end time is determined by adding `timeout_sec` seconds to the `start_time`. The actual wall-clock time (not the block timestamp of the current head block) should then be used to determine when the timeout ends, because during a timeout the node may not be able to continue synchronizing blocks. Alternatively, we could provide a softer version of the timeout to accomplish the same goals laid out in the motivation section by still allowing the node to synchronize and validate blocks but simply disallowing access to the endpoints in the chain API (or maybe all but the one to monitor the status of alerts?).

When a new alert becomes active, the nodeos logs should indicate that it happened and should also include the `alert_message` in the output. Perhaps this event should also be logged in other plugins that may exist in nodeos at the time, like the Prometheus plugin.

If nodeos is in the middle of a timeout due to an alert, that fact should probably be repeated in the logs periodically (along with the `alert_message`) to make it clear to a node operator just looking at the tail of the logs what is going wrong with the node and (hopefully through the `alert_message`) what action they need to take to resolve it.

Subscribing to other alert issuers
We can optionally choose to go further and allow the node operator to provide a list of additional alert issuers (by their Antelope account names) that chain_plugin will respect when matching on alert events. The `eosio.alert` account will always be included as part of this match list, whether explicitly included by the node operator or not.

This allows other organizations to use the `eosio.alert` contract to signal alerts to their own nodes, assuming they are willing to make the existence of such alerts public on the blockchain. An organization may wish to use this feature to find problems in its infrastructure due to old nodes it may have forgotten to update. And it may wish to do this on its own schedule, with its own timeouts, without relying on the schedule and plan of the BPs of the blockchain.

The BPs of the blockchain would still be able to force alerts on the nodes syncing that blockchain, assuming they had the authority (e.g. through the BP multisig) to satisfy the permissions of the `eosio.alert` account.

Use cases for the different alert filters
There are two types of alert filters that can be matched against: `implementation_version_filter` and `protocol_feature_filter`.

The `implementation_version_filter` allows the alert issuer to issue an alert intended to match on versions of a particular implementation (identified by name, e.g. `leap`), assuming the implementation follows the typical `<major>.<minor>.<patch>` version structure. If the implementation also supports a version suffix (e.g. `-rc1`), the version filter can still work; it will handle the ordering (to determine if a version is within the appropriate range or not) in the same way that version ordering works in semantic versioning.

The
`implementation_version_filter` is designed to be used to send alerts to versions of software that are known to have critical bugs that have been fixed in a later release. The alert mechanism allows sending alerts to those nodes with a message indicating the need to upgrade to a new version. The timeout can even be used to temporarily make those nodes unavailable, if the alert issuer deems that a safer course of action to protect the network than allowing the nodes to remain available and vulnerable.

The `protocol_feature_filter` is particularly useful to test the preparedness of any nodes in the network that have not yet been updated to support an upcoming protocol upgrade. Some time after a stable release of the implementation(s) of Antelope supporting the new protocol features has been made available, BPs may wish to issue an alert without any timeout but with an appropriately constructed `protocol_feature_filter` to send an alert to any nodes that have not been updated to support the necessary protocol features (identified by their digests) that they intend to activate in the near future, with a message urging them to upgrade by some soft deadline to avoid any disruption to their nodes. Then when the soft deadline arrives, the BPs can again send a similar alert, except this time it can have a short timeout (perhaps of 1 day) to put pressure on any procrastinating node operators who have still not updated. The message can also indicate the date of the hard deadline when the protocol features will be activated, at which point the nodes must have been updated or else they will remain indefinitely offline.
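The semantic-versioning style ordering described for `implementation_version_filter` could be modeled as follows. This is a simplified sketch: the function name `compare_versions` is hypothetical, parsing assumes well-formed `X.Y.Z` or `X.Y.Z-suffix` strings, and the suffix tie-break is cruder than full semver precedence; the real filter code would additionally handle version ranges.

```cpp
#include <cassert>
#include <string>
#include <tuple>

// Compare <major>.<minor>.<patch> numerically, and treat a version with a
// suffix (e.g. "-rc1") as ordered before the corresponding release, as in
// semantic versioning. Returns <0, 0, or >0 like strcmp.
int compare_versions(const std::string& lhs, const std::string& rhs) {
    auto parse = [](const std::string& v) {
        int nums[3] = {0, 0, 0};
        std::string suffix;
        size_t pos = 0;
        for (int i = 0; i < 3; ++i) {
            size_t next = v.find_first_not_of("0123456789", pos);
            nums[i] = std::stoi(v.substr(pos, next - pos));
            if (next == std::string::npos || v[next] == '-') {
                if (next != std::string::npos) suffix = v.substr(next + 1);
                break;
            }
            pos = next + 1; // skip the '.'
        }
        return std::make_tuple(nums[0], nums[1], nums[2], suffix);
    };
    auto [lmaj, lmin, lpat, lsuf] = parse(lhs);
    auto [rmaj, rmin, rpat, rsuf] = parse(rhs);
    if (std::tie(lmaj, lmin, lpat) != std::tie(rmaj, rmin, rpat))
        return std::tie(lmaj, lmin, lpat) < std::tie(rmaj, rmin, rpat) ? -1 : 1;
    if (lsuf == rsuf) return 0;
    if (lsuf.empty()) return 1;   // release orders after any pre-release
    if (rsuf.empty()) return -1;  // pre-release orders before release
    return lsuf < rsuf ? -1 : 1;  // crude lexicographic tie-break (semver is subtler)
}
```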