Antelope Alert Mechanism: Timeout and warnings for out-of-date nodes

Motivation

Node operators may be unaware of critical upgrades to Leap (or other future Antelope implementations) that are necessary either for security reasons (to fix security bugs in the old version of the software) or to prepare for an upcoming protocol upgrade.

If a node operator has not updated their node to fix a critical security patch, they may be better off having their software shut down (or at least stop synchronizing with the blockchain or communicating with peers) than to continue processing on their vulnerable node.

If a node operator has not updated their an upcoming protocol upgrade, then their node will automatically stop synchronizing with the blockchain anyway. But this means their node (and their businesses dependent on it) remain unavailable until they upgrade their tools and processes to transition to the new version of the node that supports the protocol upgrade. Depending on the upgrade, the dependent services they maintain, and their state of readiness, this could take a long time. And all the while, their business could be down.

It is obviously desirable for the node operators to have sufficient time well before the protocol activation to carry out this upgrade and have all their updated nodes ready to go for protocol activation. The issue is that many node operators aren't paying close attention to planned activation despite efforts to spread the message and do outreach. Many procrastinate even determining what the scope of the upgrade work is until it becomes obvious that they have no other choice but to take action to avoid downtime of their services.

So a feature to do a soft trial run of the impact of a protocol activation could be valuable in getting more node operators to upgrade in a timely manner to minimize downtime. The idea is to take down the nodes that aren't ready for the upgrade for a temporary and adjustable timeout period as a way to make it clear to them that some of their nodes haven't been updated properly for the upcoming upgrade. But since this would not actually activate the protocol features yet, it would be possible to allow those old nodes to resume syncing and operation after the timeout period. That way the disruption to their business is minimal and it gives them an opportunity for a significant amount of time after that event to actually upgraded their nodes so that they are ready for the real upgrade.

Feature

Changes can be added to the chain_plugin to enable this new alert feature. The signaling mechanism would be through a new contract to be deployed on a standard account: eosio.alert. If the account does not exist, a contract is not deployed to the account, or the contract does not appear to be following the appropriate alert standard, this feature will automatically be disabled (at least until those conditions change) and not cause disruption to the operation of the node.

Alert standard

The alert standard involves tracking the current state of alerts on the chain and providing a mechanism to access it. It also involves actions alertadd and alertremove which are not to be called but only to be emitted as inline actions by the eosio.alert contract to act as an event to signal that a new unexpired alert has been added to (in the case of alertadd) or removed from (in the case of alertremove) the state of active alerts tracked within the alerts table.

The signature of the alertadd and alertremove actions are:


struct alert_event_v0 {
   uint64_t    alert_id; // unique identifier for alert used to lookup alert details
   eosio::name issuer; // issuer of alert duplicated in event to help consumers determine
                       // whether to bother to lookup alert details
};

using alert_event_type = std::variant<alert_event_v0>;

void alertadd(alert_event_type alert_event);
void alertremove(alert_event_type alert_event);

The standard also provides an additional action getalerts to access the currently active alerts for any alert issuer:

// Generic pagination data structures:

template <typename LookupKeyType>
struct start_pagination {
   uint32_t      limit;
   LookupKeyType start_key;
};

template <typename Cursor>
struct before_cursor_pagination {
   uint32_t last;                // number of items to retrieve prior to cursor
   std::optional<Cursor> before; // optional opaque cursor
};

template <typename Cursor>
struct after_cursor_pagination {
   uint32_t first;              // number of items to retrieve after to cursor
   std::optional<Cursor> after; // optional opaque cursor
};

template <typename LookupKeyType, typename Cursor>
using pagination_query = std::variant<start_pagination<LookupKeyType>,
                                      before_cursor_pagination<Cursor>,
                                      after_cursor_pagination<Cursor>>;

template <typename Cursor>
struct pagination_metadata {
   Cursor first_cursor;
   Cursor last_cursor;
};

// Data structures for alerts tracked in state:
// They do not need to be stored in table data with this structure.
// But it should be possibly to transparently convert from the table data into the following structure.

struct implementation_version {
   uint32_t major;
   uint32_t minor;
   uint32_t patch;
};

struct version_range {
   implementation_version start_version; // included in range
   implementation_version end_version; // indicates upper bound but not included in range
};

/**
 * Note version_range{{1.2.3}, {2.0.0}} defines a range which
 * includes, for example, implementation versions:
 * - v1.2.3
 * - v1.2.4
 * - v1.3.0-rc1
 * - v1.3.0
 * 
 * but excludes, for example, implementation versions:
 * - v1.2.2
 * - v2.0.0-rc1
 * - v2.0.0
 * - v2.0.1
 * - v3.0.0
 */ 

struct implementation_version_filter {
   std::string implementation_name; // Name to match against implementations of Antelope (e.g. "leap").
                                    // This filter does not match implementation if its name does not match this string.
   std::vector<version_range> version_ranges_to_include; // implementations with version included within any of the version ranges in list are matched by the filter
};

struct protocol_feature_filter {
   std::vector<eosio::checksum256> missing_protocol_features; // implementations that do not support one or more of the protocol features (identified by their digest) in this list match the filter
};

using node_filter = std::variant<implementation_version_filter, protocol_feature_filter>;

struct alert_v0 {
   uint64_t    alert_id; // unique identifier for the alert (regardless of issuer)
                         // which is never reused for another alert tracked in the contract
   eosio::name       issuer;
   eosio::time_point start_time; // should be the block timestamp time when this alert was created
   eosio::time_point expiration;

   std::vector<node_filter> filters; // filters to determine which nodes are impacted by the alert
                                     // (the node is impacted if its implementation matches any of the filters in the list)

   uint32_t timeout_sec; // time (in seconds) after alert is raised that impacted nodes should be in timeout
   std::string alert_message; // alert message to display to impacted nodes 
};

using alert_type = std::variant<alert_v0>;

// Data structures for inputs and outputs of the getalerts action:

struct get_alerts_by_id_input_v0 {
   eosio::name                          issuer;
   pagination_query<uint64_t, uint64_t> query;
};

using get_alerts_by_id_input_type = std::variant<get_alerts_by_id_input_v0>;

struct get_alerts_by_id_output_v0 {
   std::optional<pagination_metadata> metadata;
   std::vector<alert_type>            results;
};

using get_alerts_by_id_output_type = std::variant<get_alerts_by_id_output_v0>;

get_alerts_by_id_output_type getalerts(get_alerts_by_id_input_type input);

The getalerts action allows looking up one or more actions that are issued by a specified issuer (std::get<get_alerts_by_id_output_v0>(input).issuer) using a pagination mechanism. The pagination mechanism allows the caller of the action to get the matching actions in order (either in ascending or descending alert_id order, which due to the nature of the way alert_id works also corresponds to the order in which alerts were created).

In the returned output, assuming the results vector is not empty, a pagination_metadata is included which will have first_cursor and last_cursor values. In the case of the getalerts actions, these cursor values will actually be alert_ids. Regardless of which order the pagination was requested in, the results vector from the getalerts output will always be sorted in ascending order of alert_id and, assuming the results vector is not empty, first_cursor will be the alert_id of the first item in the results vector and last_cursor will be the alert_id of the last item in the results vector.

The comments of the fields for the alert_type (and related) data structures explain what they are and what purpose they serve for the alert. However, it is worth mentioning the following. First, timeout_sec can be 0 which means that this is an alert that does not cause any nodes to become unavailable for any duration of time. Second, the expiration time supersedes the duration of a timeout indicated by timeout_sec meaning that once the alert expires, any timeouts caused by that alert on the node should end unless some other unexpired alert is causing a timeout to the node. Finally, getalerts should return all tracked nodes even if they have already expired (even though chain_plugin will ultimately ignore those alerts). This is because clients will need some way to know which expired alerts are still tracked in state and should ideally be garbage collected.

The standard also requires that the contract provides a mechanism to add new alerts but does not require a particular interface for how to do that. It simply requires that within the action that creates the new alert, the contract ensures that it:

requires the authority of the issuer of that new action;
chooses a unique alert_id that is greater than any other alert_id previously assigned to an alert tracked by the contract;
ensures the start_time of the alert is the current block time at the time the alert is being added;
ensures that the expiration of the alert is strictly greater than the start_time plus 5 minutes;
and, emits an alertadd inline action with the alert_id chosen for the new alert and the same issuer.

The standard also requires that the contract provides a mechanism to remove an existing alert (whether expired or not). The standard does not specify the interface for this action to remove an existing alert, but it requires that:

the issuer of the alert to be removed is provided as an authority for the action;
the start_time of the alert plus 5 minutes is strictly less than the current time;
and, if and only if the removed alert had not yet expired, then the action would create an alertremove inline action with the alert_id and issuer set to the same values as that of the removed alert.

The standard recommends the contract provide some action that allows anyone to garbage collect expired alerts and also recommends that the action to add a new alert does a little bit of work in garbage collecting expired alerts (garbage collected two expired alerts, if possible, within the action to add a new alert is recommended).

The standard does not prevent other actions and capabilities to be added to the contract, but it requires that the contents of the alerts tracked in state remain immutable. If there is a desire to update the contents of an unexpired alert, the old alert_id should no longer be used, an alertremove action should be emitted for the old alert_id, the new contents should be associated with a brand new alert_id, an alertadd action should be emitted for the new alert_id, and the start_time in the new contents of the alert must be updated to the current block time.

chain_plugin changes

With this feature enabled, chain plugin will watch out for alertadd and alertremove actions in the contract on the eosio.alert account within traces of executed transactions. It will attempt the decode its input payload according to alert_event_type. If decoding is not possible, it will silently ignore processing the event. Otherwise, it will check the value of the issuer field to see if it matches the list (an issuer name of eosio.alert always matches); see later discussion for how the node can be configured to match against other names. If issuer does not match, it will skip this event. Otherwise, it uses the value of alert_id (and possibly state lookups) to keep its own internally tracked list of alerts (which can be ephemeral and not durably stored on disk) in sync with the contract state.

If the event is from alertadd, the chain plugin will lookup the alert details using the getalerts action. If it doesn't exist it can skip this event. The standard does not want to require the contract to implement state stored in tables in a particular way. Instead it enforces how that state can be accessed via the getalerts action. This means that the chain_plugin will need to create and execute speculative transaction which it will discard rather than commit to state after the transaction from a signed block that generated the alertadd action (either immediately after the transaction or later at the end of the block, but the queries must occur prior to the start of the next block). Because the standard requires that the expiration is strictly greater than the start_time and disallows removing alerts unless the start_time is strictly less than the current block time, it is guaranteed that the alert referenced by the event will still be in the state by the end of that block (though no guarantees exist beyond that point).

Once chain_plugin gets the alert details it can use the filters in the filters field to determine whether it applies to this node. See the section further below to learn more about how the alert filters can be used for particular use cases. If it doesn't match, this event can be skipped. Otherwise, the alert will need to be tracked by chain_plugin in its ephemeral in-memory state. The alert with its associated contents is not immediately added to the ephemerally tracked active alerts within the memory of the chain_plugin but is instead added into a staging queue. The alerts are moved from this queue into the ephemerally tracked active alerts only when the block they originated from becomes irreversible.

If the event is from alertremove and the issuer matches, the chain plugin will add into another queue a pair consisting of the block number of the block where the alertremove originated from and the alert_id from the alertremove event. Then as the blocks become irreversible, it will appropriately consume items from that queue and handle them by removing any alert from its ephemerally tracked active alerts in memory that matches the alert_id.

The two queues mentioned above should be appropriately trimmed from the back on fork switches so that information from other branches that is not part of the current canonical chain does not unintentionally carry over and ultimately impact the ephemerally tracked active alerts.

On startup, the chain_plugin should use getalerts to lookup all unexpired alerts from the contract state for the eosio.alert issuer as well as any other matching issuers that are configured for the chain_plugin. It should partition these alerts into two sets based on their start_time: those with a start_time less than or equal to timestamp of the last irreversible block, and those with a start_time greater than that timestamp. The first set of alerts forms the initial state for the ephemerally tracked alerts in chain_plugin memory and the second set is added into the queue which may be further pushed into from alertadd events.

Note that this may not accurately reconstruct the ephemerally tracked alerts of another node with similar configuration that had been running prior for a while, because the newly started node would initiate state as of the (reversible) head block which has already applied removals initiated in recent (still reversible blocks). This means that the side-effect of removing the alert would be applied earlier for the restarted node than for the node that had been running for a while. Even worse, if a fork switch happens soon after startup, it is possible for some alerts that would remain active from the perspective of a node correctly following the canonical blockchain from a much earlier point in time to not be present at all in the recently restarted node (at least until it restarts again). These edge cases are mitigated though because the standard enforces a 5 minute delays between when an alert is created and when it first can be removed (or garbage collected). So if the delta in block timestamp between the head block and last irreversible block remains less than 5 minutes (as it usually does on the EOS Network, for example) then these edge cases would not be encountered at all. Given the mitigations and the fact that encountering these edge cases seems fairly harmless given the nature of what these alerts are trying to accomplish, it may be worth it accepting the edge cases as part of this design rather than constructing a more sophisticated design which makes those edge cases impossible but makes other compromises (e.g. putting the ephemeral tracked alert state into a database like Chainbase, or requiring the state to be unwound back to irreversible on startup and then replayed forward to initiate the chain_plugin state dealing with tracking alerts accurately).

Side-effects of active alerts

While an alert is active, we want to provide some way for the chain operator to see the state of currently active alerts through some API (perhaps this will need to be through the producer API or else we provide endpoints for the chain API that are meant to be kept for private operator use rather than exposing to the public?).

If the alert has a timeout, the timeout end time is determined by adding timeout_sec seconds to the start_time. The actual wall-clock time (not the block timestamp of the current head block) should then be used to determine when the timeout ends because during timeout the node may not be able to continue synchronizing blocks. Alternatively, we could provide a softer version of timeout to accomplish the same goals laid out in the motivation section by still allowing the node to synchronize and validate blocks but to simply disallow access to the endpoints in the chain API (or maybe all but the one to monitor the status of alerts?).

When a new alert becomes active, the nodeos logs should indicate that happened and should also include the alert_message in the output. Perhaps this event should also be logged in other plugins that may exist in nodeos at the time like the Prometheus plugin.

If nodeos is in the middle of a timeout due to an alert, that fact should probably be repeated in the logs periodically (along with the alert_message) to make it clear to the node operator just looking at the tail of the logs what is going wrong with the node and (hopefully through the alert_message) what action they need to take to resolve it.

Subscribing to other alert issuers

We can optionally choose to go further and allow the node operator to provide a list of additional alert issuers (by their Antelope account name) that chain_plugin will respect in terms of matching on alert events. The eosio.alert account will always be included as part of this match list whether explicitly included by the node operator or not.

This allows other organizations to use the eosio.alert contract to signal alerts to their nodes assuming they are willing to make the existence of such alerts public on the blockchain. An organization may wish to use this feature to find any problems in their infrastructure due to old nodes they may have forgotten to update. And they may wish to do this on their own schedule with their own timeouts without relying on the schedule and plan of the BPs of the blockchain.

The BPs of the blockchain would still be able to force alerts on the nodes syncing that blockchain assuming they had authority (e.g. through the BP multisig) to satisfy the permissions of the eosio.alert account.

Use cases for the different alert filters

There are two types of alert filters that can be matched against: implementation_version_filter and protocol_feature_filter.

The implementation_version_filter allows the alert issuer to issue an alert intended to match on versions of a particular implementation (identified by name, e.g. leap) assuming the implementation follows the typical <major>.<minor>.<patch> version structure. If the implementation also supports a version suffix (e.g. -rc1) the version filter can still work and it will be handle the ordering (to determine if it is within the appropriate range or not) in the same way that version ordering works in semantic versioning.

The implementation_version_filter is designed to be used to send alerts to versions of software that are known to have critical bugs that have been fixed in a later release. The alert mechanism allows sending alerts to those nodes with a message indicating the need to upgrade to a new version. The timeout can even be used to temporarily make those nodes unavailable if the alert issuer deems that to be a safer course of action to protect the network than allowing the nodes to remain available and vulnerable.

The protocol_feature_filter is particularly useful to test the preparation of any nodes in the network that have not yet updated to support an upcoming protocol upgrade. After some time that a stable release of the implementation(s) of Antelope supporting the new protocol features has been made available, BPs may wish to issue an alert without any timeout but with an appropriately constructed protocol_feature_filter to send an alert to any nodes that have not been updated to support the necessary protocol features (identified by their digests) that they intend to activate in the near future with a message urging them to upgrade by some soft deadline to avoid any disruption to their nodes. Then when the soft deadline arrives, the BPs can again send a similar alert except this time it can have a short timeout (perhaps of 1 day) to put the pressure on any procrastinating node operators who have still not updated. The message can also indicate the date of the hard deadline when the protocol features will be activated and at which point the nodes must have been updated otherwise they will remain indefinitely offline.

[x] #296

AntelopeIO / leap