dziabko commented 2 years ago

Goal

We want to modify manta-collator-selection pallet to:

Monitor block producing data of collators
Use this data to kick off candidate collators (excluding invulerables, which is a set of collators ran by ourselves) based on certain conditions

Block Producing Data

A simple and effective block producing data is how many block this collator signed during in a section. It is easy to implement and is a good metric to measure how well a collator does in block production.

Evicting Condition

Evicting condition need to set to remove abnormal collators, this condition should be relative robust, for example:

When all the collators are all doing well, we should not evict any collators. Thus, a simple percentile based approach would not work. We don't want to create a hostile environment for collator runners.
It should detect outliers reliably, so that the block production is healthy.

Reference Design

Acala Collator Evicting Logic: https://github.com/AcalaNetwork/Acala/blob/master/modules/collator-selection/src/lib.rs#L529

Garandor commented 2 years ago

An initial idea (havent checked Acala yet) with some napkin math

tl;dr: Check closing thoughts, Acala's solution is probably better than my proposed algorithm

Assumptions for calamari:

We want to keep overall chain block times within 5% tolerance of 12s, no action if within.
Calamari RT specifies min 8 , max 55 collators (50 candidates + 5 invuln)
A chosen set of validators is active for an era of 6 hours
pallet_collator_selection::assemble_collators() seems to make no distinction between collator candidates and collators. A candidate will author at least 1 block in an era due to aura and blocks_per_era being larger than max_collator_candidates
We can not interrupt an active era and force a collator rotation
Kicking once per day ( 4 eras ) is good enough for small infractions ( 5-10% slower blocks overall)
We kick only when a collator is significantly underperforming, e.g. missing > 20% of his assigned blocks AND chain performance is affected significantly ( e.g. if every collator is underperforming by 5%, the total network was slowed by 5%, but there is no reason to kick any individual )
If blocktime is impacted significantly ( >10% network slowdown ), the worst collator is kicked after every era until it is back within 10%
if collator is fully down ( 0 blocks authored in an era), kick him regardless of overall network impact

in storage, keep ~an average of~ the number of missed blocks per collator over a day, i.e. store blocks_assigned_collator and blocks_missed_collator

⇒ two u32s ⇒ 8 byte per collator ⇒ 440B max extra storage + a blocktimestamp ( not sure but likely larger, u128 ? ) marking the next end of the day ⇒> 456B extra storage

Rough algorithm:

at end of era kick collator with largest blocks_missed_collator, if: blocks_missed_collator == blocks_assigned_collator || blocks_missed_total * 20 > blocks_per_era && ( blocks_missed_collator * 5 > blocks_assigned_collator || ( is_end_of_day && blocks_missed_collator * 10 > blocks_assigned_collator ))
if the to-be-kicked collator is an invulnerable, print warning and rerun the calculation with the second-worst collator
If the network is >5% slowed, but no collator could be kicked (only invulnerables are underperforming, min collator count is reached, or nobody dropped more than 20% of their blocks), issue a warning event
reset all counters after 4 eras / 1 day

Napkin math

Assuming only one collator is offline:

blocks_per_era = 6 hours / 12 S_PER_BLOCK = 1800 blocks
blocks_assigned_collator_per_era = 225 (8 collators) OR ~32 (55 collators)
one collator produces 0 blocks: chain blocktime impact: -12,5% OR -1.77%
- regeneration time: 6 hours
3 collators miss half their blocks:
- slowdown at 8 collators: 18.75% → no action, just warning
- slowdown at 10 collators: 15% → 11% (6 hours later) → 6.25% (6 hours later) → can’t kick, just warn → average slowdown over the full day: 9.6%
- slowdown at 55 collators: 3% → no action
Going offline and coming back on within 12 hours (2 eras) won’t get punished if the network overall is healthy enough ( < 10% slowdown or it’s the end of the day )

Caveats

This ratelimits kicking to 1 collator per 6 hours, preventing mass ejections but slowing down response to multiple failures
This simple average is more sensitive to misbehavior in the early eras of a day ( possibly use a sliding window here?)

closing thoughts

If y’all think this simple check is promising enough i can graph the behavior. i’m thinking an exponential reaction to deteriorating overall network health could be desirable (e.g. checking for multiple kicks if overall health is bad), but it seems like overkill here.

Also, a 20% missed blocks ratio before kicking might be too high, even if overall network health is okay. I believe we could expect collators to stay <10% dropped blocks during normal operation after accounting for maintenance downtimes.

Acala seems to be using an outlier-from-average kicking method, which is simpler than above idea ( using a single u32 per account ("session points"), no time tracking ) and applying a threshold to average instead of to target/perfect performance. It's probably better than above idea as averaging performance prevents from starting to kick people if everyone's performance was bad due to some systemic problem over the era. However, in their solution it'd be conceivably possible (in a permissionless system - which this change isn't aiming to support - where we run 10 nodes and the community 90) that average performance is low without any individual reaching the kicking threshold - calculated from average performance. So this doesn't maximize performance, it just removes negative outliers from the set. Since that'll is exactly what this ticket's title is, we can just go with their solution ;)

LMKWYT

stechu commented 2 years ago

I don't think using average is a good idea, the root reason is that average is not a Robust Statistics.

Let me give an adversarial example, assume 100% is the goal to reach.

node 1	node 2	node 3	node 4	node 5
100%	100%	20%	20%	20%

Now, the average is 52%. So now node 3, 4, 5 are more than 50% below average (any reasonable threshold will cause them get kicked!). And they are not even outliers.

A better idea is to use median (median is a robust statistics) or some percentile of block producing count as the baseline for evicting.

The mean is not a robust measure of central tendency. If the dataset is e.g. the values {2,3,5,6,9}, then if we add another datapoint with value -1000 or +1000 to the data, the resulting mean will be very different to the mean of the original data. Similarly, if we replace one of the values with a datapoint of value -1000 or +1000 then the resulting mean will be very different to the mean of the original data.

The median is a robust measure of central tendency. Taking the same dataset {2,3,5,6,9}, if we add another datapoint with value -1000 or +1000 then the median will change slightly, but it will still be similar to the median of the original data. If we replace one of the values with a datapoint of value -1000 or +1000 then the resulting median will still be similar to the median of the original data.

Described in terms of breakdown points, the median has a breakdown point of 50%, meaning that half the points must be outliers before the median can be moved outside the range of the non-outliers, while the mean has a breakdown point of 0, as a single large observation can throw it off.

bhgomes commented 2 years ago

Median is great.

Garandor commented 2 years ago

Now, the average is 52%. So now node 3, 4, 5 are more than 50% below average (any reasonable threshold will cause them get kicked!). And they are not even outliers.

Algorithm working as intended to not remove nodes that aren't outliers - it maintains uniformity of the set, not performance. Changing "outlier" to "underperformer" makes it clear that the averaging solution isn't acceptable.

I think the main question to answer is whether we want to benchmark against perfect or median performance. While median performance does not guarantee restoring blocktime to full, it prevents punishment of individuals for systemic problems. If "too many collators can slow down the network" is a true statement, I would prefer not to use the median and kick collators until the optimum count for the network is restored ( if systemic slowdown is 20% or more, otherwise my proposed algorithm does nothing - which should give enough time to notice the problem and notify people )

I just noticed that i mistakenly mentioned "average" in my proposed algorithm although it doesn't actually do any averaging, it's a thresholding solution that kicks nodes that signed less than 80% of their blocks when the network is slowed down by 5% or more. Editing it.

~The session points system might work for us too reducing storage from 2 u32 per collator to 1, if we drop the "kick once per day if network is beetween 5-10% slowdown" and just kick once per era in general, 6h as a trial period seems long enough for good operators to fix transient problems with their nodes and would collapse above rule to~

~kick at end of each era if: session_points_collator == 0 || points_in_session_total * 1.05 < max_session_points && ( session_points_collator * 1.25 * num_of_collators < points_in_session_total) ( multiply just to avoid divisions, cleartext would be session_points_collator < 80% * (points_in_session_total / num_of_collators) )~

~points_in_session_total = sum of all session_points_collator max_session_points = theoretical max of perfect performance over all collators~

EDIT: session_points average naively applied won't work as collators will be assigned different amounts of blocks ( 1 block difference max due to aura), acala are ignoring this difference - impact of 1 block at max collator set size on calamari is 3%: total_session_point.checked_div(candidates_len)

This would be clearly communicable:

if, after registering, your collator drops more than 20% of its assigned blocks, it is at risk of automatic removal

The obvious problem with this is that it only restores blocktime to within 5% and if all collators are 81% effective, nobody gets evicted. The former might be acceptable (or the 5% guard is simply removed) the latter is a systemic problem that should be solved in communication with the permissioned collator set

This leaves otherwise perfect nodes a window of 72minutes of tolerated downtime per 6hr session - so at worst, if all nodes colluded to drop 19% of their blocks, the network would run 19% slower without any automated remedies. Since we stay permissioned, this won't happen.

stechu commented 2 years ago

There is a parameter control the maximum number of collators, so I think controlling the number of collators should not be the goal of the evicting logic. (This also makes the design much simpler).

Also, it is impossible to rely on this kind of evicting logic to provide any sort of guarantee of the final block time, say, if almost all the collators are under-performing, no evicting logic can provide guarantee of block time. So this should be a no goal. With that said, the goal here is really do the best to remove the underperformed outlier collator in a robust manner.

I think median (or some kind of percentile measure) is a good baseline for this logic. We need to figure out which exact threshold to use to kicking off collator here. We can decide that after getting some data. However, I worry that the data could be a bit anecdotal. I would say a statistical robust measure plus some real data evidence should be good enough.

BTW: Through personal communication with Acala team, their code face the exact issue I pointed out (sometimes they kick off all the third party collators).

Garandor commented 2 years ago

Okay with me, as long as we accept synchronous performance degradation of the whole collator set as a non-action case for this topic here. I agree that I see no simple way to have both only outliers punished AND regain full performance.

Does subscan have historic block data queryable? We could pull block stats of calamari or acala (or any other aurachain) off of there and basically backtest several parameters against historic performance data to see what situations we'd be kicking someone. Expanding the block data set with missed blocks from timestamps and inferring the collator from its place in the round robin set could be a bit of work though.

Since this is both a low-accuracy and a low-criticality measure (it's basically just an effort saver to autokick the worst offenders without bothering sudo/governance), simply setting the divergence-from-median-to-kick intuitively high to begin with and iterating from there seems like a good enough solution.

One thing to consider is that if median drops from 100% to 70%, a static threshold of 30% will shift down too, making us accept 40% block performance without kicking the offender instead of 70% - i'm not sure we want that. I'd at least add a scaling factor to the kick threshold, e.g.

median ~100% => threshold 50%
median = 50% => threshold 10%
threshold stays at 10% if median goes even lower

Garandor commented 2 years ago

After discussion, we selected the below solution for an initial implementation, the concrecte behavior of which is to be benchmarked against real-world collator performance stats later on.

Configurables

n-th Percentile ( initial value = 80 )
y% Performance Threshold ( initial value = 90 )

Approach

Calculate n-th percentile of collator performance
Select the n-th percentile of collators
Kick every collator in the set whose performance is below y% of the n-th percentile

Manta-Network / Manta

Evicting Underperforming Collators #335

Goal

Block Producing Data

Evicting Condition

Reference Design

Assumptions for calamari:

Rough algorithm:

Napkin math

Caveats

closing thoughts

Configurables

Approach