casper-ecosystem / developer-rewards

A place where developers can get rewarded for their contribution to the Casper Ecosystem and Docs
Apache License 2.0

In July-August 2023, unknown conditions suddenly arose to such a severe degree that at least 100 nodes changed their uptime indicators dramatically. This write-up should help. #42

Closed mathematicw closed 11 months ago

mathematicw commented 11 months ago

Reward Size in USD

400 USD

Reward category

Other

Description

What happened (abstract)

We are all observing strange behavior of the node uptime estimation tool. About 100 nodes that had excellent performance and uptime - i.e. they were hitting 700 every week, or most weeks of the year (except for the well-known 3-7 days when the whole network was lagging) - suddenly started to lag out of the blue in August (since week #6), so it was no longer possible for them to reach "700", and therefore they were no longer paid at all, or paid very sparingly. Some nodes look simply surprising: they went through fire and water without so much as a sneeze on the common laggy days, yet suddenly stumbled every day in August.

The network had common and unavoidable lag days due to events such as:

* Jan 28 to Feb 2 - attack event (mentioned in the group: https://t.me/CasperTestNet/22552)
* Feb 21 - 1.4.13 upgrade
* Feb 28 to March 1 - many operators lost their LP during the attack on port 8888 (source IPs: 185.234.210.155, 82.1.51.142 and others)
* April 12 - does not seem to depend on the host
* May 4 - 1.4.15 upgrade
* June 21 - mass lag of Hetzner servers located in Germany (plus one in Finland)

New events that took place during the period examined:

* July 6 - firewall update, whitelisting 3.91.157.200 for the scoring tool
* July 17 (Q3, week #3) - upgrade 1.5.2
* Note: week #5 has more paid nodes, as it includes the grace period (from Jul 31 to Aug 2)
* August 1 - firewall update, whitelisting 3.80.27.246 for the scoring tool
* August 7 - a date with an unknown event, after which many nodes became heavily laggy

On the day of the 1.5.2 update - July 17 - many nodes lost longevity points (LP) and some even went offline; some required up to 2 consecutive days for the update. July 17-18 are therefore not counted as lags: this is more a matter of operator negligence and is marked as 'missed upgr'.

The usual causes of lags are weak server configuration, server/node misconfiguration or oversight, and network problems in different parts of the testnet. It is obvious that in August none of these conditions could suddenly arise to such a severe degree that a hundred nodes would change their indicators dramatically. So in this study I am going to assume that the "node uptime" metrics produced by the node scoring tool's survey do not reflect actual node uptime.
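To make that assumption testable rather than a black-box dispute, each operator could keep an independent availability log and compare it with the weekly spreadsheet. Below is a minimal self-check sketch, assuming the node exposes a REST status endpoint on port 8888 (the port mentioned above); the URL, polling interval and JSON field names are assumptions for illustration, not a description of how the actual scoring tool works.

```python
# Hypothetical self-monitoring sketch: poll a node's REST /status endpoint at a
# fixed interval and append each result to a CSV, so the operator has an
# independent uptime record to compare against the scoring-tool spreadsheet.
# Endpoint path, port and JSON fields are assumptions, not confirmed here.
import csv
import time
from datetime import datetime, timezone

import requests  # third-party: pip install requests

NODE_STATUS_URL = "http://127.0.0.1:8888/status"  # assumed local REST port
POLL_INTERVAL_S = 60
LOG_FILE = "uptime_selfcheck.csv"

def poll_once() -> tuple[str, bool, str]:
    """Return (timestamp, reachable, detail) for a single status probe."""
    ts = datetime.now(timezone.utc).isoformat()
    try:
        resp = requests.get(NODE_STATUS_URL, timeout=5)
        resp.raise_for_status()
        # Record the latest block height if present, purely as extra context.
        info = resp.json().get("last_added_block_info") or {}
        return ts, True, f"height={info.get('height')}"
    except Exception as exc:  # network error, timeout, bad JSON, ...
        return ts, False, type(exc).__name__

def main() -> None:
    with open(LOG_FILE, "a", newline="") as fh:
        writer = csv.writer(fh)
        while True:
            writer.writerow(poll_once())
            fh.flush()
            time.sleep(POLL_INTERVAL_S)

if __name__ == "__main__":
    main()
```

Comparing such a log against the published score for the same week would show immediately whether a reported lag corresponds to a real outage or to a problem on the polling side.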

* In any case, the attachment contains a list of all problematic nodes in the network, with comprehensive data for evaluating network performance for any further research. And if you need logs of those nodes, we can post the list of required PubKeys directly in the testnet Telegram group and explicitly ask their operators to upload them.

Note

The only means of registering those lags is the "Casper Testnet Participant Scores" spreadsheet. But this spreadsheet is not published right at the end of each week, so I (like everyone else in the network) did not have the opportunity to spot abnormalities in time, and this study may therefore no longer contain fully up-to-date data. In addition, note that I evaluated Validator/KeepUp status as of September 8 and September 12. After I made my suggestion in the testnet Telegram group, many people may have activated their bids in the validator auction, so this data may already be out of date (and indeed there is such movement on the net; you can see signs of the race here: https://testnet.cspr.live/validators).

Given the long experience of the test network, 1 lag on a date common to all participants can be considered normal behavior. So even the July 17 lag will not be taken into account when evaluating the performance level of a node (nor will the lags of February 21, May 4, June 21, etc.). 'Performance level' means 'Good' or 'Bad'.

Sources used: the Casper Testnet Participant Scores spreadsheets for 2023 Q1, Q2 and Q3 (as of Sept. 6), plus CNM (https://cnm.casperlabs.io/network/casper-test/detail) as of September 8-12.
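For reproducibility, the per-node summaries in lag_research.ods could be rebuilt directly from those spreadsheets. The sketch below only illustrates that step, assuming a flat export with public_key, week and score columns; the sheet layout and the file name used here are assumptions, not the actual export format.

```python
# Hypothetical reproduction of a per-node summary: load a "Participant Scores"
# export and count, per public key, how many weeks reached the full score of
# 700. Column names ("public_key", "week", "score") are assumed.
import pandas as pd  # pip install pandas odfpy (odfpy is needed for .ods files)

def summarize_scores(path: str) -> pd.DataFrame:
    df = pd.read_excel(path, engine="odf")   # .ods export of the spreadsheet
    df["full_week"] = df["score"] >= 700     # week counted as a "700" week
    summary = (
        df.groupby("public_key")
          .agg(weeks_total=("week", "nunique"),
               weeks_at_700=("full_week", "sum"))
          .sort_values("weeks_at_700")
    )
    return summary

if __name__ == "__main__":
    # Hypothetical file name for a local export of the Q3 sheet.
    print(summarize_scores("participant_scores_2023_q3.ods").head(20))
```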

I try to answer some questions:

Some conclusions:

Whether the '1 lag' cases are counted or not, KeepUp nodes have more lags in the July-August period. This could be related to the scoring tool IP change, to 1.5.2 itself, or to something else. We can keep monitoring.

* Of course, it is fair and logical to say that validator node operators are already more responsible and more interested in the testnet, and most likely they have better servers (there is more load on a validator), and therefore they have fewer lags. Still, such correlations hold for the majority of cases.

Things we would like you to pay attention to and explain:

  1. Are nodes polled while they are finalizing a block? If so, during this time the node may experience additional load and return a bad result, even though it has excellent servers.

  2. One of the burning questions (and one that has no reasonable explanation in the period under study) is the number of nodes which, despite impeccable performance, lost longevity out of nowhere on July 17 (but it seems you have already fixed that).

  3. Finally, we are currently experiencing a sort of attack: certain IPs (and they seem to be the same across the whole network) are spamming some nodes, though not all of them. You are aware of this case, but there is no solution yet.

    Thank you for evaluating my work.


    lag_research.ods

Acceptance Criteria

If it turns out that the problem is not due to errors in the uptime tool, then the traffic that spams the nodes needs to be analyzed and a more accurate firewall defense against such an attack developed. Please also create documentation on the principles and details of the node uptime estimation tool, because people wonder what is going on inside the black box. This attack or bug, whatever it is, is causing me severe financial problems, so I would like some transparency. With this node analysis you have more free time, and we can request logs of all the problematic nodes.
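As a starting point for the traffic analysis requested above, a simple ranking of source IPs from a connection log would show whether the spam really comes from the same few addresses across the network. The sketch below assumes a plain-text log (for example a tcpdump text export) in which source IPs appear on each line; the file name and format are assumptions.

```python
# Hypothetical first pass at "analyze the traffic that spams nodes": extract
# IPv4 addresses from a plain-text connection log and rank them by how often
# they appear, to see whether a few source IPs dominate. Log path is assumed.
import re
import sys
from collections import Counter

IPV4_RE = re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b")

def top_sources(log_path: str, limit: int = 20) -> list[tuple[str, int]]:
    counts: Counter[str] = Counter()
    with open(log_path, "r", errors="replace") as fh:
        for line in fh:
            counts.update(IPV4_RE.findall(line))
    return counts.most_common(limit)

if __name__ == "__main__":
    path = sys.argv[1] if len(sys.argv) > 1 else "connections.log"
    for ip, hits in top_sources(path):
        print(f"{ip:15s} {hits}")
```

Any addresses that clearly dominate such a ranking could then be rate-limited or blocked at the firewall, which is the defense this acceptance criterion asks for.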

NicolasZoellner commented 11 months ago

Dear @mathematicw,

Thank you for your proposal; our technical experts have reviewed it, and we appreciate that you have put together a good summary. Nevertheless, we are currently unable to solve this from the CA side, and it falls outside the scope of the DevReward program. The developers from CL are currently working on that issue.

Therefore, I have to decline this proposal and reject it for DevReward consideration.

Nevertheless, we highly appreciate your input. I recommend that you seek such solutions in the main Discord channel, within the Dev section.

Best regards, Nicolas Zöllner