casper-ecosystem / developer-rewards

A place where developers can get rewarded for their contribution to the Casper Ecosystem and Docs
Apache License 2.0

In July-August 2023, unknown conditions suddenly arose to such a severe degree that at least 100 nodes changed their uptime indicators dramatically. This write-up should help. #42

Closed mathematicw closed 11 months ago

mathematicw commented 11 months ago

Reward Size in USD

400 USD

Reward category

Other

Description

What happened (abstract)

We are all observing strange behavior of the node uptime estimation tool. About 100 nodes that had excellent performance and uptime - i.e. they were hitting 700 every week, or most weeks of the year (except for the well-known 3-7 days when the whole network was lagging) - suddenly started to lag out of the blue in August (since week #6), so it was no longer possible for them to reach "700", and therefore they were no longer paid at all, or paid very sparingly. Some nodes look simply surprising: they went through fire and water without so much as a sneeze on the common laggy days, yet suddenly stumbled every day in August.

The network had common and unavoidable lag days due to events such as:

* Jan 28 to Feb 2 - attack event (mentioned in the group: https://t.me/CasperTestNet/22552)
* Feb 21 - 1.4.13 upgrade
* Feb 28 to March 1 - many operators lost their LP during the attack on port 8888 (source IPs: 185.234.210.155, 82.1.51.142 and others)
* April 12 - does not seem to depend on the host
* May 4 - 1.4.15 upgrade
* June 21 - mass lag of Hetzner servers located in Germany (plus one in Finland)

New events that took place during the period examined:

* July 6 - firewall update, whitelisting 3.91.157.200 for the scoring tool
* July 17 (Q3, week #3) - upgrade 1.5.2
* Note: week #5 has more paid nodes, as it includes the grace period (from Jul 31 to Aug 2)
* August 1 - firewall update, whitelisting 3.80.27.246 for the scoring tool
* August 7 - a date with an unknown event, after which many nodes became heavily laggy

On the day of the 1.5.2 update - July 17 - many nodes lost longevity points (LP) and some even went offline; some required up to 2 consecutive days for the update. July 17-18 are therefore not counted as lags: this is more a matter of operator negligence and is marked as 'missed upgr'.

The usual causes of lags are weak server configuration, server/node misconfiguration or oversight, and network problems in different parts of the testnet. It is obvious that in August none of these conditions could suddenly arise to such a severe degree that a hundred nodes would change their indicators dramatically. So in this study I am going to assume that the "node uptime" metrics produced by the node scoring tool's survey do not reflect actual node uptime.
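To make that assumption testable rather than a black-box dispute, each operator could keep an independent availability log and compare it with the weekly spreadsheet. Below is a minimal self-check sketch, assuming the node exposes a REST status endpoint on port 8888 (the port mentioned above); the URL, polling interval and JSON field names are assumptions for illustration, not a description of how the actual scoring tool works.

```python
# Hypothetical self-monitoring sketch: poll a node's REST /status endpoint at a
# fixed interval and append each result to a CSV, so the operator has an
# independent uptime record to compare against the scoring-tool spreadsheet.
# Endpoint path, port and JSON fields are assumptions, not confirmed here.
import csv
import time
from datetime import datetime, timezone

import requests  # third-party: pip install requests

NODE_STATUS_URL = "http://127.0.0.1:8888/status"  # assumed local REST port
POLL_INTERVAL_S = 60
LOG_FILE = "uptime_selfcheck.csv"

def poll_once() -> tuple[str, bool, str]:
    """Return (timestamp, reachable, detail) for a single status probe."""
    ts = datetime.now(timezone.utc).isoformat()
    try:
        resp = requests.get(NODE_STATUS_URL, timeout=5)
        resp.raise_for_status()
        # Record the latest block height if present, purely as extra context.
        info = resp.json().get("last_added_block_info") or {}
        return ts, True, f"height={info.get('height')}"
    except Exception as exc:  # network error, timeout, bad JSON, ...
        return ts, False, type(exc).__name__

def main() -> None:
    with open(LOG_FILE, "a", newline="") as fh:
        writer = csv.writer(fh)
        while True:
            writer.writerow(poll_once())
            fh.flush()
            time.sleep(POLL_INTERVAL_S)

if __name__ == "__main__":
    main()
```

Comparing such a log against the published score for the same week would show immediately whether a reported lag corresponds to a real outage or to a problem on the polling side.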

* In any case, the attachment contains a list of all problematic nodes in the network, with comprehensive data for evaluating network performance for any further research. And if you need logs of those nodes, we can post the list of required PubKeys directly in the testnet Telegram group and explicitly ask their operators to upload them.

Note

The only means of registering those lags is the "Casper Testnet Participant Scores" spreadsheet. But this spreadsheet is not published right at the end of each week, so I (like everyone else in the network) did not have the opportunity to spot abnormalities in time, and this study may therefore no longer contain fully up-to-date data. In addition, note that I evaluated Validator/KeepUp status as of September 8 and September 12. After I made my suggestion in the testnet Telegram group, many people may have activated their bids in the validator auction, so this data may already be out of date (and indeed there is such movement on the net; you can see signs of the race here: https://testnet.cspr.live/validators).

Given the long experience of the test network, 1 lag on a date common to all participants can be considered normal behavior. So even the July 17 lag will not be taken into account when evaluating the performance level of a node (nor will the lags of February 21, May 4, June 21, etc.). 'Performance level' means 'Good' or 'Bad'.

Sources used: the Casper Testnet Participant Scores spreadsheets for 2023 Q1, Q2 and Q3 (as of Sept. 6), plus CNM (https://cnm.casperlabs.io/network/casper-test/detail) as of September 8-12.
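For reproducibility, the per-node summaries in lag_research.ods could be rebuilt directly from those spreadsheets. The sketch below only illustrates that step, assuming a flat export with public_key, week and score columns; the sheet layout and the file name used here are assumptions, not the actual export format.

```python
# Hypothetical reproduction of a per-node summary: load a "Participant Scores"
# export and count, per public key, how many weeks reached the full score of
# 700. Column names ("public_key", "week", "score") are assumed.
import pandas as pd  # pip install pandas odfpy (odfpy is needed for .ods files)

def summarize_scores(path: str) -> pd.DataFrame:
    df = pd.read_excel(path, engine="odf")   # .ods export of the spreadsheet
    df["full_week"] = df["score"] >= 700     # week counted as a "700" week
    summary = (
        df.groupby("public_key")
          .agg(weeks_total=("week", "nunique"),
               weeks_at_700=("full_week", "sum"))
          .sort_values("weeks_at_700")
    )
    return summary

if __name__ == "__main__":
    # Hypothetical file name for a local export of the Q3 sheet.
    print(summarize_scores("participant_scores_2023_q3.ods").head(20))
```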

I try to answer some questions:

Some conclusions:

Whether the '1 lag' cases are counted or not, KeepUp nodes have more lags in the July-August period. This could be related to the scoring tool IP change, to 1.5.2 itself, or to something else. We can keep monitoring.

* Of course, it is fair and logical to say that validator node operators are already more responsible and more interested in the testnet, and most likely they have better servers (there is more load on a validator), and therefore they have fewer lags. Still, such correlations hold for the majority of cases.

Things we would like you to pay attention to and explain:

  1. Are nodes polled while they are finalizing a block? If so, during this time the node may experience additional load and return a bad result, even though it has excellent servers.

  2. One of the burning questions (and one that has no reasonable explanation in the period under study) is the number of nodes which, despite impeccable performance, lost longevity out of nowhere on July 17 (but it seems you have already fixed that).

  3. Finally, we are currently experiencing a sort of attack: certain IPs (and they seem to be the same across the whole network) are spamming some nodes, though not all of them. You are aware of this case, but there is no solution yet.

    Thank you for evaluating my work.


    lag_research.ods

Acceptance Criteria

If it turns out that the problem is not due to errors in the uptime tool, then the traffic that spams the nodes needs to be analyzed and a more accurate firewall defense against such an attack developed. Please also create documentation on the principles and details of the node uptime estimation tool, because people wonder what is going on inside the black box. This attack or bug, whatever it is, is causing me severe financial problems, so I would like some transparency. With this node analysis you have more free time, and we can request logs of all the problematic nodes.
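As a starting point for the traffic analysis requested above, a simple ranking of source IPs from a connection log would show whether the spam really comes from the same few addresses across the network. The sketch below assumes a plain-text log (for example a tcpdump text export) in which source IPs appear on each line; the file name and format are assumptions.

```python
# Hypothetical first pass at "analyze the traffic that spams nodes": extract
# IPv4 addresses from a plain-text connection log and rank them by how often
# they appear, to see whether a few source IPs dominate. Log path is assumed.
import re
import sys
from collections import Counter

IPV4_RE = re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b")

def top_sources(log_path: str, limit: int = 20) -> list[tuple[str, int]]:
    counts: Counter[str] = Counter()
    with open(log_path, "r", errors="replace") as fh:
        for line in fh:
            counts.update(IPV4_RE.findall(line))
    return counts.most_common(limit)

if __name__ == "__main__":
    path = sys.argv[1] if len(sys.argv) > 1 else "connections.log"
    for ip, hits in top_sources(path):
        print(f"{ip:15s} {hits}")
```

Any addresses that clearly dominate such a ranking could then be rate-limited or blocked at the firewall, which is the defense this acceptance criterion asks for.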

NicolasZoellner commented 11 months ago

Dear @mathematicw,

Thank you for your proposal; our technical experts have reviewed it, and we appreciate that you have put together a good summary. Nevertheless, we are currently unable to solve this from the CA side, and it falls outside the scope of the DevReward program. The developers from CL are currently working on that issue.

Therefore, I have to decline this proposal and reject it for DevReward consideration.

Nevertheless, we highly appreciate your input. I recommend that you seek such solutions in the main Discord channel, within the Dev section.

Best regards, Nicolas Zöllner