@dapplion When going through the metrics for analysis on the new beta, can you contribute to this checklist so that in the future we can think about finding a way to automate it? Similar to what you did for benchmarking?
From the Apr 1 planning meeting, extra notes regarding planning out testing infrastructure:
Dade: For testing infrastructure, issues don't usually show up unless we run for a couple of days, which could affect velocity. How should we manage this?
Phil: We should define what an ideal testing infrastructure should look like. We want to put the work into a good process, but make sure it accurately gives us metrics/data that help find potential issues.
Phil: Do we need a controlled devnet environment to do testing in, where we can influence parameters to simulate potential issues? Or is that too much work that doesn't reflect the realities of public testnets and mainnet?
Cayman: Devnets are too small; things scale with validators, and it's difficult to reproduce this on our own devnet. Should we focus on how a devnet would be beneficial for preliminary tests such as sending/receiving messages without getting banned immediately, Beacon API endpoints, etc.? Would this infrastructure be worth it? It will take additional servers and such.
Cayman: Maybe there's a way to set slots per second higher? Compress the amount of work? Change chain params to have it happen quicker? (See the config sketch after these notes.)
Dade: The problem is not necessarily functional bugs, but rather performance regressions. These are usually only seen over a longer period of time. We should invest more in metrics monitoring. There should be some balance of resources between setting up better metrics and building testing infrastructure.
Phil: True. We should ensure the work we put into a testing infrastructure yields good enough data and results to be worth the effort. If we do set up a testnet in a controlled environment, it should probably focus more on functional testing, for example ensuring our Beacon APIs are not broken, and push more of the testing infrastructure towards testing against a more real-world environment like Prater. It should be focused on relieving some of the grunt work (automation).
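As a rough illustration of the chain-parameter idea above (compressing wall-clock time on a controlled devnet), the sketch below shows the kind of overrides that could be applied. The parameter names follow the consensus-spec config; the specific values, and whether Lodestar accepts each of them as a devnet override, are assumptions to verify rather than a recommendation.

```ts
// Sketch only: chain-config overrides for a throwaway local devnet, aimed at
// compressing wall-clock time so multi-epoch behaviour shows up sooner.
// Parameter names follow the consensus-spec config; values are illustrative.
const devnetChainConfigOverrides = {
  // Halve slot time relative to mainnet's 12s so epochs complete twice as fast.
  SECONDS_PER_SLOT: 6,
  // Start the chain shortly after the genesis state is built.
  GENESIS_DELAY: 30,
  // Keep the validator set small enough for a handful of machines to run.
  MIN_GENESIS_ACTIVE_VALIDATOR_COUNT: 64,
};

export default devnetChainConfigOverrides;
```

Even with compressed slots, Dade's point stands: regressions that only surface after days of continuous uptime are not necessarily reproduced any faster, which is why better metrics monitoring is the other half of the investment.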
Closing for #4724
After the performance regression in v0.34.0, it is clear that we need a further, more comprehensive testing process for future releases, one that covers our various environments, setups, and specific parameters, and reliably catches critical regressions in node performance. The end goal is to eventually automate this process in our testing environments and automatically detect regressions in the metrics we define for passing beta release testing.
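As a sketch of what that automation could eventually look like, the snippet below compares the average of one metric between the beta node and the previous stable node over the same window, assuming both are scraped by a Prometheus instance. The Prometheus URL, instance labels, choice of metric, and tolerance are all placeholders, not the real pass/fail criteria.

```ts
// Sketch: flag a regression if the beta node is worse than the previous stable
// node by more than an allowed tolerance, using Prometheus as the data source.
// URL, instance labels, metric choice and tolerance below are placeholders.

const PROMETHEUS = "http://localhost:9090";

async function avgOverWindow(expr: string, hours: number): Promise<number> {
  // avg_over_time over a PromQL subquery yields one averaged sample per series.
  const query = `avg_over_time((${expr})[${hours}h:1m])`;
  const res = await fetch(`${PROMETHEUS}/api/v1/query?query=${encodeURIComponent(query)}`);
  const body = (await res.json()) as {data: {result: {value: [number, string]}[]}};
  if (body.data.result.length === 0) throw new Error(`No data for: ${expr}`);
  return parseFloat(body.data.result[0].value[1]);
}

async function main(): Promise<void> {
  const windowHours = 72; // "no fewer than three days"
  const tolerance = 0.1; // allow up to 10% regression (placeholder)

  const cpuBeta = await avgOverWindow(
    `rate(process_cpu_seconds_total{instance="beta-node:8008"}[5m])`,
    windowHours
  );
  const cpuStable = await avgOverWindow(
    `rate(process_cpu_seconds_total{instance="stable-node:8008"}[5m])`,
    windowHours
  );

  const regression = (cpuBeta - cpuStable) / cpuStable;
  console.log(`CPU usage vs stable: ${(regression * 100).toFixed(1)}% change`);
  if (regression > tolerance) {
    console.error("FAIL: beta regressed beyond tolerance");
    process.exit(1);
  }
  console.log("PASS");
}

main().catch((err) => {
  console.error(err);
  process.exit(1);
});
```

A script along these lines could run against the long-lived beta and stable instances and report the result on the release checklist, which is the kind of grunt work we want automated.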
Our minimal testing environments include a combination of the following:
Hardware Resource Diversity Requirements:
Validator Set Requirements:
Environment Requirements:
A combination of each of these environments should be tested for no fewer than three days and compared with the previous stable version to analyze regressions in any of the metrics. The following metrics checklist defines the criteria required to pass our beta release testing:
This list is a draft and a work in progress.
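To make the checklist automatable once it firms up, each entry could be expressed as data rather than prose, so a comparison script (like the sketch earlier in this issue) can iterate over it. The metric names, directions, and tolerances below are entirely hypothetical placeholders:

```ts
// Sketch: checklist entries encoded for automated evaluation against the
// previous stable release. All names and numbers are placeholders.
export interface MetricCriterion {
  /** PromQL expression (or metric name) to evaluate on both nodes */
  expr: string;
  /** Which direction counts as better for this metric */
  betterWhen: "lower" | "higher";
  /** Allowed regression versus the previous stable version, in percent */
  maxRegressionPct: number;
}

export const betaReleaseChecklist: MetricCriterion[] = [
  {expr: `rate(process_cpu_seconds_total[5m])`, betterWhen: "lower", maxRegressionPct: 10},
  {expr: `nodejs_heap_size_used_bytes`, betterWhen: "lower", maxRegressionPct: 15},
  {expr: `connected_peer_count_placeholder`, betterWhen: "higher", maxRegressionPct: 5},
];
```

Whether each entry ends up as a hard gate or just a flag for manual review is part of what the draft still needs to settle.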