pandaring2you opened this issue 2 years ago
Scenario description: https://www.notion.so/sifchain/Rewards-2-0-Load-Testing-972fbe73b04440cd87232aa60a3146c5
Bootstrapping:
Open questions:
- How do we measure block time?
- Can we change block time?
- If the time for calculations fits into block time, does the time of calculations still influence block time? (I.e., are blocks mined on a timer with a fixed frequency, or does the timing depend on the time of calculations?)
@timlind said about changing block time:
it's part of the consensus config in genesis.json: timeout_commit + timeout_precommit
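As a hedged aside: in a stock Tendermint/Cosmos SDK node these timeouts normally live in the `[consensus]` section of `config.toml` rather than `genesis.json`, and actual block time can be measured by comparing block header timestamps over RPC. A minimal sketch, assuming a local RPC endpoint on :26657, `jq`, and GNU `date`:

```bash
#!/bin/bash
# Sketch: measure average block time by comparing the header timestamps of two
# heights over Tendermint RPC. Assumes a local node on :26657, jq, GNU date.
set -eu
NODE="http://127.0.0.1:26657"
H1=100; H2=200   # hypothetical sample heights
t1=$(curl -s "$NODE/block?height=$H1" | jq -r '.result.block.header.time')
t2=$(curl -s "$NODE/block?height=$H2" | jq -r '.result.block.header.time')
echo "average block time: $(( ($(date -d "$t2" +%s) - $(date -d "$t1" +%s)) / (H2 - H1) ))s"

# Block time can be tuned via the [consensus] timeouts in config.toml, e.g.:
#   timeout_precommit = "1s"
#   timeout_commit    = "5s"   # main knob for the idle gap between blocks
```

On the third question: in stock Tendermint, `timeout_commit` is a fixed wait applied after each commit, so long calculations add to block time on top of it; blocks are not produced on a fixed global timer.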
From @joneskm:
Hey everyone, at block height 7759160 we had 7,025 unique liquidity providers and 12,245 total liquidity providers in production. Both numbers are significant for load testing: the total number of providers determines how many times we loop, and the number of unique providers determines how many transfers and events we have. So ideally we should try to match both of these values. 'Total' is the sum over all pools of the number of liquidity providers in each pool; 'unique' is the number of unique addresses among this 'total'.
@joneskm's script for querying unique and total LPs:
```bash
#!/bin/bash
set -eu

NODE="https://rpc-archive.sifchain.finance:443"

rm -f all-lps.txt
for i in {0..61}  # 61 manually checked - the 62nd page is empty!
do
    echo "Processing page $i"
    # One line per (pool, provider) pair; the same address can appear in several pools.
    sifnoded q clp all-lp --limit=200 --height 7759160 --node "$NODE" --output json --page "$i" \
        | jq -r '.liquidity_providers | .[] | .liquidity_provider_address' >> all-lps.txt
done

sort all-lps.txt | uniq > unique-lps.txt
wc -l all-lps.txt      # total LPs
wc -l unique-lps.txt   # unique LPs
```
@joneskm If we wanted to "apply" the numbers from production to the test, we would have to run the test with these parameters:

--number-of-wallets 7025
--number-of-liquidity-pools 100   (== number_of_tokens)
--liquidity-providers-per-wallet 2   (== number of tokens per wallet)

Thus, each of the 7025 wallets would provide liquidity to 2 pools (chosen randomly out of 100 total pools), so we would have 7025 unique and 7025 * 2 = 14050 total liquidity providers.
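Purely as an illustration of the unique-vs-total accounting (hypothetical file name `assignments.txt`; not part of the actual test):

```bash
#!/bin/bash
# Illustrative only: emulate the unique/total LP accounting described above.
# Each wallet provides liquidity to 2 distinct pools chosen at random from 100.
set -eu
rm -f assignments.txt
for w in $(seq 1 7025); do
    for p in $(shuf -i 1-100 -n 2); do
        echo "wallet-$w pool-$p" >> assignments.txt
    done
done
wc -l < assignments.txt                          # total LPs: 7025 * 2 = 14050
cut -d' ' -f1 assignments.txt | sort -u | wc -l  # unique LPs: 7025
```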
The test implicitly assumes:

native_amount == external_amount

@daechoi mentioned that one of the main concerns is also the gossip time. We should try to run this test with multiple nodes.
Can anybody show me how to set up multiple nodes via CLI?
MVP ETA: tomorrow. Tests using 11k wallets should no longer take 21 hours (now 5 minutes), thanks to work from Caner and Jure.
Note: Rewards 2.0 is released but not turned on. We want to load test first, then toggle it on.
Results:
- `add-genesis-account` turned out to be very slow (>5h), since each invocation has to load, parse, modify, and save the genesis file. The time needed for this operation increases progressively with every added account. (A batch-editing sketch appears below.)

Next steps:
- `sifnoded query clp lplist` crashes the test because the number of results exceeds the page size.

Next step:
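Aside: one way to avoid the per-account load/parse/save cycle of `add-genesis-account` is to append all accounts to `genesis.json` in a single pass. A minimal sketch, assuming the standard Cosmos SDK genesis layout (`app_state.bank.balances`), jq 1.6+, and a hypothetical `addresses.txt` with one address per line; the actual fix that brought setup from 21 hours down to 5 minutes is not necessarily this one:

```bash
#!/bin/bash
# Sketch: add many funded accounts to genesis.json in one jq pass instead of
# invoking `sifnoded add-genesis-account` once per account.
set -eu
GENESIS="$HOME/.sifnoded/config/genesis.json"

jq --rawfile addrs addresses.txt '
  ($addrs | split("\n") | map(select(length > 0))) as $list
  | .app_state.bank.balances +=
      [$list[] | {address: ., coins: [{denom: "rowan", amount: "1000000000"}]}]
' "$GENESIS" > genesis.new.json && mv genesis.new.json "$GENESIS"

# Note: a complete version would also append matching entries to
# app_state.auth.accounts and update app_state.bank.supply.
```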
Scenario: 100 pools / 8000 wallets / 3 liquidity providers per wallet, single node running on localhost
Results:
- `block_results` occurred 1961 times.
- Event counts: `coin_spent`: ~8000, `lppd/distribution`: ~8000, `transfer`: ~8000, `message`: ~8000.
- `sifnoded query bank balances` was still working.
- `coin_received` was 6 instead of 5. Is this expected?

Logs are available on request.
Setup: 5 pools / 8000 wallets / 3 liquidity providers per wallet, single node running on localhost
Results:
- `block_results` occurred 2001 times, >90% failing when LPPD is active, OK otherwise.

Scenario: 100 pools / 8000 wallets / 3 liquidity providers per wallet, single node running on localhost
Branch: fix/lppd_rewards_block_time (63830c223c520a4e1841b6c2b8de73d13e3e8db8)
Results:
- `block_results`:
Jure raised the issue found in test iteration 4 with Jon. Sifnode has a fix that Jure is testing.
The test has now run; we still need confirmation that the results are as expected. Next, we need to re-run the test with multiple nodes.
Note from the 7/22/2022 load testing meeting: once this test is resolved, we will reconvene the load testing meeting to examine the results and determine additional load tests.
Parameters:
Results:
@jzvikart what are the errors reported on the 500s on `block_results`?
@jzvikart - Add nodes to scale up to the production set of ~125 nodes eventually, and let's see the results.
@sifag The errors look like this:

```
$ curl "http://127.0.0.1:26657/block_results?height=nnn"
500 Internal Server Error
```

I haven't checked the sifnoded logs; there might be more information there.
Open questions about the multinode implementation:
1. Network topology, e.g. A<->B<->C<->A
Due to time pressure, we decided to ignore the effects of network topology and latency in the multi-node setup for now. We will run the tests according to @gzukel's script, and we will decide later if we need to do more exploration.
More information about multinode setup (Luca):
ETA for test iteration 7: a few days.
@sifag found some additional information about the HTTP 500 errors for `block_results`. The full response actually looks like this:
```json
{
  "jsonrpc": "2.0",
  "id": -1,
  "error": {
    "code": -32603,
    "message": "Internal error",
    "data": "could not find results for height #883"
  }
}
```
The current hypothesis is that even though a new block is already being reported by `sifnoded status`, it might take some time for the data to become available through `block_results`.

To check this, we will modify the test to explicitly pass a `height` parameter of `current_block - 1` instead of nothing (which defaults to `current_block`). If the hypothesis is true, we expect to see a significantly reduced number of those HTTP 500 errors.
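A minimal sketch of that workaround, assuming a local Tendermint RPC endpoint on :26657 and `jq`:

```bash
#!/bin/bash
# Sketch: query block_results for the previous height rather than the latest
# one reported by /status, to avoid racing the node's result indexing.
set -eu
NODE="http://127.0.0.1:26657"
latest=$(curl -s "$NODE/status" | jq -r '.result.sync_info.latest_block_height')
curl -s "$NODE/block_results?height=$((latest - 1))" | jq '.result.height // .error'
```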
The hypothesis is correct: I saw exactly zero 500s on my machine.
There was a bug in previous versions of the test: the number of liquidity pools was fixed at 10 instead of being taken from `--number-of-liquidity-pools`. Because of this, the test might have been giving wrong results; in particular, the measured block time might have been lower than it should be.
Fixed in commit d0658972811e2bb1bbdca0a013fd8002d28fa492.
Summary: as in iteration 7, but rebased on current master + fixes.

Parameters:
- `--number-of-liquidity-pools`: yes
- `height = current_block - 1` for `block_results`

Results:

Findings:
- The HTTP 500 errors on `http://.../block_results` are avoided by explicitly passing `?height=x`, where `x` is `current_block - 1`.
Summary: test code was almost completely rewritten to support multinode implementation, so we are running the new code with the same scenario as in iteration 8. We expect the results to be the same.
Results:
We found a bug in the rewards calculation code on v0.13.6 that halts the chain because of a broken assertion. From sifnoded.log:

```
ERR CONSENSUS FAILURE!!! err="negative coin amount" module=consensus
```
We were able to reproduce the bug with the following parameters:
```bash
../integration/framework/venv/bin/python3 test_many_pools_and_liquidity_providers.py \
    --number-of-nodes 1 \
    --number-of-wallets 8000 \
    --number-of-liquidity-pools 100 \
    --liquidity-providers-per-wallet 3 \
    --reward-period-default-multiplier 1.0 \
    --reward-period-distribute \
    --reward-period-pool-count 100 \
    --reward-period-mod 1 \
    --lpd-period-mod 1 \
    --phase-offset-blocks 3 \
    --phase-duration-blocks 99999999 \
    --block-results-offset 1 \
    --run-forever | tee test_many_pools_and_liquidity_providers.log
```
Under these conditions, the error happens approx. 1.5 h after the start of the test.
Note: as there have been several releases since this was last worked on, Jure will need to revisit adapting these tests with Pradeep and Caner. Results from the 8K wallet test indicate that there are still performance issues in certain scenarios (Dae says this happens when several events are generated).
Primary goal is still to have load tests for potential issues that may arise in production (LPD+Rewards) and future feature releases (PMTP cashback, peggy2, etc.).
For example, block time went up to 23 seconds, but we are currently having difficulty reproducing this.
Tests fixed and currently running. @jzvikart to check in on estimates of when the tests will complete.
A test with 100k wallets and 3 LPs per wallet failed to set up. The command `query clp lplist` is consistently timing out with:

```
Error: post failed: Post "http://127.0.0.1:10286": EOF
```

The most likely cause is the large number of results.
@jzvikart to file a ticket with sifnode for the issue above. This is the first time we are running a test with 100K wallets (300K LPs).
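One possible mitigation, until the underlying issue is fixed, is to page through the results rather than fetching everything in one response, like @joneskm's `all-lp` script above. A sketch, assuming `lplist` accepts the same `--limit`/`--page` flags and a similar JSON shape (both unverified), and that the local node listens on 127.0.0.1:10286:

```bash
#!/bin/bash
# Sketch: fetch the LP list page by page to avoid one huge response that
# makes the query time out. Flag names and output shape assumed from `all-lp`.
set -eu
NODE="http://127.0.0.1:10286"
page=1
rm -f lplist-pages.jsonl
while true; do
    out=$(sifnoded query clp lplist --limit=200 --page "$page" --node "$NODE" --output json)
    count=$(echo "$out" | jq '.liquidity_providers | length')
    if [ "$count" -eq 0 ]; then break; fi   # stop at the first empty page
    echo "$out" >> lplist-pages.jsonl
    page=$((page + 1))
done
```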
On hold. Jure re-assigned to help test margin.
Note: since we're outsourcing to 3rd-party providers, they might be more optimized; in that case this issue is less important, but it is still a concern we'd like to run by someone on sifnode.
Need to discuss the revised load testing plan with the Peggy team: where we are with the current iterations, and what needs to change.
On hold - even at 10x current traffic levels, we should be fine on Peggy2. @jzvikart - can you re-run the load tests to double-check they are working? Once this is done, we can close this task out (and should not have any additional load test work for Peggy2). If we run into sifnode load issues, we will create a separate issue.
When creating an increasing number of pools and liquidity provider addresses, check relative performance (is it increasing linearly?); see the sketch at the end of this section.
Example issues that this test would catch:
Note - we shouldn't be iterating through every pool anywhere in the code (the Peggy 2.0 situation, where we have tens of thousands of tokens).
This issue was raised by @daechoi and @sheokapr.
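A possible shape for that scaling check, as a hedged sketch: re-run the same scenario at increasing pool counts and compare wall-clock cost. This assumes the test script shown earlier terminates when `--run-forever` is omitted; the flags are the ones used above.

```bash
#!/bin/bash
# Sketch: run the scenario with an increasing number of pools and record
# wall-clock time per run, to check whether cost grows roughly linearly.
set -eu
for pools in 10 100 1000 10000; do
    start=$(date +%s)
    ../integration/framework/venv/bin/python3 test_many_pools_and_liquidity_providers.py \
        --number-of-nodes 1 \
        --number-of-wallets 8000 \
        --number-of-liquidity-pools "$pools" \
        --liquidity-providers-per-wallet 3 \
        > "run-$pools.log"
    echo "$pools pools: $(( $(date +%s) - start ))s"
done
```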