pandaring2you opened this issue 2 years ago
Scenario description: https://www.notion.so/sifchain/Rewards-2-0-Load-Testing-972fbe73b04440cd87232aa60a3146c5
Bootstrapping:
Open questions:
- How do we measure block time?
- Can we change block time?
- If the time for calculations fits into block time, does the time of calculations still influence block time? (I.e., are blocks mined on a timer with a fixed frequency, or does the timing depend on the time of calculations?)
@timlind said about changing block time:
it's part of the consensus config in genesis.json: timeout_commit + timeout_precommit
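As a hedged aside: in a stock Tendermint/Cosmos SDK node these timeouts normally live in the `[consensus]` section of `config.toml` rather than `genesis.json`, and actual block time can be measured by comparing block header timestamps over RPC. A minimal sketch, assuming a local RPC endpoint on :26657, `jq`, and GNU `date`:

```bash
#!/bin/bash
# Sketch: measure average block time by comparing the header timestamps of two
# heights over Tendermint RPC. Assumes a local node on :26657, jq, GNU date.
set -eu
NODE="http://127.0.0.1:26657"
H1=100; H2=200   # hypothetical sample heights
t1=$(curl -s "$NODE/block?height=$H1" | jq -r '.result.block.header.time')
t2=$(curl -s "$NODE/block?height=$H2" | jq -r '.result.block.header.time')
echo "average block time: $(( ($(date -d "$t2" +%s) - $(date -d "$t1" +%s)) / (H2 - H1) ))s"

# Block time can be tuned via the [consensus] timeouts in config.toml, e.g.:
#   timeout_precommit = "1s"
#   timeout_commit    = "5s"   # main knob for the idle gap between blocks
```

On the third question: in stock Tendermint, `timeout_commit` is a fixed wait applied after each commit, so long calculations add to block time on top of it; blocks are not produced on a fixed global timer.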
From @joneskm:
Hey everyone, at block height 7759160 we had 7,025 unique liquidity providers and 12,245 total liquidity providers in production. Both numbers are significant for load testing: the total number of providers determines how many times we loop, and the number of unique providers determines how many transfers and events we have. So ideally we should try to match both of these values. 'Total' is the sum over all pools of the number of liquidity providers in each pool; 'unique' is the number of unique addresses among this 'total'.
@joneskm's script for querying unique and total LPs:
```bash
#!/bin/bash
set -eu

NODE="https://rpc-archive.sifchain.finance:443"

rm -f all-lps.txt
for i in {0..61}  # 61 manually checked - the 62nd page is empty!
do
    echo "Processing page $i"
    # One line per (pool, provider) pair; the same address can appear in several pools.
    sifnoded q clp all-lp --limit=200 --height 7759160 --node "$NODE" --output json --page "$i" \
        | jq -r '.liquidity_providers | .[] | .liquidity_provider_address' >> all-lps.txt
done

sort all-lps.txt | uniq > unique-lps.txt
wc -l all-lps.txt      # total LPs
wc -l unique-lps.txt   # unique LPs
```
@joneskm If we wanted to "apply" the numbers from production to the test, we would have to run the test with these parameters:

--number-of-wallets 7025
--number-of-liquidity-pools 100   (== number_of_tokens)
--liquidity-providers-per-wallet 2   (== number of tokens per wallet)

Thus, each of the 7025 wallets would provide liquidity to 2 pools (chosen randomly out of 100 total pools), so we would have 7025 unique and 7025 * 2 = 14050 total liquidity providers.
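Purely as an illustration of the unique-vs-total accounting (hypothetical file name `assignments.txt`; not part of the actual test):

```bash
#!/bin/bash
# Illustrative only: emulate the unique/total LP accounting described above.
# Each wallet provides liquidity to 2 distinct pools chosen at random from 100.
set -eu
rm -f assignments.txt
for w in $(seq 1 7025); do
    for p in $(shuf -i 1-100 -n 2); do
        echo "wallet-$w pool-$p" >> assignments.txt
    done
done
wc -l < assignments.txt                          # total LPs: 7025 * 2 = 14050
cut -d' ' -f1 assignments.txt | sort -u | wc -l  # unique LPs: 7025
```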
The test implicitly assumes:

native_amount == external_amount

@daechoi mentioned that one of the main concerns is also the gossip time. We should try to run this test with multiple nodes.
Can anybody show me how to set up multiple nodes via CLI?
MVP ETA: tomorrow. Tests using 11k wallets should no longer take 21 hours (now 5 minutes), thanks to work from Caner and Jure.
Note: Rewards 2.0 is released but not turned on. We want to load test first, then toggle it on.
Results:
- `add-genesis-account` turned out to be very slow (>5h), since each invocation has to load, parse, modify, and save the genesis file. The time needed for this operation increases progressively with every added account. (A batch-editing sketch appears below.)

Next steps:
- `sifnoded query clp lplist` crashes the test because the number of results exceeds the page size.

Next step:
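Aside: one way to avoid the per-account load/parse/save cycle of `add-genesis-account` is to append all accounts to `genesis.json` in a single pass. A minimal sketch, assuming the standard Cosmos SDK genesis layout (`app_state.bank.balances`), jq 1.6+, and a hypothetical `addresses.txt` with one address per line; the actual fix that brought setup from 21 hours down to 5 minutes is not necessarily this one:

```bash
#!/bin/bash
# Sketch: add many funded accounts to genesis.json in one jq pass instead of
# invoking `sifnoded add-genesis-account` once per account.
set -eu
GENESIS="$HOME/.sifnoded/config/genesis.json"

jq --rawfile addrs addresses.txt '
  ($addrs | split("\n") | map(select(length > 0))) as $list
  | .app_state.bank.balances +=
      [$list[] | {address: ., coins: [{denom: "rowan", amount: "1000000000"}]}]
' "$GENESIS" > genesis.new.json && mv genesis.new.json "$GENESIS"

# Note: a complete version would also append matching entries to
# app_state.auth.accounts and update app_state.bank.supply.
```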
Scenario: 100 pools / 8000 wallets / 3 liquidity providers per wallet, single node running on localhost
Results:
- `block_results` occurred 1961 times.
- Event counts: `coin_spent`: ~8000, `lppd/distribution`: ~8000, `transfer`: ~8000, `message`: ~8000.
- `sifnoded query bank balances` was still working.
- `coin_received` was 6 instead of 5. Is this expected?

Logs are available on request.
Setup: 5 pools / 8000 wallets / 3 liquidity providers per wallet, single node running on localhost
Results:
- `block_results` occurred 2001 times, >90% failing when LPPD is active, OK otherwise.

Scenario: 100 pools / 8000 wallets / 3 liquidity providers per wallet, single node running on localhost
Branch: fix/lppd_rewards_block_time (63830c223c520a4e1841b6c2b8de73d13e3e8db8)
Results:
- `block_results`:
Jure raised the issue found in test iteration 4 with Jon. Sifnode has a fix that Jure is testing.
The test has now run; we still need confirmation that the results are as expected. Next, we need to re-run the test with multiple nodes.
Note from the 7/22/2022 load testing meeting: once this test is resolved, we will reconvene the load testing meeting to examine the results and determine additional load tests.
Parameters:
Results:
@jzvikart what are the errors reported on the 500s on `block_results`?
@jzvikart - Add nodes to scale up to the production set of ~125 nodes eventually, and let's see the results.
@sifag The errors look like this:

```
$ curl "http://127.0.0.1:26657/block_results?height=nnn"
500 Internal Server Error
```

I haven't checked the sifnoded logs; there might be more information there.
Open questions about the multinode implementation:
1. Network topology, e.g. A<->B<->C<->A
Due to time pressure, we decided to ignore the effects of network topology and latency in the multi-node setup for now. We will run the tests according to @gzukel's script, and we will decide later if we need to do more exploration.
More information about multinode setup (Luca):
ETA for test iteration 7: a few days.
@sifag found some additional information about the HTTP 500 errors for `block_results`. The full response actually looks like this:
```json
{
  "jsonrpc": "2.0",
  "id": -1,
  "error": {
    "code": -32603,
    "message": "Internal error",
    "data": "could not find results for height #883"
  }
}
```
The current hypothesis is that even though a new block is already being reported by `sifnoded status`, it might take some time for the data to become available through `block_results`.

To check this, we will modify the test to explicitly pass a `height` parameter of `current_block - 1` instead of nothing (which defaults to `current_block`). If the hypothesis is true, we expect to see a significantly reduced number of those HTTP 500 errors.
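A minimal sketch of that workaround, assuming a local Tendermint RPC endpoint on :26657 and `jq`:

```bash
#!/bin/bash
# Sketch: query block_results for the previous height rather than the latest
# one reported by /status, to avoid racing the node's result indexing.
set -eu
NODE="http://127.0.0.1:26657"
latest=$(curl -s "$NODE/status" | jq -r '.result.sync_info.latest_block_height')
curl -s "$NODE/block_results?height=$((latest - 1))" | jq '.result.height // .error'
```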
The hypothesis is correct: I saw exactly zero 500s on my machine.
There was a bug in previous versions of the test: the number of liquidity pools was fixed at 10 instead of being taken from `--number-of-liquidity-pools`. Because of this, the test might have been giving wrong results; in particular, the measured block time might have been lower than it should be.
Fixed in commit d0658972811e2bb1bbdca0a013fd8002d28fa492.
Summary: as in iteration 7, but rebased on current master + fixes.

Parameters:
- `--number-of-liquidity-pools`: yes
- `height = current_block - 1` for `block_results`

Results:

Findings:
- The HTTP 500 errors on `http://.../block_results` are avoided by explicitly passing `?height=x`, where `x` is `current_block - 1`.
Summary: test code was almost completely rewritten to support multinode implementation, so we are running the new code with the same scenario as in iteration 8. We expect the results to be the same.
Results:
We found a bug in the rewards calculation code on v0.13.6 that halts the chain because of a broken assertion. From sifnoded.log:

```
ERR CONSENSUS FAILURE!!! err="negative coin amount" module=consensus
```
We were able to reproduce the bug with the following parameters:
```bash
../integration/framework/venv/bin/python3 test_many_pools_and_liquidity_providers.py \
    --number-of-nodes 1 \
    --number-of-wallets 8000 \
    --number-of-liquidity-pools 100 \
    --liquidity-providers-per-wallet 3 \
    --reward-period-default-multiplier 1.0 \
    --reward-period-distribute \
    --reward-period-pool-count 100 \
    --reward-period-mod 1 \
    --lpd-period-mod 1 \
    --phase-offset-blocks 3 \
    --phase-duration-blocks 99999999 \
    --block-results-offset 1 \
    --run-forever | tee test_many_pools_and_liquidity_providers.log
```
Under these conditions, the error happens approx. 1.5 h after the start of the test.
Note: as there have been several releases since this was last worked on, Jure will need to revisit adapting these tests with Pradeep and Caner. Results from the 8K wallet test indicate that there are still performance issues in certain scenarios (Dae says this happens when several events are generated).
Primary goal is still to have load tests for potential issues that may arise in production (LPD+Rewards) and future feature releases (PMTP cashback, peggy2, etc.).
For example, block time went up to 23 seconds, but we are currently having difficulty reproducing this.
Tests fixed and currently running. @jzvikart to check in on estimates of when the tests will complete.
A test with 100k wallets and 3 LPs per wallet failed to set up. The command `query clp lplist` is consistently timing out with:

```
Error: post failed: Post "http://127.0.0.1:10286": EOF
```

The most likely cause is the large number of results.
@jzvikart to file a ticket with sifnode for the issue above. This is the first time we are running a test with 100K wallets (300K LPs).
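One possible mitigation, until the underlying issue is fixed, is to page through the results rather than fetching everything in one response, like @joneskm's `all-lp` script above. A sketch, assuming `lplist` accepts the same `--limit`/`--page` flags and a similar JSON shape (both unverified), and that the local node listens on 127.0.0.1:10286:

```bash
#!/bin/bash
# Sketch: fetch the LP list page by page to avoid one huge response that
# makes the query time out. Flag names and output shape assumed from `all-lp`.
set -eu
NODE="http://127.0.0.1:10286"
page=1
rm -f lplist-pages.jsonl
while true; do
    out=$(sifnoded query clp lplist --limit=200 --page "$page" --node "$NODE" --output json)
    count=$(echo "$out" | jq '.liquidity_providers | length')
    if [ "$count" -eq 0 ]; then break; fi   # stop at the first empty page
    echo "$out" >> lplist-pages.jsonl
    page=$((page + 1))
done
```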
On hold. Jure re-assigned to help test margin.
Note: since we're outsourcing to 3rd-party providers, they might be more optimized; in that case this issue is less important, but it is still a concern we'd like to run by someone on sifnode.
Need to discuss the revised load testing plan with the Peggy team: where we are with the current iterations, and what needs to change.
On hold - even at 10x current traffic levels, we should be fine on Peggy2. @jzvikart - can you re-run the load tests to double-check they are working? Once this is done, we can close this task out (and should not have any additional load test work for Peggy2). If we run into sifnode load issues, we will create a separate issue.
When creating an increasing number of pools and liquidity provider addresses, check relative performance (is it increasing linearly?); see the sketch at the end of this section.
Example issues that this test would catch:
Note - we shouldn't be iterating through every pool anywhere in the code (the Peggy 2.0 situation, where we have tens of thousands of tokens).
This issue was raised by @daechoi and @sheokapr.
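A possible shape for that scaling check, as a hedged sketch: re-run the same scenario at increasing pool counts and compare wall-clock cost. This assumes the test script shown earlier terminates when `--run-forever` is omitted; the flags are the ones used above.

```bash
#!/bin/bash
# Sketch: run the scenario with an increasing number of pools and record
# wall-clock time per run, to check whether cost grows roughly linearly.
set -eu
for pools in 10 100 1000 10000; do
    start=$(date +%s)
    ../integration/framework/venv/bin/python3 test_many_pools_and_liquidity_providers.py \
        --number-of-nodes 1 \
        --number-of-wallets 8000 \
        --number-of-liquidity-pools "$pools" \
        --liquidity-providers-per-wallet 3 \
        > "run-$pools.log"
    echo "$pools pools: $(( $(date +%s) - start ))s"
done
```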