
Sifnode load testing LPD+Rewards: Creating an increasing number of pools and liquidity provider addresses #3020

Open · pandaring2you opened this issue 2 years ago

pandaring2you commented 2 years ago

When creating an increasing number of pools and liquidity provider addresses, check relative performance: does it scale linearly with size, or worse?

Example issues that this test would catch:

Note - we shouldn't be iterating through every pool anywhere in the code (the Peggy 2.0 situation, where we may have tens of thousands of tokens).
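
As a rough sketch of the kind of measurement this implies, the snippet below samples average block time from a node's standard Tendermint RPC; the port, paths, and 60-second window are assumptions, not part of the test code:

#!/bin/bash
# Hedged sketch: sample average block time over a fixed window using the
# standard Tendermint /status endpoint (default port 26657 assumed).
RPC="http://127.0.0.1:26657"

h1=$(curl -s "$RPC/status" | jq -r '.result.sync_info.latest_block_height')
t1=$(date +%s)
sleep 60
h2=$(curl -s "$RPC/status" | jq -r '.result.sync_info.latest_block_height')
t2=$(date +%s)

# Assumes at least one block was produced during the window.
echo "avg block time: $(( (t2 - t1) / (h2 - h1) ))s over $((h2 - h1)) blocks"

Repeating this at increasing pool/LP counts gives the scaling curve the issue asks about.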

Issue was raised by @daechoi and @sheokapr

jzvikart commented 2 years ago

Scenario description: https://www.notion.so/sifchain/Rewards-2-0-Load-Testing-972fbe73b04440cd87232aa60a3146c5

Bootstrapping:

  1. https://github.com/Sifchain/sifnode/blob/develop/scripts/init_w_prod_tokens.sh
  2. https://github.com/Sifchain/sifnode/blob/develop/scripts/run.sh

Open questions:

jzvikart commented 2 years ago

@timlind said about changing block time:

it's part of consensus config in genesis.json: timeout_commit + timeout_precommit
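
For reference, in stock Tendermint these timeouts are per-node settings in config.toml under [consensus]; a minimal sketch for shortening block time on a local node (the home path and the 1s/500ms values are assumptions):

# Hedged sketch: reduce the Tendermint consensus timeouts on a local node.
CONFIG="$HOME/.sifnoded/config/config.toml"
sed -i 's/^timeout_commit =.*/timeout_commit = "1s"/' "$CONFIG"
sed -i 's/^timeout_precommit =.*/timeout_precommit = "500ms"/' "$CONFIG"
# Restart the node for the change to take effect.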

jzvikart commented 2 years ago

From @joneskm:

Hey everyone, at block height 7759160 we had 7,025 unique liquidity providers and 12,245 total liquidity providers in production. Both numbers matter for load testing: the total determines how many times we loop, and the number of unique providers determines how many transfers and events we have, so ideally we should try to match both values. 'Total' is the sum over all pools of the number of liquidity providers in each pool; 'unique' is the number of distinct addresses among that total. (For example, if address X provides liquidity to 3 pools and address Y to 1 pool, the total is 4 and the unique count is 2.)

jzvikart commented 2 years ago

@joneskm's script for querying unique and total LPs:

#!/bin/bash
# Query all liquidity providers at a fixed block height from an archive node,
# then count total vs. unique LP addresses.

set -eu

NODE="https://rpc-archive.sifchain.finance:443"

rm -f all-lps.txt

# 61 manually checked - the 62nd page is empty!
for i in {0..61}
do
   echo "Processing page $i"
   sifnoded q clp all-lp --limit=200 --height 7759160 --node "$NODE" --output json --page "$i" \
       | jq -r '.liquidity_providers | .[] | .liquidity_provider_address' >> all-lps.txt
done

# Deduplicate to get the unique address list.
sort -u all-lps.txt > unique-lps.txt

wc -l all-lps.txt     # total liquidity providers (12,245 at height 7759160)
wc -l unique-lps.txt  # unique liquidity providers (7,025 at height 7759160)

jzvikart commented 2 years ago

@joneskm If we wanted to "apply" the numbers from production to the test, we would have to run the test with these parameters:

Thus, each of the 7025 wallets would provide liquidity to 2 pools (chosen at random from 100 total pools), giving us 7025 unique and 7025 * 2 = 14050 total liquidity providers.
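
A minimal sketch of that wallet-to-pool assignment (file and label names are illustrative, not from the actual test code):

# Hedged sketch: give each of 7025 wallets 2 random pools out of 100, yielding
# 7025 unique and 14050 total LP entries.
for w in $(seq 1 7025); do
    shuf -i 1-100 -n 2 | while read -r pool; do
        echo "wallet-$w pool-$pool"
    done
done > lp-assignments.txt

wc -l lp-assignments.txt                              # expect 14050 total
awk '{print $1}' lp-assignments.txt | sort -u | wc -l # expect 7025 unique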

The test implicitly assumes:

jzvikart commented 2 years ago

@daechoi mentioned that one of the main concerns is also gossip time. We should try to run this test with multiple nodes.

Can anybody show me how to set up multiple nodes via CLI?
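
If sifnoded inherits the stock Cosmos SDK testnet subcommand (an assumption, unverified for this chain), a local multi-node bootstrap might look roughly like:

# Hedged sketch: generate config/genesis for 4 local validators. Subcommand
# and flags are the generic Cosmos SDK ones, assumed to exist here.
sifnoded testnet --v 4 --output-dir ./multinode --chain-id sifchain-local \
    --starting-ip-address 192.168.10.2
# Each node would then be started with its own --home directory.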

pandaring2you commented 2 years ago

MVP ETA: tomorrow. Tests using 11k wallets should no longer take 21 hours (now 5 minutes), thanks to work from Caner and Jure.

Note: Rewards 2.0 is released but not yet enabled. We want to load test first, then toggle it on.

jzvikart commented 2 years ago

Test iteration 1:

Results:

Next steps:

jzvikart commented 2 years ago

Test iteration 2:

Next step:

jzvikart commented 2 years ago

Test iteration 3:

jzvikart commented 2 years ago

Test iteration 4:

Scenario: 100 pools / 8000 wallets / 3 liquidity providers per wallet, single node running on localhost

Results:

Logs are available on request.

jzvikart commented 2 years ago

Test iteration 5:

Setup: 5 pools / 8000 wallets / 3 liquidity providers per wallet, single node running on localhost

Results:

jzvikart commented 2 years ago

Test iteration 6:

Scenario: 100 pools / 8000 wallets / 3 liquidity providers per wallet, single node running on localhost
Branch: fix/lppd_rewards_block_time (63830c223c520a4e1841b6c2b8de73d13e3e8db8)

Results:

pandaring2you commented 2 years ago

Jure raised the issue found in test iteration 4 with Jon. Sifnode has a fix that Jure is testing.

pandaring2you commented 2 years ago

The test has now run; we still need confirmation that the results are as expected. Next we need to re-run the test with multiple nodes.

pandaring2you commented 2 years ago

Note from the 7/22/2022 load testing meeting: once this test is resolved, we will reconvene the load testing meeting to examine the results and determine additional load tests.

jzvikart commented 2 years ago

Test iteration 7:

Parameters:

Results:

ghost commented 2 years ago

@jzvikart what are the errors reported with the 500s on block_results?

bksifchain commented 2 years ago

@jzvikart - Add nodes to scale up, eventually to the production set of ~125 nodes, and let's see the results.

jzvikart commented 2 years ago

@sifag The errors look like this:

$ curl http://127.0.0.1:26657/block_results?height=nnn
500 Internal Server Error

I haven't checked the sifnoded logs; there might be more information about it there.
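
To see the JSON-RPC error body rather than just the HTTP status, something like this should work against the standard Tendermint RPC (the height is only an example):

# Hedged sketch: dump the JSON-RPC error object for a failing height.
curl -s "http://127.0.0.1:26657/block_results?height=883" | jq '.error'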

jzvikart commented 2 years ago

Open questions about multinode implementation:

daechoi commented 2 years ago

  1. Topology: A<->B<->C<->A
  2. Would like to know the effect on response time of hitting different nodes for queries and transactions: hit A for txn, A for query; hit A for txn, B for query; hit A for txn, C for query.
  3. Let's put C in a different region.
jzvikart commented 2 years ago

Due to time pressure we decided to ignore the effects of network topology and latency in the multi-node setup for now. We will run the tests according to @gzukel's script, and we will decide later whether we need to do more exploration.

More information about multinode setup (Luca):

pandaring2you commented 2 years ago

ETA for test iteration 7: a few days.

jzvikart commented 2 years ago

@sifag found some additional information about HTTP 500 errors for block_results. The full response actually looks like this:

{
  "jsonrpc": "2.0",
  "id": -1,
  "error": {
    "code": -32603,
    "message": "Internal error",
    "data": "could not find results for height #883"
  }
}

The current hypothesis is that even though a new block is already being reported by sifnoded status, it might take some time for its data to become available through block_results.

To check this, we will modify the test to explicitly pass a height parameter of current_block - 1 instead of nothing (which defaults to the current block). If the hypothesis is true, we expect to see a significantly reduced number of these HTTP 500 errors.
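
A minimal sketch of the intended change (standard Tendermint RPC; the default port and jq path are assumptions):

# Hedged sketch: query block_results one block behind the reported tip so the
# test never races the node on the block it has only just committed.
HEIGHT=$(curl -s http://127.0.0.1:26657/status \
         | jq -r '.result.sync_info.latest_block_height')
curl -s "http://127.0.0.1:26657/block_results?height=$((HEIGHT - 1))"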

ghost commented 2 years ago

Hypothesis is correct: I saw exactly zero 500s on my machine

jzvikart commented 2 years ago

There was a bug in previous versions of the test: the number of liquidity pools was fixed at 10 instead of being taken from --number-of-liquidity-pools. Because of this, the test might have been giving wrong results; in particular, the measured block time might have been lower than it should have been.

Fixed in commit d0658972811e2bb1bbdca0a013fd8002d28fa492.

jzvikart commented 2 years ago

Test iteration 8:

Summary: as in 7, but rebased on current master + fixes

Parameters:

Results:

Findings:

jzvikart commented 2 years ago

Test iteration 9:

Summary: the test code was almost completely rewritten to support the multinode implementation, so we are running the new code with the same scenario as in iteration 8. We expect the results to be the same.

Results:

jzvikart commented 2 years ago

We found a bug in the rewards calculation code on v0.13.6 that halts the chain because of a broken assertion.

sifnoded.log: ERR CONSENSUS FAILURE!!! err="negative coin amount" module=consensus

We were able to reproduce the bug with the following parameters:

../integration/framework/venv/bin/python3 test_many_pools_and_liquidity_providers.py \
    --number-of-nodes 1 \
    --number-of-wallets 8000 \
    --number-of-liquidity-pools 100 \
    --liquidity-providers-per-wallet 3 \
    --reward-period-default-multiplier 1.0 \
    --reward-period-distribute \
    --reward-period-pool-count 100 \
    --reward-period-mod 1 \
    --lpd-period-mod 1 \
    --phase-offset-blocks 3 \
    --phase-duration-blocks 99999999 \
    --block-results-offset 1 \
    --run-forever | tee test_many_pools_and_liquidity_providers.log

Under these conditions, the error occurs approx. 1.5 hours after the start of the test.

pandaring2you commented 2 years ago

Note: as there have been several releases since this was last worked on, Jure will need to revisit adapting these tests with Pradeep and Caner. Results from the 8K-wallet test indicate that there are still performance issues in certain scenarios (Dae says this happens when many events are being generated).

The primary goal is still to have load tests for potential issues that may arise in production (LPD+Rewards) and in future feature releases (PMTP cashback, peggy2, etc.).

pandaring2you commented 2 years ago

Example: block time went up to 23 seconds, but we are currently having difficulty reproducing this.

pandaring2you commented 2 years ago

Tests fixed and currently running - @jzvikart to check in with an estimate of when the tests will finish.

jzvikart commented 2 years ago

The test with 100k wallets and 3 LPs per wallet failed to set up. The command "query clp lplist" is consistently timing out with: Error: post failed: Post "http://127.0.0.1:10286": EOF.

The most likely cause is the large number of results being returned.
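
If the response size is the problem, paging through the results the way the earlier all-lp script does might sidestep the timeout; a minimal sketch (the page size and the empty-page stop condition are assumptions):

# Hedged sketch: page through LPs instead of issuing one huge query; stop when
# a page comes back empty. Flags mirror the all-lp script above.
page=0
while true; do
    out=$(sifnoded q clp all-lp --limit=200 --output json --page "$page" \
          | jq -r '.liquidity_providers | .[] | .liquidity_provider_address')
    [ -z "$out" ] && break
    echo "$out"
    page=$((page + 1))
done > all-lps.txt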

pandaring2you commented 2 years ago

@jzvikart to file a ticket with sifnode for the issue above. This is the first time we are running a test with 100K wallets (300K LPs).

pandaring2you commented 2 years ago

On hold. Jure has been re-assigned to help test margin.

pandaring2you commented 2 years ago

Note: since we're outsourcing to third-party providers, their setups might be more optimized. In that case this issue is less important, but it's still a concern we'd like to run by someone on sifnode.

pandaring2you commented 1 year ago

Need to discuss the revised load testing plan with the Peggy team: where we are with the current iterations, and what needs to change.

pandaring2you commented 1 year ago

https://docs.google.com/spreadsheets/d/1hqO_VSk3VeaBwFe9XEdfSUEof1reQmG7qsg4EL7duP0/edit#gid=0

pandaring2you commented 1 year ago

On hold - even at 10x current traffic levels, we should be fine on Peggy2. @jzvikart - can you re-run the load tests to double-check they are working? Once this is done, we can close this task out (and should not have any additional load test work for Peggy2). If we run into sifnode load issues, we will create a separate issue.