emeraldpay / dshackle

Fault Tolerant Load Balancer for Ethereum and Bitcoin APIs
Apache License 2.0

get_block() request very slooow. #253

Open tomih opened 11 months ago

tomih commented 11 months ago

I benchmarked dshackle and found that get_block (by number or by hash) with transactions included is terribly slow.

Here are the average block response times for the last 50 blocks:

- dshackle on node: 1526.1953592300415 ms
- dshackle on local machine: 1431.77649974823 ms
- node: 135.94377040863037 ms

Other calls, like block.latest, get_transaction_receipt or getBalance, return expected results (i.e. they add only a few ms of latency). Can somebody explain why getBlock is so slow and how to tackle the problem?

Edit: the get_block call without transactions is as fast as expected.

Thanks, Tomi
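A benchmark along these lines can be sketched with just the Python standard library. This is a minimal sketch, not the reporter's actual script; the endpoint URL is a placeholder:

```python
import json
import time
import urllib.request

def block_request(number: int, full_tx: bool = True) -> dict:
    """Build a JSON-RPC eth_getBlockByNumber request body."""
    return {
        "jsonrpc": "2.0",
        "id": 1,
        "method": "eth_getBlockByNumber",
        # Second param True = return full transaction objects, not just hashes.
        "params": [hex(number), full_tx],
    }

def time_block_fetch(url: str, number: int) -> float:
    """Return the round-trip time in ms for one full-block fetch."""
    body = json.dumps(block_request(number)).encode()
    req = urllib.request.Request(
        url, data=body, headers={"Content-Type": "application/json"}
    )
    start = time.perf_counter()
    with urllib.request.urlopen(req) as resp:
        resp.read()
    return (time.perf_counter() - start) * 1000

# Example usage against a hypothetical endpoint:
# print(time_block_fetch("http://localhost:8545", 17813705))
```

Averaging `time_block_fetch` over the last 50 block numbers, once against the node and once against dshackle, reproduces the comparison described above.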

splix commented 11 months ago

Are you requesting them with full transaction details or just hashes? Do you have the redis cache enabled?

tomih commented 11 months ago

I'm requesting them with full transactions and don't use the redis cache. I just tested requests via WS (eth_getBlockByHash) with full transactions and it is as fast as expected.

splix commented 11 months ago

Did you make the WS request through Dshackle as well? If yes, can you test again after restarting Dshackle, to make sure it doesn't use any internal caching?

tomih commented 11 months ago

Yup, WS requests go through Dshackle. The current test: I subscribe to newHeads and, on each update, immediately request eth_getBlockByHash with full transactions. Even if I restart Dshackle, the results are very close to a direct connection to the node. Here are my results:

+--------------------------------------+---------------------+--------------------+---------------------+---------------------+---------------------+-----------+
|             Block Number             |       lb-local      |     lb-gcloud      |        node-1       |        node-2       |        infura       | lb-node-1 |
+--------------------------------------+---------------------+--------------------+---------------------+---------------------+---------------------+-----------+
|     New Head for block 17813705      |  0.0558011531829834 | 3.7076148986816406 | 0.29482221603393555 | 0.06288886070251465 |  2.518273115158081  |    0.0    |
| Get Block By Hash for block 17813705 | 0.19244098663330078 | 1.8641810417175293 | 0.09901094436645508 |  0.1606767177581787 | 0.12928104400634766 |    0.0    |
+--------------------------------------+---------------------+--------------------+---------------------+---------------------+---------------------+-----------+

- lb-local: Dshackle on my local machine
- lb-gcloud: Dshackle on gcloud
- lb-node-1: Dshackle on the same server as node-1

The values are the difference in ms from the fastest response.

Edit: lb-local is the instance I restarted before the test.
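The WS flow described above (subscribe to newHeads, then fetch the full block on each update) comes down to two JSON-RPC frames. A minimal sketch of building them with the standard library (the method names are standard Ethereum JSON-RPC; the ids are arbitrary):

```python
import json

def subscribe_frame(frame_id: int = 1) -> str:
    """JSON-RPC frame subscribing to new block headers over WS."""
    return json.dumps({
        "jsonrpc": "2.0",
        "id": frame_id,
        "method": "eth_subscribe",
        "params": ["newHeads"],
    })

def full_block_frame(block_hash: str, frame_id: int = 2) -> str:
    """JSON-RPC frame fetching a block by hash with full transactions."""
    return json.dumps({
        "jsonrpc": "2.0",
        "id": frame_id,
        "method": "eth_getBlockByHash",
        # True = include full transaction objects in the response.
        "params": [block_hash, True],
    })

# On each newHeads notification, send full_block_frame(head["hash"]) over
# the same socket and time the matching response to get per-block latency.
```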

splix commented 11 months ago

Hm, just to confirm your findings: when you request eth_getBlockByHash(hash, true) through HTTP it's slow, but when you make the same request through WS it shows only a small overhead, right? Do you request a recent block in both cases?

I suspect the slow response may happen for an old block, while a request for a recent block may work better. Maybe that's the reason for the difference.

PS: I see why Dshackle can be slow for this particular request, but I don't understand why WS and HTTP would behave differently. That's why I'm asking, to better understand where to look.

tomih commented 11 months ago

Everything you stated is correct: getBlock is slow over HTTP and has almost no overhead over WS. It's also true that the WS test requests only the latest block, while the HTTP test fetches the last 50 blocks.

I will create identical tests for WS and HTTP and let you know. Thanks for the fast response.

tomih commented 11 months ago

You were right. The WS test was fast because it fetched the cached latest block. Here are the results, HTTP vs WS:

Block Number    HTTP Benchmark (ms)  WS Benchmark (ms)
========================================================================================================================
17818823       1707.369089126587   1827.8157711029053
17818822       2408.0188274383545  2526.607036590576
17818821       189.23497200012207  323.7130641937256
17818820       1394.942045211792   1447.16215133667
17818819       677.4783134460449   1110.3830337524414
17818818       260.17189025878906  120.99218368530273
17818817       4065.1087760925293  3823.888063430786
17818816       806.1330318450928   1268.7101364135742
17818815       2133.568048477173   2440.6509399414062
17818814       134.89723205566406  271.1009979248047

You mentioned that you know why getting a block with transactions is slow. Can you elaborate? Are blocks with full transactions not cached? Is there a way to overcome that?

tomih commented 11 months ago

I also tried with redis and results are the same.

splix commented 11 months ago

I had some wrong assumptions about caching this type of request. Dshackle processes the block and the transactions inside it independently. That works fine for recent blocks, because most of the data is in memory. Later the data gets evicted from this "hot data" cache, and the secondary cache doesn't have it either because it's unpopular data, so Dshackle wastes a lot of time checking each transaction independently. I'll need to rewrite this part to make it as fast as it's supposed to be.
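The behavior described can be illustrated with a toy model (the names and structure are illustrative only, not Dshackle's actual code): recent blocks find every transaction in the hot cache, while old blocks miss both tiers and pay one upstream round trip per transaction.

```python
import time

class TwoTierCache:
    """Toy model: a small 'hot' cache plus a secondary cache that only
    keeps popular entries, so old-block transactions usually miss both."""
    def __init__(self, hot: dict, secondary: dict):
        self.hot = hot
        self.secondary = secondary

    def get(self, key):
        # Check the hot tier first, then fall back to the secondary tier.
        return self.hot.get(key) or self.secondary.get(key)

def assemble_block(block: dict, tx_cache: TwoTierCache, upstream_delay_s: float):
    """Fetch each transaction independently; every cache miss simulates
    a round trip to the upstream node. Returns (transactions, miss count)."""
    txs, misses = [], 0
    for tx_hash in block["tx_hashes"]:
        tx = tx_cache.get(tx_hash)
        if tx is None:
            time.sleep(upstream_delay_s)  # simulated upstream round trip
            tx = {"hash": tx_hash}
            misses += 1
        txs.append(tx)
    return txs, misses

# Recent block: all txs still hot -> 0 misses, fast.
# Old block: both tiers empty -> one simulated round trip per tx, slow.
```

With, say, a 10 ms upstream round trip and 200 transactions in a block, a full miss costs about 2 seconds, which matches the magnitude of the slowdowns reported above.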

tomih commented 11 months ago

Is there any ETA for the fix? Or maybe some pointers on how to properly fix the problem? I also noticed very high CPU load even when dshackle is idle (no incoming requests). I traced it down to the scheduler and assume the culprit is there. Thanks, T.

splix commented 11 months ago

I've fixed it in master.

Do you think you can try that version? It's not uploaded to Docker yet, but you can build it locally (join our Discord if you have questions).

splix commented 11 months ago

Btw, how do you measure CPU usage, and what does the high load look like in your case?

tomih commented 11 months ago

Hi, sorry for the late feedback (I was on holiday). I applied the fix, but the results seem the same:

get full block:
Direct node connection:
1. Current block: 17962783, block_size: ' 273033' bytes, block_time: ' 472.039' ms
2. Current block: 17962782, block_size: '  41464' bytes, block_time: ' 131.553' ms
3. Current block: 17962781, block_size: '  62969' bytes, block_time: ' 134.895' ms
4. Current block: 17962780, block_size: ' 157421' bytes, block_time: ' 174.418' ms
5. Current block: 17962779, block_size: ' 217300' bytes, block_time: ' 290.471' ms

Dshackle:
1. Current block: 17962783, block_size: ' 273033' bytes, block_time: '3765.081' ms
2. Current block: 17962782, block_size: '  41464' bytes, block_time: ' 246.481' ms
3. Current block: 17962781, block_size: '  62969' bytes, block_time: ' 257.941' ms
4. Current block: 17962780, block_size: ' 157421' bytes, block_time: '1866.621' ms
5. Current block: 17962779, block_size: ' 217300' bytes, block_time: '2581.968' ms

For CPU, I'm just looking at top (on a 16-CPU machine, idle, no requests):

[Screenshot of top output, 2023-08-21 11:25]
AJolly commented 10 months ago

Having the same issue, specifically for eth_getBlockByNumber with "params": ["latest", true].

I'm using @MysticRyuujin's dappnode package. @tomih, what are you using to benchmark? I'll try it against my instance.

tomih commented 10 months ago

I wrote a simple Python script: https://gist.github.com/tomih/177c40722b06a4de94aff9fa27376b70 Just change or add nodes in the `nodes` variable. Let me know the results.