Open simerplaha opened 2 years ago
Maybe someone else should run this IO count locally? In case I'm making some mistake, 1000 queries per second is a bit too high to be true.
1000 queries per second, that's indeed way too high. Is it the number after syncing?
One thing of interest here: data fetched from multiple nodes is concurrently overwriting data created by one another
Actually we are not fetching the same data. In our sharded blockchain, we can run a clique of multiple nodes, and each node is responsible for a subset of chain indexes. In order to get all the data we need to query every node.
The nodeUris are given at startup, when fetching the clique info.
Maybe someone else should run this IO count locally?
yes I'll do it
Is it the number after syncing?
It's during sync.
Node is here
2022-09-14 17:27:38,757 [dispatcher-13] INFO o.a.f.h.BlockChainHandler - hash: 00000000001ad72c6e0119d8072e609edbea958816fb480b1296a832a44df68b; chain: 2->3; height: 121577/121577; total: 1939669; target hashrate: 22970052 MH/s; blockTime: 23385ms; #tx: 1
Explorer was a fresh start with a clean DB. Graph data is for the first 10 minutes after the start.
In our sharded blockchain, we can run a clique of multiple nodes and each node is responsible for a subset of chain indexes.
This is always super impressive to hear. Very cool!
1000 QPS during syncing makes sense to me. We sync hundreds of blocks per second.
I don't have a nice graph, but while syncing from scratch I have the same result as Simer. Once it's fully synced we get down to around 5-10 queries per second
This looks pretty decent and reasonable to me at the current stage!
I’m not sure how this looks reasonable.
Around 10 would still be high but acceptable for any similar application that is not real-time for writes and primarily does writes as a background batch process, which has the luxury of being efficient.
We are doing 100x what’s reasonable, which is way on the extreme end.
But ok. How do we progress? Close it?
Since this is clear to just me, we for sure cannot close it. I might be wrong.
Let's say we sync 200 blocks per second and each block needs 5 IO queries. Then in total the number of queries per second is 1000. Am I missing anything?
Or do you mean the query is for the number of HTTP requests?
200 blocks per second where each block needs 5 IO queries should not require 1000 IO queries; it only needs 5 queries in total, with each query covering all 200 blocks.
So only 5 queries per second, not 1000.
Normally it shouldn't even need 5 queries. In most cases, anything one wants to achieve can be done in a single query, unless there is some special case where it might need more than one, maybe two.
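For illustration, here's a minimal sketch of what "one query for the whole batch" could look like, using a hypothetical block_headers table (the names and values below are assumptions for the example, not the explorer's actual schema):

```sql
-- Hypothetical table, for illustration only.
CREATE TABLE IF NOT EXISTS block_headers (
  hash       BYTEA PRIMARY KEY,
  parent     BYTEA,
  chain_from INT,
  chain_to   INT,
  height     BIGINT,
  main_chain BOOLEAN NOT NULL DEFAULT false
);

-- A single multi-row INSERT covering an entire batch of synced blocks;
-- in practice the VALUES list would carry all 200 blocks of the batch.
INSERT INTO block_headers (hash, parent, chain_from, chain_to, height, main_chain)
VALUES
  ('\x00aa', '\x0099', 2, 3, 121576, true),
  ('\x00ab', '\x00aa', 2, 3, 121577, true)
ON CONFLICT (hash) DO NOTHING;
```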
Or do you mean the query is for the number of HTTP requests?
I mean database queries.
The main bottleneck is the update of the main chain: to be consistent, for each block we need to check if the parent is in the DB and if it's part of the main chain; if not, we need to update the parent and do the same for its parent. For sure there is room for improvement, but I just wanted to raise that it's not just inserting the data.
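Roughly, the per-block consistency check described above boils down to a pair of statements like these, repeated up the ancestry until an ancestor that is already on the main chain is found (same hypothetical block_headers table as in the sketch above; the hash is just a placeholder):

```sql
-- Is the parent known and already on the main chain?
SELECT main_chain
FROM block_headers
WHERE hash = '\x0099';   -- parent hash of the block being inserted

-- If not, promote it and repeat the same check for its own parent.
UPDATE block_headers
SET main_chain = true
WHERE hash = '\x0099';
```

Each hop up the chain is another round trip, which is where the several queries per block come from.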
As @tdroxler mentioned, those 200 blocks are handled sequentially for now and therefore inserted one by one.
We could optimize it and batch the writes of the 200 blocks, but I'd treat this as premature optimization for now: 1000 queries per second should not be a bottleneck for PostgreSQL, and the IO cost is not an issue either since we are using self-hosted PostgreSQL now.
Sure fellas. Your call.
One thing we could do right now is to get the number of queries per block, and see if we could reduce it.
The costly part is probably this one where we do a lot of back and forth to the DB to get the parent and make sure the main chain is correct.
As we discussed @simerplaha, one of the first steps we could do is to bring BlockFlowSyncService closer to the DB, rather than going through the BlockDAO, but it won't solve the number of IO issues.
Not sure yet how to avoid all those calls when checking the main chain, but maybe a recursive query? I never tried to use one though.
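For what it's worth, a recursive CTE could walk the ancestry in a single round trip. This is only a sketch against the hypothetical block_headers table from the earlier example, not the actual schema, and it ignores the reorg edge cases discussed below:

```sql
-- Starting from a newly inserted block, walk up through its ancestors,
-- stopping once an ancestor that is already on the main chain is reached,
-- then flag everything found along the way as main chain.
WITH RECURSIVE ancestry AS (
  SELECT hash, parent, main_chain
  FROM block_headers
  WHERE hash = '\x00ab'                 -- the new block (already inserted)
  UNION ALL
  SELECT b.hash, b.parent, b.main_chain
  FROM block_headers b
  JOIN ancestry a ON b.hash = a.parent
  WHERE a.main_chain = false            -- stop climbing at the main chain
)
UPDATE block_headers
SET main_chain = true
WHERE hash IN (SELECT hash FROM ancestry);
```

Whether this actually beats the current per-block round trips would of course need measuring.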
@tdroxler This updateMainChain issue can be improved greatly with this approach: for a batch of blocks [block0, block1, ..., blockN], we set main_chain = true and only call updateMainChain once, for block0. In this way, we will reduce the number of updateMainChain calls from 200 to 16.
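In SQL terms that could look something like the sketch below (again on the assumed block_headers table; the remaining single updateMainChain walk for block0 would stay in the existing application logic):

```sql
-- Flag the whole synced batch as main chain in one statement;
-- in practice this would be a parameterized query with the batch's hashes.
UPDATE block_headers
SET main_chain = true
WHERE hash IN ('\x00aa', '\x00ab');   -- hashes of block0 .. blockN
```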
@polarker We used to do this, but there were some small edge cases where it could be problematic and we decided to go the heavy way to be sure everything was right. I can't fully remember them :/
Edit: Maybe one issue was that if we had another main chain and at some point we switch to another fork, we need to make sure to update all blocks from the previous main chain to main_chain = false. Currently, for each block we insert, we check if there are other blocks at that height and set them to false.
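If I read that right, the per-block check would correspond to something like this (same assumed schema; the chain index, height and hash are placeholder values):

```sql
-- Demote any competing blocks at the same height on this chain index,
-- so the newly inserted block is the only main-chain block at that height.
UPDATE block_headers
SET main_chain = false
WHERE chain_from = 2
  AND chain_to   = 3
  AND height     = 121577
  AND hash      <> '\x00ab';
```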
As far as I remember, in the past we did not calculate the longest chain for new blocks and there were reorg edge cases. We chose to update the main chain for each new block, as it was the simplest approach back then.
Yep, that recursive query is going to be fun. I can look at that after working on your previous suggestion here, and after any other changes that are smaller and can be done quicker.
one of the first steps we could do is to bring BlockFlowSyncService closer to the DB
Yep 👍🏼
Will use this graph as a baseline and add more metrics over the next few PRs with the goal of reducing the QPS. Maybe also adding the total number of blocks processed after n seconds might help see by how much sync's overall performance is affected.
Good point, we need to improve our metrics system too.
Working on this issue I kinda needed a proper IO count check because from the code it seemed high.
Looks like we are executing around 1000 queries a second, which is high.
The graph below shows the number of queries executed by DBRunner per second, for the first 10 minutes (while also running a node locally). If you'd like to inspect this graph, I've uploaded it here.
These counts are for BlockFlowSyncService only. Let me know if you'd like to run these IO counts yourself, I will push some code so it's executable.
Is this a valid issue?
Concurrent overwrites
One thing of interest here: data fetched from multiple nodes is concurrently overwriting data created by one another. I'm sure this will eventually lead to the correct database state, but is this concurrent mutation by design/required?