ChainSafe / gossamer

🕸️ Go Implementation of the Polkadot Host
https://chainsafe.github.io/gossamer
GNU Lesser General Public License v3.0

investigation: Westend node sync speed decreased after 18M blocks synced #4095

Open EclesioMeloJunior opened 1 month ago

EclesioMeloJunior commented 1 month ago

Task summary

Gossamer has reached block #18748928. However, for some reason the sync speed has dropped from 30 bps to 1 bps. The sync itself is not impacted; it is only the block execution/import that is taking too long.

Other information and links

Here are the stdout logs of the running node. As you can see, the node is taking too long to execute the batch of 7k blocks.

[image: stdout logs of the running node]

EclesioMeloJunior commented 1 month ago

After a careful investigation using Pyroscope (a Grafana tool for analysing profiling data), I can see that the bottleneck is node.encodeChildrenOpportunisticParallel, which appears to be the most time-consuming task and the cause of the sync slowdown. I initially thought this could be related to ext_crypto_sr25519_verify_version_2, but after looking at the graphs over different time ranges, and at the Self vs. Total values, it looks like the problem is related to encoding the trie.

Since our trie is currently held entirely in memory, executing a block and modifying the trie requires encoding it to generate the state root hash, which is used to validate the state transition function. Because we don't optimize the trie to re-hash only the modified parts (I believe this is already done in the lazy-load trie), we end up needing to encode the whole trie for every imported block, and this gets slower as the state trie grows.
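
To illustrate the "re-hash only the modified parts" idea, here is a minimal sketch. It is not Gossamer's trie: `Node`, `Put`, `Hash` and `HashCalls` are hypothetical names, the encoding is a toy one, and `sha256` from the standard library is only a stand-in for whatever hash the real trie uses. Each node caches its subtree hash, and a write invalidates only the nodes on the path to the changed key.

```go
package main

import (
	"crypto/sha256"
	"sort"
)

// Node is a hypothetical, simplified trie node (one byte per level).
// It is NOT Gossamer's node type; it only illustrates hash caching.
type Node struct {
	Value    []byte
	Children map[byte]*Node
	hash     [32]byte // cached hash of this subtree
	clean    bool     // true while the cached hash is still valid
}

// HashCalls counts how many nodes were actually re-hashed (for the demo).
var HashCalls int

// NewNode returns an empty node.
func NewNode() *Node { return &Node{Children: map[byte]*Node{}} }

// Put stores value under key and marks only the nodes on the path as dirty.
func (n *Node) Put(key, value []byte) {
	n.clean = false
	if len(key) == 0 {
		n.Value = value
		return
	}
	child, ok := n.Children[key[0]]
	if !ok {
		child = NewNode()
		n.Children[key[0]] = child
	}
	child.Put(key[1:], value)
}

// Hash returns the subtree hash, reusing the cached hash of clean subtrees.
func (n *Node) Hash() [32]byte {
	if n.clean {
		return n.hash
	}
	HashCalls++

	// Visit children in sorted key order so the result is deterministic.
	keys := make([]byte, 0, len(n.Children))
	for k := range n.Children {
		keys = append(keys, k)
	}
	sort.Slice(keys, func(i, j int) bool { return keys[i] < keys[j] })

	// Toy encoding: value, then (key byte, child hash) pairs.
	h := sha256.New()
	h.Write(n.Value)
	for _, k := range keys {
		childHash := n.Children[k].Hash()
		h.Write([]byte{k})
		h.Write(childHash[:])
	}
	copy(n.hash[:], h.Sum(nil))
	n.clean = true
	return n.hash
}
```

With this layout, the cost of computing the root after a block is proportional to the number of touched paths rather than to the size of the whole trie, which is the property the full re-encode on every import lacks.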

Here is a nice article to understand the self vs total metric values https://grafana.com/docs/pyroscope/latest/view-and-analyze-profile-data/self-vs-total/
