JackalLabs / canine-chain

The Jackal Protocol blockchain node binary
https://jackalprotocol.com
MIT License

Report - Block Missing issue since v3.2.0 upgrade #432

Closed effofxprime closed 4 months ago

effofxprime commented 7 months ago

Consensus timeouts may be preventing nodes from adding txs to the mempool in time to process the block. Since our last upgrade, I have been attempting to troubleshoot the block-missing issue that many of us validators are facing.

Much of this has been communicated privately with @TheMarstonConnell

I began to notice that when the consensus module begins a new block, nothing happens during timeout_propose. The longer this was set, the worse things got! The idea at first was to give the process more time to finish the mempool work so we could commit the block. Because that caused further delays, it seemed appropriate to lower this setting instead. Currently I have it set to 1500ms and all my delta timeouts are 300ms. This has led to me missing noticeably fewer blocks.
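For reference, a sketch of the [consensus] section of config.toml with the values described above (which exact delta keys "all my delta timeouts" covers is my assumption):

[consensus]
# give up on the proposal sooner and cast a nil prevote instead of idling
timeout_propose = "1500ms"
# "delta" timeouts, i.e. how much each subsequent round adds; all set to 300ms
timeout_propose_delta = "300ms"
timeout_prevote_delta = "300ms"
timeout_precommit_delta = "300ms"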

It seems my initial theory, that I was unable to complete the block because of my older CPU architecture and slower clock speeds compared to newer CPUs, might not be the case. I suspect I might be able to reduce misses further by lowering the timeout even more. However, I want to run with this change for 24 hours first to gather graph stats and data.

effofxprime commented 7 months ago

My current thoughts

Since our v3.2.0 upgrade, a select few validator nodes (mine included) have seen a major change in their block-signing uptime. Because it began with the upgrade, I think it's safe to reason that something about the changes made has introduced this problem to the network.

My equipment

PCIe Gen3.0 Server: https://www.supermicro.com/products/system/1u/1028/sys-1028u-tn10rt_.cfm
Drives: https://store.supermicro.com/us_en/1-92tb-nvme-pcie4-hds-sun0-mzql21t9hcjra7.html
2x E5-2687W v4 CPUs
192GB ECC DDR4 RAM
1x 1TB NVMe drive for the OS
4x 2TB NVMe drives for validators

My configs and settings match that of the current defaults (from new configs) or they are otherwise specified in this post.

Over the past 3 days, I have spent considerable time watching the logs live and monitoring the problem as it happens, attempting to 'catch' the issue and, at a minimum, get a good visual understanding of what's happening as it happens.

Solutions other vals have done

Unfortunately, this has resulted in some providers switching their node to meet this criteria. A cheap solution has been to run the node on Hetzner, which has also concentrated our peer geolocations in Germany. Given that the upgrade appears to have introduced a delay of its own, the growing distance between our nodes only adds to the problem.

I am uncertain of the exact delay the upgrade is causing (i.e. larger data packets from storage providers, claimstrays and how many are done in one go, etc.). I suspect that the optimizations made to reduce bandwidth usage hold some of the answers. It may not be coming from validators but from storage providers and their nodes. There does appear to be a computational delay in certain cases. Receiving IBC packets appears to introduce delays as well. I have watched my node stall for several seconds, beyond the timeouts set, which causes me to stay or fall behind. We have yet to patch IBC, and I am wondering whether doing so will resolve some of the issues. The update is non-consensus-breaking and I have submitted a patch: https://github.com/JackalLabs/canine-chain/pull/431

Most recent changes

Over the last 72 hours, I have discovered several changes that made the most notable difference in keeping my node in sync. I am not posting these as recommended changes for other nodes, but as information feeding into the investigation of the problem at hand.

Consensus Changes

timeout_propose = "1100ms"
peer_gossip_sleep_duration = "13ms"
peer_query_maj23_sleep_duration = "160ms"
  1. I first started with timeout_propose. Prior to changing it, I noticed that no other consensus/mempool actions were completing while waiting for this timeout. Specifically, the timeout was placed in front of the 'work', so the 'work' could not start until the timeout finished. The default is 3s. Changing this setting did not completely resolve the block-missing problem, but during the periods when it happens it has reduced the amount. I visually noticed fewer solid stretches of misses and saw quicker catch-up to the current height.
    • The idea here was to quickly create the next block on our end and throw a nil vote right away, so the node can focus on "work" rather than sit in a "timeout" waiting.
  2. Peer gossip sleep reduction. Defaults are 300ms and 2s respectively; I have drastically lowered them. The idea was to see if this is purely a peer issue. In theory, this should force my node to query peers much more often during consensus, to ensure I am connected to peers I can submit my block signature to. These have had a noticeable effect, but not as great as timeout_propose. My peer limit is currently 42, though I am usually only connected to about 20-30 nodes.
    • My peers range from ones missing blocks to those who don't.
    • Pace and I added each other as persistent peers. This unfortunately passed a negative effect on to Pace's uptime, so I had him remove me; he is still an unconditional peer for me. Since peering with him resulted in him missing more blocks, it stands to reason that part of this issue is my node completing the block as well. Below is an example of the longer computational times I see per block that my node forms.
Apr 08 05:30:10 artorias canined[688713]: 5:30AM INF Timed out dur=-20604.688643 height=7182060 module=consensus round=0 step=1

Here we see that I took an extra 20.6 seconds to complete that block. Our current block time is 5s, so over 25s passed before my node attempted to provide a proof. The interesting thing about this case is that I was most likely behind. This block did not stall the rounds, yet instead of being in Blocksync/Fastsync I was in Consensus.

Logs for that block:

Apr 08 05:30:10 artorias canined[688713]: 5:30AM INF Timed out dur=-20604.688643 height=7182060 module=consensus round=0 step=1
Apr 08 05:30:11 artorias canined[688713]: 5:30AM INF commit is for a block we do not know about; set ProposalBlock=nil commit=1D5E1AB29D6191C5C64CCF0DA70B94C6BA7CA3F01F5B31D35A8B210586826144 commit_round=1 height=7182060 module=consensus proposal={}
Apr 08 05:30:11 artorias canined[688713]: 5:30AM INF received complete proposal block hash=1D5E1AB29D6191C5C64CCF0DA70B94C6BA7CA3F01F5B31D35A8B210586826144 height=7182060 module=consensus
Apr 08 05:30:11 artorias canined[688713]: 5:30AM INF finalizing commit of block hash={} height=7182060 module=consensus num_txs=29 root=90A83F558E3B70A889548F76C3CE3C3105FC5716FB80724A8FD1BFCEDB7BD6D2
Apr 08 05:30:12 artorias canined[688713]: 5:30AM INF Timed out dur=1250 height=7182060 module=consensus round=1 step=3
Apr 08 05:30:23 artorias canined[688713]: 5:30AM INF executed block height=7182060 module=state num_invalid_txs=0 num_valid_txs=29
Apr 08 05:30:23 artorias canined[688713]: 5:30AM INF committed state app_hash=AF14D13B54D044BACBC5B08A3F5F46FC906BB598CAE8F1724B61BD05CED24664 height=7182060 module=state num_txs=29
Apr 08 05:30:23 artorias canined[688713]: 5:30AM INF indexed block exents height=7182060 module=txindex

Thoughts on Possible issues:

Crontab - Every hr restart

I had set my node to restart once every hour. Since doing so, I have seen less downtime in signing blocks over a ~24hr period. As of today, April 8th 2024, my missed count from yesterday to today was 500+ blocks lower after setting the node to restart hourly.

This suggests to me that it's not just a peer issue.

Restarting my node has provided the biggest benefit to reduce missed blocks
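A minimal sketch of the hourly restart described above, assuming canined runs as a systemd service named canined (the cron.d path and unit name are my assumptions; adjust to your setup):

# /etc/cron.d/canined-restart (hypothetical path) - restart the node at the top of every hour
0 * * * * root /usr/bin/systemctl restart canined.service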

Summary of last month of data gathering

From here, I am going to attempt to summarize my findings over the past month. March 30th 2024 - val node ran with debug logging for 24hr: https://thesilverfox.pro/snapshots/jackal-march30-debug.log

Mempool

Not too long ago, we had an upgrade where we were instructed to reduce two mempool settings. This was in response to lowering the block size to 1Mb. Not all nodes did this. It could be contributing to the issue, and we should consider hardcoding it if it isn't already. Unfortunately, the announcement asking for these changes is gone; the entire announcement channel is empty.

I also tried storing the mempool txs in memory. I created a RAM drive and changed the location of the WAL. I didn't see a benefit from doing this. The thought was to provide the fastest possible storage for holding txs, since it seemed like the delay was coming from processing that information.
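Roughly what that experiment looked like, assuming a tmpfs mount at /mnt/mempool-wal (the path and size are my own choices for illustration) and the standard wal_dir key under [mempool] in config.toml:

# mount a RAM-backed filesystem for the mempool WAL
sudo mkdir -p /mnt/mempool-wal
sudo mount -t tmpfs -o size=2G tmpfs /mnt/mempool-wal

# then in config.toml, under [mempool]:
# wal_dir = "/mnt/mempool-wal"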

Consensus

The following changes were made to consensus and have resulted in better uptimes, but they are not a fix/solution to the problem in my honest opinion.

[Screenshot 2024-04-08 055319]

Blocksync / Fastsync

There appears to be a problem where nodes that fall out of sync continue to think they are caught up and haven't fallen behind. I don't see this issue with my other nodes. I suspect this holds some clues as to why we're seeing what we are.

I often see instances where my node is behind, yet it is in consensus instead of blocksync, applying timeouts and attempting to sign already-completed blocks. Restarting fixes this in most cases, but not always. An example of what this can look like in the logs:

Mar 30 17:30:15 artorias canined[338904]: 5:30PM INF executed block height=7085197 module=state num_invalid_txs=0 num_valid_txs=7
Mar 30 17:30:15 artorias canined[338904]: 5:30PM INF commit synced commit=436F6D6D697449447B5B32323320352033322031323420313820323036203339203232382039372032353520313620313533203234302031323420333620313230203137342031313920383420323533203131382032333520363320313330203935203439203134312031333120343120363620313135203231335D3A3643314338447D
Mar 30 17:30:15 artorias canined[338904]: 5:30PM INF committed state app_hash=DF05207C12CE27E461FF1099F07C2478AE7754FD76EB3F825F318D83294273D5 height=7085197 module=state num_txs=7
Mar 30 17:30:15 artorias canined[338904]: 5:30PM INF indexed block exents height=7085197 module=txindex
Mar 30 17:30:15 artorias canined[338904]: 5:30PM INF minted coins from module account amount=6000000ujkl from=jklmint module=x/bank
Mar 30 17:30:21 artorias canined[338904]: 5:30PM INF executed block height=7085198 module=state num_invalid_txs=0 num_valid_txs=11
Mar 30 17:30:21 artorias canined[338904]: 5:30PM INF commit synced commit=436F6D6D697449447B5B343220323438203131342032313820323420353120363620363820333520313430203138392031353520313537203834203135352031383020373220323033203231332033392032343720313639203132312031373720313530203132382031343020323620373220373120323431203135375D3A3643314338457D
Mar 30 17:30:21 artorias canined[338904]: 5:30PM INF committed state app_hash=2AF872DA18334244238CBD9B9D549BB448CBD527F7A979B196808C1A4847F19D height=7085198 module=state num_txs=11
Mar 30 17:30:21 artorias canined[338904]: 5:30PM INF indexed block exents height=7085198 module=txindex
Mar 30 17:30:21 artorias canined[338904]: 5:30PM INF minted coins from module account amount=6000000ujkl from=jklmint module=x/bank
Mar 30 17:30:26 artorias canined[338904]: 5:30PM INF Ignoring message received during sync module=consensus msg={"BlockID":{"hash":"3061F6CE31D799A5E703A5DF55B315A3FA9ACC170CAE8C92BC6DD96BBEFACD04","parts":{"hash":"0317660FCD95874FF6828546467DB53A1323CB27B33F97FFAAC978F38F51C1AC","total":6}},"Height":7085140,"Round":2,"Type":2,"Votes":"xx_xx_x_xx_x__xxxxxxxxxxxxx___x_x_x_x_x_xxxxx___xx_x_x_xxxxxxxxxxx_xxxxxxxxxxxx_xx_xxx_xx_xx_xxx_xx"}
Mar 30 17:30:28 artorias canined[338904]: 5:30PM INF executed block height=7085199 module=state num_invalid_txs=1 num_valid_txs=14
Mar 30 17:30:28 artorias canined[338904]: 5:30PM INF commit synced commit=436F6D6D697449447B5B31353220343420313339203136352032323820323520313339203520313632203131322031343320313735203232342030203131382036302032322039352031313420323135203231392031363420323332203436203138322037203234392032362039352031373720323333203133305D3A3643314338467D
Mar 30 17:30:28 artorias canined[338904]: 5:30PM INF committed state app_hash=982C8BA5E4198B05A2708FAFE000763C165F72D7DBA4E82EB607F91A5FB1E982 height=7085199 module=state num_txs=15
Mar 30 17:30:28 artorias canined[338904]: 5:30PM INF indexed block exents height=7085199 module=txindex
Mar 30 17:30:28 artorias canined[338904]: 5:30PM INF minted coins from module account amount=6000000ujkl from=jklmint module=x/bank

In this snippet we can see a few things. We are in the state module, syncing instead of trying to provide consensus; otherwise we would see timeout log entries. I was sent a consensus signature for block 7085140 when I was synced up to at least block 7085197. It's unclear which peer sent me the message I am ignoring, but the fact that they are sending it seems to be a part of our issues. The current block when I grabbed my logs was 7085303.

When the node is in the consensus module trying to provide signatures, the logs look like:

Apr 08 12:26:10 artorias canined[700708]: 12:26PM INF received complete proposal block hash=A405E60AE8D23E7B61ECC29D820F7142E1DC3F465889B76EE8D9CAC628D99FBF height=7185221 module=consensus
Apr 08 12:26:10 artorias canined[700708]: 12:26PM INF finalizing commit of block hash={} height=7185221 module=consensus num_txs=5 root=DD3CFD9CDF219AD552F0FE6EAFE5A613E93F381546BC4C95C9C0863EBDFD70D4
Apr 08 12:26:10 artorias canined[700708]: 12:26PM INF minted coins from module account amount=6000000ujkl from=jklmint module=x/bank
Apr 08 12:26:11 artorias canined[700708]: 12:26PM INF Timed out dur=1100 height=7185221 module=consensus round=0 step=3
Apr 08 12:26:12 artorias canined[700708]: 12:26PM INF executed block height=7185221 module=state num_invalid_txs=0 num_valid_txs=5
Apr 08 12:26:12 artorias canined[700708]: 12:26PM INF commit synced commit=436F6D6D697449447B5B3432203233342032313720362031383720313031203732203831203137342032343920313731203831203233302039372031303620313133203234392031323420313935203132322032333920313932203535203139332034332039312031333020323231203136312031363020323139203135305D3A3644413334357D
Apr 08 12:26:12 artorias canined[700708]: 12:26PM INF committed state app_hash=2AEAD906BB654851AEF9AB51E6616A71F97CC37AEFC037C12B5B82DDA1A0DB96 height=7185221 module=state num_txs=5
Apr 08 12:26:12 artorias canined[700708]: 12:26PM INF indexed block exents height=7185221 module=txindex

Notice the timeout dur=1100 in the consensus module, showing we are going through a round/step. Above, we can also see that we are finalizing a commit for a block.

Node config files

Since I started my validator node (I have been here since launch), the generated config files have changed. New options have shown up and some options have changed.

config.toml

This is hardcoded IIRC and is okay to be 'missing' unless it's a change the validator/node operator wants to make.

#######################################################
###         Storage Configuration Options           ###
#######################################################

# Set to true to discard ABCI responses from the state store, which can save a
# considerable amount of disk space. Set to false to ensure ABCI responses are
# persisted. ABCI responses are required for /block_results RPC queries, and to
# reindex events in the command-line tool.
discard_abci_responses = false

app.toml

Currently, the [wasm] settings are not present.

[wasm]
# This is the maximum sdk gas (wasm and storage) that we allow for any x/wasm "smart" queries
query_gas_limit = 300000
# This is the number of wasm vm instances we keep cached in memory for speed-up
# Warning: this is currently unstable and may lead to crashes, best to keep for 0 unless testing locally
lru_size = 0

The IAVL cache changed. The prior size was 5781250, and the new default is below. Disabling FastNode also changed: it was previously set to false, and the newly generated configs set it to true.

# IavlCacheSize set the size of the iavl tree cache.
# Default cache size is 50mb.
iavl-cache-size = 781250

# IAVLDisableFastNode enables or disables the fast node feature of IAVL.
# Default is true.
iavl-disable-fastnode = true

These settings do not exist in old configs but are part of the newly generated configs.

# EXPERIMENTAL: IAVLLazyLoading enable/disable the lazy loading of iavl store.
# Default is false.
iavl-lazy-loading = false

As well as this (these are found below the state sync section):

###############################################################################
###                         Store / State Streaming                         ###
###############################################################################

[store]
streamers = []

[streamers]
[streamers.file]
keys = ["*", ]
write_dir = ""
prefix = ""

# output-metadata specifies if output the metadata file which includes the abci request/responses
# during processing the block.
output-metadata = "true"

# stop-node-on-error specifies if propagate the file streamer errors to consensus state machine.
stop-node-on-error = "true"

# fsync specifies if call fsync after writing the files.
fsync = "false"

I have switched to using pebbledb, and it's advised to also set this when using a different database backend. These lines are not present in any way by default.

# AppDBBackend defines the database backend type to use for the application and snapshots DBs.
# An empty string indicates that a fallback will be used.
# First fallback is the deprecated compile-time types.DBBackend value.
# Second fallback (if the types.DBBackend also isn't set), is the db-backend value set in Tendermint's config.toml.

app-db-backend = "pebbledb"
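There is also a matching db_backend key on the Tendermint side; a sketch of the pairing as I understand it, assuming your binary was actually built with pebbledb support:

# config.toml
db_backend = "pebbledb"

# app.toml
app-db-backend = "pebbledb"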

WASM/State sync

I have cleared my addrbook and data dir several times, using state sync from several different providers. This has not resulted in better uptimes or any change that would signify a benefit against the problem at hand.

I contacted several validators and asked them to execute the following command so we could compare wasm folders:

tree ~/.canined/wasm

I compared with others because every time I cleared my WASM cache, it was recreated. I discovered that other validators were not generating a WASM cache at all. In one instance, Pace did not have the WASM contract that all the other validators I asked did have. Of those that I asked, we all hashed the same WASM contract, so I wasn't really concerned about it.

However, Pace is not having consensus errors/missed blocks like the others I contacted, which leaves me wondering whether the WASM contract is causing issues for those still holding it. And if it's not being called, why am I still receiving it from state sync?

I have not tried removing the wasm contract and running my node yet.

Pruning

Sometimes the missed blocks had an obvious periodicity about them. This suggested to me that pruning settings could be interfering, causing delays for validators who can't prune fast enough for their config. I experimented with pruning, running custom, default, and nothing. Unfortunately, I never saw a noticeable positive difference from the changes.

Database type

Changing the database type yielded a large write-speed benefit, and writing is most of what a validator does. Despite the improvement, no real difference was seen in my node's ability to sign blocks. The node tends to sit at the "tip" of the chain, failing to provide its signature fast enough before the proposer commits.

[Screenshot 2024-04-08 120524]

The spike about 2/3 of the way through the graph is where the change from goleveldb to pebbledb happened.

Various other configuration changes

All my config files. erialos-config-files.zip

effofxprime commented 7 months ago

Golang

The Golang docs specifically mention how to spot issues that look like memory leaks, as well as a solution: https://go.dev/doc/gc-guide#Linux_transparent_huge_pages

My kernel does not have Transparent Huge Pages built into it, as I am on a realtime kernel with Ubuntu Pro LTS. While I don't run out of memory, the usage graphs since the upgrade show characteristics that suggest there could be a 'leak'.
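For operators whose kernels do ship THP, a sketch of the checks and the madvise setting the Go guide discusses (standard Linux sysfs paths; persisting the change across reboots is left to your own tooling):

# check the current THP mode; the Go GC guide recommends madvise over always
cat /sys/kernel/mm/transparent_hugepage/enabled

# switch to madvise at runtime (not persistent across reboots)
echo madvise | sudo tee /sys/kernel/mm/transparent_hugepage/enabled

# the guide also discusses khugepaged's max_ptes_none knob
cat /sys/kernel/mm/transparent_hugepage/khugepaged/max_ptes_none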

[Memory usage graph]

I am currently running Go version 1.21.9. It might be worth trying a prior version such as 1.21.6.

High tx's or invalid txs

During the times we have issues, the most commonly missed blocks will have:

effofxprime commented 7 months ago

Since the IBC-Go patch upgrade, more than 2/3 of voting power has upgraded. Network performance has improved and missed blocks have gone down for most validators, and overall have been reduced.

It appears the troubles were coming from IBC GO.

Some clarifications of the above: on restart, nodes will correctly switch from blocksync to consensus. When already running, this doesn't happen.

Reducing how often pruning happens has reduced missed blocks as well. Previously, I had tried pruning nothing and noticed no positive result. I suspect that if you try to prune when the epoch hits, you're likely to miss. I've set it to prune every 1000 blocks and keep 3000.
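A sketch of the corresponding app.toml pruning settings, assuming "keep 3000, prune every 1000" maps onto the standard custom-pruning keys (key names can differ between SDK versions):

# app.toml
pruning = "custom"
# keep the most recent 3000 heights
pruning-keep-recent = "3000"
# run the prune operation every 1000 blocks
pruning-interval = "1000"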

effofxprime commented 6 months ago

At first, the update to IBC-Go appeared to resolve the issue once more validators made the upgrade. However, the problem began to present itself again and has been happening on and off like it was before.

effofxprime commented 4 months ago

Below are charts from a tool I made to analyze block sizes. https://github.com/effofxprime/Block-Buster-Analyzer

Since my last update, the network had a period in May when the congestion issues improved. The scatter plot shows we were generating blocks below 2Mb in size. We also had a period in June when congestion was lighter, and that is also visible in the scatter plot.

In the time between April and now, I have found a few settings that have been helpful. While not a solution that completely resolves the problem, in many instances they alleviated the issue for our node.

In config.toml, under the mempool settings, this crazy shit helped: cache_size = 3000000000. I kept increasing it by adding a 0 on each restart. When I reached this level, I noticed that the running binary allocated more RAM for holding whatever it needs in memory. Prior to this, with default settings, Jackal only attempted to consume ~3Gb of RAM max. This change resulted in the binary consuming up to 70Gb of RAM (when it needs it). I had to monitor this closely, as I reached a point where it did attempt to consume all RAM.
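For context, the key lives in the [mempool] block of config.toml; a sketch with my current value (watch your RAM closely if you copy it):

[mempool]
# number of txs kept in the mempool cache; the stock default is far smaller (10000, IIRC)
cache_size = 3000000000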

Because block time can be slow at times, it's worth updating some timeout settings for RPC-type functions. A few exist in app.toml and config.toml; the read/write timeouts are 10s by default IIRC, and I have changed those to 90s. Our nodes query via RPC, and since we know from the scatter plot that we are making blocks of 3Mb, 5Mb and even larger, pulling all that JSON data just takes longer.
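The knobs I believe this refers to are sketched below; names and defaults vary by Cosmos SDK / Tendermint version, so treat this as an assumption rather than a definitive list:

# app.toml, [api] section - RPC proxy timeouts (seconds)
rpc-read-timeout = 90
rpc-write-timeout = 90

# config.toml, [rpc] section - how long to wait for a tx to be committed
timeout_broadcast_tx_commit = "90s"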

Also, since we are now more aware of our block sizes, it would probably be beneficial for everyone to increase their send_rate and recv_rate settings in config.toml, or to have them hardcoded higher. I have set mine to 15728640 for both settings, which is 15Mb in base-2 bytes.
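These live in the [p2p] section of config.toml; a sketch with the values mentioned above:

[p2p]
# per-connection rate limits, in bytes/second (15 MiB here)
send_rate = 15728640
recv_rate = 15728640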

The recommendation I would have from here is to pass a governance proposal to change our allowed block size. I think a good place to start would be a 3Mb - 5Mb max block size. This may have an adverse effect where the mempool winds up holding a lot of txs in wait as blocks quickly fill to 3Mb or whatever limit is imposed. This can be combated by also lowering our block time to, say, 3s, which should help relieve excessive mempool pressure during congestion. If we limit the block size, we may not need nodes to increase their send/recv rates either, and querying blocks wouldn't need to rely on a larger timeout since the data size would be capped (it might still need to be increased, but not as drastically; testing would be needed to know exactly).
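A rough sketch of what such a proposal could look like if the chain still exposes consensus block params through the legacy x/params route (the subspace/key names, the 3Mb value encoding, and the submission command are assumptions; check them against the SDK version canine-chain runs, and add the deposit when submitting):

# proposal.json (hypothetical param-change proposal lowering max block size to 3Mb)
{
  "title": "Reduce max block size to 3Mb",
  "description": "Lower block max_bytes to ease congestion",
  "changes": [
    {
      "subspace": "baseapp",
      "key": "BlockParams",
      "value": "{\"max_bytes\":\"3145728\",\"max_gas\":\"-1\"}"
    }
  ]
}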

block_sizes_7000000_to_8225100_20240702_172446_segmented_bar_chart block_sizes_7000000_to_8225100_20240702_172446_enhanced_scatter_chart block_sizes_7000000_to_8225100_20240702_172446_scatter_chart