bnb-chain / bsc

A BNB Smart Chain client based on the go-ethereum fork
GNU Lesser General Public License v3.0

Tips for running a BSC full node #502

Closed unclezoro closed 7 months ago

unclezoro commented 2 years ago

Some of the enhancements below can address the existing challenges with running a BSC full node:

Binary

All clients are advised to upgrade to the latest release. The latest version should be more stable and deliver better performance.

Storage

According to our tests, the performance of a full node degrades once its storage size exceeds 1.5 TB. We suggest that full nodes always keep their storage light by pruning it.

The steps to prune are as follows (a scripted sketch follows the list):

  1. Stop the BSC node first.
  2. Run nohup geth snapshot prune-state --datadir {the data dir of your bsc node} &. It will take 3-5 hours to finish.
  3. Start the node once it is done.
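
A minimal scripted sketch of the same three steps, assuming the node runs as a systemd service named bsc and the data directory is /data/bsc (both are assumptions; adjust to your setup):

  # 1. Stop the BSC node first.
  sudo systemctl stop bsc

  # 2. Prune the state; expect roughly 3-5 hours on mainnet data.
  nohup geth snapshot prune-state --datadir /data/bsc > prune.log 2>&1 &
  tail -f prune.log   # wait for "Pruned state data" and "Compacting database"

  # 3. Start the node again once pruning has finished.
  sudo systemctl start bsc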

Maintainers should always keep a few backup nodes so that traffic can be switched to a backup while one node is pruning.

The hardware is also important. Make sure the SSD meets the following: 2 TB of free disk space, solid-state drive (SSD), gp3, 8k IOPS, 250 MB/s throughput, read latency <1 ms.
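
One way to sanity-check the disk before syncing (fio is not part of the original post, just a common benchmarking tool; the file path, block size, and test size are assumptions) is a short random-read test, comparing the reported IOPS and latency against the 8k IOPS / <1 ms / 250 MB/s targets above:

  # ~1 minute random-read benchmark on the data disk.
  fio --name=bsc-disk-check --filename=/data/fio.test --size=4G \
      --rw=randread --bs=4k --direct=1 --ioengine=libaio --iodepth=64 \
      --runtime=60 --time_based --group_reporting
  rm -f /data/fio.test   # clean up the test file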

Light Storage

When the node crashes or is force-killed, it will resync from a block that is a few minutes or a few hours old. This is because the in-memory state is not persisted to the database in real time, and the node needs to replay blocks from the last checkpoint. The replay time depends on the TrieTimeout setting in config.toml. We suggest raising it if you can tolerate a longer replay time, so that the node can keep its storage light.
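
For reference, TrieTimeout sits in the [Eth] section of config.toml and is written as a plain nanosecond count (the value below is only an illustration, not a recommendation from this post):

  # Inspect the current setting; 3600000000000 ns would mean the in-memory
  # trie is flushed once roughly an hour's worth of block processing has accumulated.
  grep -n '^\[Eth\]' config.toml
  grep -n 'TrieTimeout' config.toml   # e.g. TrieTimeout = 3600000000000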

Performance Tuning

In the logs, mgasps indicates the block-processing capability of the full node; make sure the value stays above 50.
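
A quick way to keep an eye on this value (assuming logs are written to a file such as bsc.log; the path is an assumption):

  # Show the mgasps figure from the most recent "Imported new chain segment" lines.
  grep 'Imported new chain segment' bsc.log | grep -o 'mgasps=[0-9.]*' | tail -n 20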

The node can enable profiling with the --pprof flag.

Capture a profile with curl -sK -v http://127.0.0.1:6060/debug/pprof/profile?seconds=60 > profile_60s.out, and the dev community can help analyze the performance.
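
Putting the two together, a sketch of a profiling session (the pprof flags and the go tool step are standard geth/Go tooling, but check your version; the config and data paths are assumptions):

  # Start geth with the pprof HTTP server enabled (default port 6060).
  geth --config ./config.toml --datadir /data/bsc --pprof --pprof.addr 127.0.0.1 --pprof.port 6060

  # Capture a 60-second CPU profile while the node is struggling.
  curl -sK -v "http://127.0.0.1:6060/debug/pprof/profile?seconds=60" > profile_60s.out

  # Optional: take a first look yourself before sharing it with the dev community.
  go tool pprof -top profile_60s.out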

New Node

If you are building a new BSC node, please fetch a snapshot from: https://github.com/binance-chain/bsc-snapshots
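
A sketch of what using such a snapshot looks like (the archive name, format, and target directory below are placeholders; take the real download link and unpacking instructions from the bsc-snapshots page):

  # Download and unpack a snapshot into the node's data directory, then start geth.
  wget -c "<snapshot-url-from-the-repo>" -O geth-snapshot.tar.lz4
  lz4 -cd geth-snapshot.tar.lz4 | tar -xf - -C /data/bsc
  geth --config ./config.toml --datadir /data/bsc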

psdlt commented 2 years ago

@guagualvcha thank you. Could you please elaborate on what exactly DisablePeerTxBroadcast changes? Reading through the code, the only usage I can find is here. It seems to me that if DisablePeerTxBroadcast is set to true, then our node will not receive notifications about pending transactions. Am I missing something?

ghost commented 2 years ago

ERROR[11-02|06:02:55.001] Failed to open snapshot tree err="head doesn't match snapshot: have 0x5c17a8fc0164dabedd446e954b64e8a54fc7c8b4fee1bbd707c3cc3ed1e45fff, want 0x431565cee8b7f3d7bbdde1265304fa4574dc3531e511e9ffe43ae79d28e431d6" head doesn't match snapshot: have 0x5c17a8fc0164dabedd446e954b64e8a54fc7c8b4fee1bbd707c3cc3ed1e45fff, want 0x431565cee8b7f3d7bbdde1265304fa4574dc3531e511e9ffe43ae79d28e431d6

vae520283995 commented 2 years ago

@guagualvcha https://github.com/binance-chain/bsc/issues#issuecomment-956215679 So do we need to wait until the next version to enable diff sync?

Nojoix commented 2 years ago

I don't want to be rude here, but BSC is in real danger. These past couple of weeks have been a nightmare for me: I can't resync. I started digging, got in touch with admins, etc., and it's not just me. Geth 1.1.3 was a nightmare and 1.1.4 is not helping much; the solution you give here doesn't solve anything. If you guys don't figure out the syncing issue with a proper patch, we don't have a bright future. Yesterday I tried to download the EU snapshot; it was corrupted, and the full state from today seems corrupted as well (retrying a download).

de-ltd commented 2 years ago

I don't want to be rude here, but BSC is in real danger. These past couple of weeks have been a nightmare for me: I can't resync. I started digging, got in touch with admins, etc., and it's not just me. Geth 1.1.3 was a nightmare and 1.1.4 is not helping much; the solution you give here doesn't solve anything. If you guys don't figure out the syncing issue with a proper patch, we don't have a bright future. Yesterday I tried to download the EU snapshot; it was corrupted, and the full state from today seems corrupted as well (retrying a download).

Agreed, it's a total mess.

unclezoro commented 2 years ago

@guagualvcha thank you. Could you please elaborate on what exactly DisablePeerTxBroadcast changes? Reading through the code, the only usage I can find is here. It seems to me that if DisablePeerTxBroadcast is set to true, then our node will not receive notifications about pending transactions. Am I missing something?

Ethereum is a grid network, while BSC is an areatus network. This means transactions flow from different full nodes all around the world to the 21 validators. Usually validators are guarded by a sentry node that joins the network directly. As the transaction volume on BSC is much larger, the sentry nodes are under pressure dealing with the transaction exchange protocol. We extended the protocol so that any full node can declare that it is not interested in pending transactions, as it is not a validator/miner; this saves a lot of network and computation resources. It can be enabled by adding DisablePeerTxBroadcast = true under the [Eth] module of the config.toml file.
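
As a concrete sketch (file locations and the service name are assumptions; the setting itself is quoted from the paragraph above), the change is a one-line addition to the [Eth] section followed by a restart:

  # config.toml -- inside the existing [Eth] section, add:
  #   DisablePeerTxBroadcast = true
  grep -n 'DisablePeerTxBroadcast' /data/bsc/config.toml   # verify the line is present
  sudo systemctl restart bsc                               # restart with the new config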

unclezoro commented 2 years ago

@guagualvcha https://github.com/binance-chain/bsc/issues#issuecomment-956215679 So do we need to wait until the next version to enable diff sync?

No need to wait.

unclezoro commented 2 years ago

I don't want to be rude here, but BSC is in real danger. These past couple of weeks have been a nightmare for me: I can't resync. I started digging, got in touch with admins, etc., and it's not just me. Geth 1.1.3 was a nightmare and 1.1.4 is not helping much; the solution you give here doesn't solve anything. If you guys don't figure out the syncing issue with a proper patch, we don't have a bright future. Yesterday I tried to download the EU snapshot; it was corrupted, and the full state from today seems corrupted as well (retrying a download).

Sorry about that. As far as I know, the ops team is uploading a new snapshot now that they are aware of it, along with some monitoring to ensure data integrity. For the syncing issue, would you open the pprof port on your node, run curl -sK -v http://127.0.0.1:6060/debug/pprof/profile?seconds=60 > profile_60s.out, and upload the profile file? I can help check it.

vae520283995 commented 2 years ago

No need to wait.

Is v1.1.3 or v1.1.4 recommended now?

ghost commented 2 years ago

Is v1.1.3 or v1.1.4 recommended now?

I have upgraded to 1.1.4 and started the node with snapshot data.

vae520283995 commented 2 years ago

@guagualvcha Is the pruning done?

INFO [11-04|14:01:41.317] Pruning state data                       nodes=6,918,707,201 size=1.94TiB    elapsed=8h11m1.703s  eta=55.513s
INFO [11-04|14:01:49.318] Pruning state data                       nodes=6,920,547,567 size=1.94TiB    elapsed=8h11m9.704s  eta=47.676s
INFO [11-04|14:01:57.318] Pruning state data                       nodes=6,922,390,198 size=1.94TiB    elapsed=8h11m17.704s eta=39.822s
INFO [11-04|14:02:05.320] Pruning state data                       nodes=6,924,202,706 size=1.95TiB    elapsed=8h11m25.706s eta=32.105s
INFO [11-04|14:02:13.320] Pruning state data                       nodes=6,926,075,421 size=1.95TiB    elapsed=8h11m33.706s eta=24.125s
INFO [11-04|14:02:21.324] Pruning state data                       nodes=6,927,973,074 size=1.95TiB    elapsed=8h11m41.710s eta=16.051s
INFO [11-04|14:02:29.324] Pruning state data                       nodes=6,929,789,240 size=1.95TiB    elapsed=8h11m49.710s eta=8.311s
INFO [11-04|14:02:37.327] Pruning state data                       nodes=6,931,501,962 size=1.95TiB    elapsed=8h11m57.713s eta=1.019s
INFO [11-04|14:02:38.439] Pruned state data                        nodes=6,931,741,625 size=1.95TiB    elapsed=8h11m58.825s
INFO [11-04|14:02:41.037] Compacting database                      range=0x00-0x10 elapsed="3.329µs"

Lajoix commented 2 years ago

Sorry about that. As far as I know, the ops team is uploading a new snapshot now that they are aware of it, along with some monitoring to ensure data integrity. For the syncing issue, would you open the pprof port on your node, run curl -sK -v http://127.0.0.1:6060/debug/pprof/profile?seconds=60 > profile_60s.out, and upload the profile file? I can help check it.

Well, the mirrors are gone, and for now I don't have a node, as I can't sync from genesis (and can't download snapshots).

I tried your method again (tried twice) and got: Failed to store fast sync trie progress err="leveldb/table: corruption on data-block (pos=2051287): checksum mismatch, want=0x548fe288 got=0x525970c2 [file=060694.ldb]"

To be honest, it's very confusing. I used to fast-resync from genesis; it was 6-9 hours to import state entries and then 3 days to download everything.

jcaffet commented 2 years ago

I don't want to be rude here, but BSC is in real danger. These past couple of weeks have been a nightmare for me: I can't resync. I started digging, got in touch with admins, etc., and it's not just me. Geth 1.1.3 was a nightmare and 1.1.4 is not helping much; the solution you give here doesn't solve anything. If you guys don't figure out the syncing issue with a proper patch, we don't have a bright future. Yesterday I tried to download the EU snapshot; it was corrupted, and the full state from today seems corrupted as well (retrying a download).

I totally agree with that, as we face exactly the same scenario. What is worst is the total lack of communication compared with the size of the project.

psdlt commented 2 years ago

@Lajoix @jcaffet folks, what kind of file systems do you use on your servers? I've read somewhere that xfs is better than ext4 (sorry, don't remember where; don't have a link). Also, do you use a single disk or RAID? I have one server on AWS (i3en.2xlarge, RAID0, xfs), been running it for ~half a year, never had the issues you're describing. A few days ago I set up another server on Vultr (also RAID0, also xfs) - it fast-synced to the latest block from scratch in under a day.

If you're constantly having sync issues and can't catch up to the network - review your hardware setup, maybe spin up a different instance in a different region, maybe you just got a busy host, who knows.
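
For anyone wanting to reproduce the RAID0 + xfs layout described above, a rough sketch (device names and mount point are assumptions; double-check against your own hardware before running anything):

  # Stripe two NVMe drives into one array, format it as xfs, and mount it.
  sudo mdadm --create /dev/md0 --level=0 --raid-devices=2 /dev/nvme1n1 /dev/nvme2n1
  sudo mkfs.xfs /dev/md0
  sudo mkdir -p /data && sudo mount /dev/md0 /data
  echo '/dev/md0 /data xfs defaults,noatime 0 0' | sudo tee -a /etc/fstab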

jcaffet commented 2 years ago

@Lajoix @jcaffet folks, what kind of file systems do you use on your servers? I've read somewhere that xfs is better than ext4 (sorry, don't remember where; don't have a link). Also, do you use a single disk or RAID? I have one server on AWS (i3en.2xlarge, RAID0, xfs), been running it for ~half a year, never had the issues you're describing. A few days ago I set up another server on Vultr (also RAID0, also xfs) - it fast-synced to the latest block from scratch in under a day.

If you're constantly having sync issues and can't catch up to the network - review your hardware setup, maybe spin up a different instance in a different region, maybe you just got a busy host, who knows.

Thanks for your feedback. We have AWS i3en.xlarge instances with xfs ... but no RAID0 yet. We were using ext4 and also recently moved to xfs (info in https://github.com/binance-chain/bsc/issues/189). We have run a node for months but have faced issues since the middle of last week.

Lajoix commented 2 years ago

@Lajoix @jcaffet folks, what kind of file systems do you use on your servers? I've read somewhere that xfs is better than ext4 (sorry, don't remember where; don't have a link). Also, do you use a single disk or RAID? I have one server on AWS (i3en.2xlarge, RAID0, xfs), been running it for ~half a year, never had the issues you're describing. A few days ago I set up another server on Vultr (also RAID0, also xfs) - it fast-synced to the latest block from scratch in under a day.

If you're constantly having sync issues and can't catch up to the network - review your hardware setup, maybe spin up a different instance in a different region, maybe you just got a busy host, who knows.

I'm running 24 cores, 64 GB of RAM, and a 1 TB NVMe SSD on Ubuntu with a 1 Gbps connection. I'm on ext4, but I'm not sure changing to xfs would really make the difference. I used to resync from genesis with no issue. I just tried another fast sync from genesis and got this error:

Failed to update chain markers error="leveldb/table: corruption on data-block (pos=1359583): checksum mismatch, want=0xdeca7719 got=0xc3696716 [file=351971.ldb]"

I'm very confused, to be honest.

Cwsor commented 2 years ago

@Lajoix @jcaffet folks, what kind of file systems do you use on your servers? I've read somewhere that xfs is better than ext4 (sorry, don't remember where; don't have a link). Also, do you use a single disk or RAID? I have one server on AWS (i3en.2xlarge, RAID0, xfs), been running it for ~half a year, never had the issues you're describing. A few days ago I set up another server on Vultr (also RAID0, also xfs) - it fast-synced to the latest block from scratch in under a day. If you're constantly having sync issues and can't catch up to the network - review your hardware setup, maybe spin up a different instance in a different region, maybe you just got a busy host, who knows.

Thanks for your feedback. We have AWS i3en.xlarge instances with xfs ... but no RAID0 yet. We were using ext4 and also recently moved to xfs (info in #189). We have run a node for months but have faced issues since the middle of last week.

This timeline lines up with my experience as well. The node ran fine for months until about a week ago. The current server is a Ryzen 5950 with 128 GB RAM + NVMe SSD on RAID1. I've seen others with lesser specs have no issues, but I have been unable to keep up with the current block; I'm always about 50 behind.

unclezoro commented 2 years ago

@guagualvcha Is the pruning done?

INFO [11-04|14:01:41.317] Pruning state data                       nodes=6,918,707,201 size=1.94TiB    elapsed=8h11m1.703s  eta=55.513s
INFO [11-04|14:01:49.318] Pruning state data                       nodes=6,920,547,567 size=1.94TiB    elapsed=8h11m9.704s  eta=47.676s
INFO [11-04|14:01:57.318] Pruning state data                       nodes=6,922,390,198 size=1.94TiB    elapsed=8h11m17.704s eta=39.822s
INFO [11-04|14:02:05.320] Pruning state data                       nodes=6,924,202,706 size=1.95TiB    elapsed=8h11m25.706s eta=32.105s
INFO [11-04|14:02:13.320] Pruning state data                       nodes=6,926,075,421 size=1.95TiB    elapsed=8h11m33.706s eta=24.125s
INFO [11-04|14:02:21.324] Pruning state data                       nodes=6,927,973,074 size=1.95TiB    elapsed=8h11m41.710s eta=16.051s
INFO [11-04|14:02:29.324] Pruning state data                       nodes=6,929,789,240 size=1.95TiB    elapsed=8h11m49.710s eta=8.311s
INFO [11-04|14:02:37.327] Pruning state data                       nodes=6,931,501,962 size=1.95TiB    elapsed=8h11m57.713s eta=1.019s
INFO [11-04|14:02:38.439] Pruned state data                        nodes=6,931,741,625 size=1.95TiB    elapsed=8h11m58.825s
INFO [11-04|14:02:41.037] Compacting database                      range=0x00-0x10 elapsed="3.329µs"

yes

charliedimaggio commented 2 years ago

I don't want to be rude here, but BSC is in real danger. These past couple of weeks have been a nightmare for me: I can't resync. I started digging, got in touch with admins, etc., and it's not just me. Geth 1.1.3 was a nightmare and 1.1.4 is not helping much; the solution you give here doesn't solve anything. If you guys don't figure out the syncing issue with a proper patch, we don't have a bright future. Yesterday I tried to download the EU snapshot; it was corrupted, and the full state from today seems corrupted as well (retrying a download).

Sorry about that. As far as I know, the ops team is uploading a new snapshot now that they are aware of it, along with some monitoring to ensure data integrity. For the syncing issue, would you open the pprof port on your node, run curl -sK -v http://127.0.0.1:6060/debug/pprof/profile?seconds=60 > profile_60s.out, and upload the profile file? I can help check it.

Would it be possible to have some guidance on what tools, if any, we can use to analyse the profile_60s.out file ourselves?

Sharp-Lee commented 2 years ago

@guagualvcha thank you. Could you please elaborate on what exactly DisablePeerTxBroadcast changes? Reading through the code, the only usage I can find is here. It seems to me that if DisablePeerTxBroadcast is set to true, then our node will not receive notifications about pending transactions. Am I missing something?

Ethereum is a grid network, while BSC is an areatus network. This means transactions flow from different full nodes all around the world to the 21 validators. Usually validators are guarded by a sentry node that joins the network directly. As the transaction volume on BSC is much larger, the sentry nodes are under pressure dealing with the transaction exchange protocol. We extended the protocol so that any full node can declare that it is not interested in pending transactions, as it is not a validator/miner; this saves a lot of network and computation resources. It can be enabled by adding DisablePeerTxBroadcast = true under the [Eth] module of the config.toml file.

So, does that mean I cannot subscribe to pendingTransactions?

barryz commented 2 years ago

After upgrading to v1.1.3 and running the service with your suggested settings, I got stuck syncing. In brief, the error log shows the following:

lvl=eror msg="\n########## BAD BLOCK #########\nChain config: {ChainID: 56 Homestead: 0 DAO: <nil> DAOSupport: false EIP150: 0 EIP155: 0 EIP158: 0 Byzantium: 0 Constantinople: 0 Petersburg: 0 Istanbul: 0, Muir Glacier: 0, Ramanujan: 0, Niels: 0, MirrorSync: 5184000, Berlin: <nil>, YOLO v3: <nil>, Engine: parlia}\n\nNumber: 12384585\nHash: 0xc8f9d3fea1fe05242ed575dec20dbf781a06b978ca3de108fdb2cda4cbae9636\n\t 0: cumulative: 21139 gas: 21139 contract: 0x0000000000000000000000000000000000000000 status: 1 tx: 0xe2059bfb1cc59b2a45887f5cc6e220884b7bb3ae4d2e1702d9a2d5ff17680c1d logs 
... ...
expected tx hash 0xd5a6db9b741519b1332fd32da482c5ddb49d97a361ef8868d256698995c9e871, get 0x78b01570f90ec93f244d3680075db0aca374fc5c54710c8564669b17e6c54099, nonce 613215, to 0x0000000000000000000000000000000000001000, value 206954075983563310, gas 9223372036854775807, gasPrice 0, data f340fa010000000000000000000000002d4c407bbe49438ed859fe965b140dcf1aab71a9\n##############################\n

hdiass commented 2 years ago

I don't want to be rude here, but BSC is in real danger. These past couple of weeks have been a nightmare for me: I can't resync. I started digging, got in touch with admins, etc., and it's not just me. Geth 1.1.3 was a nightmare and 1.1.4 is not helping much; the solution you give here doesn't solve anything. If you guys don't figure out the syncing issue with a proper patch, we don't have a bright future. Yesterday I tried to download the EU snapshot; it was corrupted, and the full state from today seems corrupted as well (retrying a download).

I totally agree with that, as we face exactly the same scenario. What is worst is the total lack of communication compared with the size of the project.

Totally agree

gomes7997 commented 2 years ago

I'm following all of these recommendations, and I'm using the recommended AWS hardware for a validator node, even though I'm only operating a regular node. My node still can't catch up after starting from the latest snapshot this morning. The likely problem is that there are not enough healthy nodes in the network providing blocks to nodes that go out of sync. Do you have any suggestions for this problem? Can Binance provide healthy nodes to ensure that others can sync?

Crypto2 commented 2 years ago

It's probably more a problem of too many blocks and/or too large a size, so it takes too long to process them, rather than any propagation issue.

gomes7997 commented 2 years ago

No, it's not. I've regularly seen higher processing throughput on the same node hardware under healthier network conditions. It's reported as mgasps (millions of gas units per second) in the "Imported new chain segment" message. Increasing the block size wouldn't change the node's rate of processing gas units. The bottleneck is that the node likely isn't receiving enough new blocks to reach its throughput capacity. I'm using the recommended validator hardware.

Sharp-Lee commented 2 years ago

@Lajoix @jcaffet folks, what kind of file systems do you use on your servers? I've read somewhere that xfs is better than ext4 (sorry, don't remember where; don't have a link). Also, do you use a single disk or RAID? I have one server on AWS (i3en.2xlarge, RAID0, xfs), been running it for ~half a year, never had the issues you're describing. A few days ago I set up another server on Vultr (also RAID0, also xfs) - it fast-synced to the latest block from scratch in under a day.

If you're constantly having sync issues and can't catch up to the network - review your hardware setup, maybe spin up a different instance in a different region, maybe you just got a busy host, who knows.

Can you share your start command, please?

psdlt commented 2 years ago

@Sharp-Lee nothing special about my startup: geth --config config.toml --datadir /data --cache 16000

Sharp-Lee commented 2 years ago

@Sharp-Lee nothing special about my startup: geth --config config.toml --datadir /data --cache 16000

thanks!

vae520283995 commented 2 years ago

After following the tips for two days, the node is out of sync again.

litebarb commented 2 years ago

Hi guys. I have been running a BSC node for months without any issues. Two months ago, after pruning, the database size was around 600 GB. Right now, the database size after pruning is around 770 GB. As I only have around 900 GB of usable SSD space, I have to turn my node off to prune every few days, and this will only become more frequent (until real-time pruning is available). I noticed the snapshots are also sized at around ~750 GB. Are there any commands I can use to execute 'heavier-duty' pruning to compact the database down to 600 GB or less? Also, other than the TrieTimeout setting, what other options should/can I use to prevent the database size from increasing too quickly? Currently TrieTimeout = 80000000000000 for me, which is already a pretty long duration before persisting happens.

quchenhao commented 2 years ago

@guagualvcha thank you. Could you please elaborate on what exactly DisablePeerTxBroadcast changes? Reading through the code, the only usage I can find is here. It seems to me that if DisablePeerTxBroadcast is set to true, then our node will not receive notifications about pending transactions. Am I missing something?

Ethereum is a grid network, while BSC is an areatus network. This means transactions flow from different full nodes all around the world to the 21 validators. Usually validators are guarded by a sentry node that joins the network directly. As the transaction volume on BSC is much larger, the sentry nodes are under pressure dealing with the transaction exchange protocol. We extended the protocol so that any full node can declare that it is not interested in pending transactions, as it is not a validator/miner; this saves a lot of network and computation resources. It can be enabled by adding DisablePeerTxBroadcast = true under the [Eth] module of the config.toml file.

Then if only the validators are receiving transactions, there is a risk that a transaction never reaches a validator because no one on the path broadcasts it to the next node. How do you solve that issue?

Marijus commented 2 years ago
  1. Do you need to initialize genesis when trying to sync from snapshot?
  2. Do you need to provide --syncmode snap --snapshot="true" when trying to sync from snapshot?

kkbao13 commented 2 years ago

Hi guys. I have been running a BSC node for months without any issues. Two months ago, after pruning, the database size was around 600 GB. Right now, the database size after pruning is around 770 GB. As I only have around 900 GB of usable SSD space, I have to turn my node off to prune every few days, and this will only become more frequent (until real-time pruning is available). I noticed the snapshots are also sized at around ~750 GB. Are there any commands I can use to execute 'heavier-duty' pruning to compact the database down to 600 GB or less? Also, other than the TrieTimeout setting, what other options should/can I use to prevent the database size from increasing too quickly? Currently TrieTimeout = 80000000000000 for me, which is already a pretty long duration before persisting happens.

Agreed, my node size is around 2 TB 🤷‍♂️

Crypto2 commented 2 years ago

@quchenhao We personally broadcast each TX to both our node and the Binance-run public node, it helps a lot with TXes that previously just disappeared into the ether.

quchenhao commented 2 years ago

@quchenhao We personally broadcast each TX to both our node and the Binance-run public node, it helps a lot with TXes that previously just disappeared into the ether.

With this change, bsc is becoming more and more centralized as only a few nodes in the whole network are processing the transactions.

Crypto2 commented 2 years ago

Yeah it's not good but otherwise it's a bunch of user complaints :(

nav1d commented 2 years ago

Some of the enhancements below can address the existing challenges with running a BSC full node:

Prune the state. To prune the stale data and make the storage lighter:

  1. Stop the bsc node first.
  2. Run nohup geth snapshot prune-state --datadir {the data dir of your bsc node} &. It will take 3-5 hours to finish.
  3. Start the node once it is done.

Enable diff sync and disable unnecessary transaction exchange.

  1. Stop the bsc node first.
  2. Add DisablePeerTxBroadcast = true under the [Eth] module of the config.toml file.
  3. Add --diffsync to the start command.
  4. Start the node.

If you build a new bsc node, please fetch snapshot from: https://github.com/binance-chain/bsc-snapshots

Please give us more info about the "diffsync" functionality, since it says security will be downgraded to that of a light client... I don't understand what this means. I would also be happy if you could tell me whether the "DisablePeerTxBroadcast" option has any effect on transactions submitted via the RPC protocol or not.

kkbao13 commented 2 years ago

Hi guys. I have been running a BSC node for months without any issues. Two months ago, after pruning, the database size was around 600 GB. Right now, the database size after pruning is around 770 GB. As I only have around 900 GB of usable SSD space, I have to turn my node off to prune every few days, and this will only become more frequent (until real-time pruning is available). I noticed the snapshots are also sized at around ~750 GB. Are there any commands I can use to execute 'heavier-duty' pruning to compact the database down to 600 GB or less? Also, other than the TrieTimeout setting, what other options should/can I use to prevent the database size from increasing too quickly? Currently TrieTimeout = 80000000000000 for me, which is already a pretty long duration before persisting happens.

I pruned the node, but its size is still 2 TB. What should I do? What should I delete?

litebarb commented 2 years ago

Hi guys. I have been running a BSC node for months without any issues. Two months ago, after pruning, the database size was around 600 GB. Right now, the database size after pruning is around 770 GB. As I only have around 900 GB of usable SSD space, I have to turn my node off to prune every few days, and this will only become more frequent (until real-time pruning is available). I noticed the snapshots are also sized at around ~750 GB. Are there any commands I can use to execute 'heavier-duty' pruning to compact the database down to 600 GB or less? Also, other than the TrieTimeout setting, what other options should/can I use to prevent the database size from increasing too quickly? Currently TrieTimeout = 80000000000000 for me, which is already a pretty long duration before persisting happens.

I pruned the node, but its size is still 2 TB. What should I do? What should I delete?

@kkbao13 I used the standard prune command, nothing fancy: ./geth --datadir ./node snapshot prune-state. I think it's because you are running an ARCHIVE node (--gcmode=archive). I am running a full, non-archive node (--gcmode=full).

unclezoro commented 2 years ago

@guagualvcha thank you. Could you please elaborate on what exactly DisablePeerTxBroadcast changes? Reading through the code, the only usage I can find is here. It seems to me that if DisablePeerTxBroadcast is set to true, then our node will not receive notifications about pending transactions. Am I missing something?

Ethereum is a grid network, while BSC is an areatus network. This means transactions flow from different full nodes all around the world to the 21 validators. Usually validators are guarded by a sentry node that joins the network directly. As the transaction volume on BSC is much larger, the sentry nodes are under pressure dealing with the transaction exchange protocol. We extended the protocol so that any full node can declare that it is not interested in pending transactions, as it is not a validator/miner; this saves a lot of network and computation resources. It can be enabled by adding DisablePeerTxBroadcast = true under the [Eth] module of the config.toml file.

Then if only the validators are receiving transactions, there is a risk that a transaction never reaches a validator because no one on the path broadcasts it to the next node. How do you solve that issue?

That is why this is optional. The bootstrap nodes and public p2p nodes won't enable it.

acswap commented 2 years ago

@guagualvcha Is the pruning done?

INFO [11-04|14:01:41.317] Pruning state data                       nodes=6,918,707,201 size=1.94TiB    elapsed=8h11m1.703s  eta=55.513s
INFO [11-04|14:01:49.318] Pruning state data                       nodes=6,920,547,567 size=1.94TiB    elapsed=8h11m9.704s  eta=47.676s
INFO [11-04|14:01:57.318] Pruning state data                       nodes=6,922,390,198 size=1.94TiB    elapsed=8h11m17.704s eta=39.822s
INFO [11-04|14:02:05.320] Pruning state data                       nodes=6,924,202,706 size=1.95TiB    elapsed=8h11m25.706s eta=32.105s
INFO [11-04|14:02:13.320] Pruning state data                       nodes=6,926,075,421 size=1.95TiB    elapsed=8h11m33.706s eta=24.125s
INFO [11-04|14:02:21.324] Pruning state data                       nodes=6,927,973,074 size=1.95TiB    elapsed=8h11m41.710s eta=16.051s
INFO [11-04|14:02:29.324] Pruning state data                       nodes=6,929,789,240 size=1.95TiB    elapsed=8h11m49.710s eta=8.311s
INFO [11-04|14:02:37.327] Pruning state data                       nodes=6,931,501,962 size=1.95TiB    elapsed=8h11m57.713s eta=1.019s
INFO [11-04|14:02:38.439] Pruned state data                        nodes=6,931,741,625 size=1.95TiB    elapsed=8h11m58.825s
INFO [11-04|14:02:41.037] Compacting database                      range=0x00-0x10 elapsed="3.329µs"

nohup geth snapshot prune-state --datadir /root/node/data/geth/chaindata &

This is the pruning command I used, but it didn't work. Is my directory path wrong? Could I see the prune command you used, for reference?

jcaffet commented 2 years ago

@Lajoix @jcaffet folks, what kind of file systems do you use on your servers? I've read somewhere that xfs is better than ext4 (sorry, don't remember where; don't have a link). Also, do you use a single disk or RAID? I have one server on AWS (i3en.2xlarge, RAID0, xfs), been running it for ~half a year, never had the issues you're describing. A few days ago I set up another server on Vultr (also RAID0, also xfs) - it fast-synced to the latest block from scratch in under a day. If you're constantly having sync issues and can't catch up to the network - review your hardware setup, maybe spin up a different instance in a different region, maybe you just got a busy host, who knows.

Can you share your start command, please?

We have tried both : geth --config "/etc/bsc/config.toml" --datadir "/data" --cache "18000" --syncmode fast --snapshot="false" --rpc.allow-unprotected-txs --txlookuplimit "0" --http --http.addr "0.0.0.0" --http.vhosts "" --http.corsdomain ""

geth --config "/etc/bsc/config.toml" --datadir "/data" --cache "18000" --syncmode snap --snapshot="true" --rpc.allow-unprotected-txs --txlookuplimit "0" --http --http.addr "0.0.0.0" --http.vhosts "" --http.corsdomain ""

Our experience :

Marijus commented 2 years ago

@Lajoix @jcaffet folks, what kind of file systems do you use on your servers? I've read somewhere that xfs is better than ext4 (sorry, don't remember where; don't have a link). Also, do you use a single disk or RAID? I have one server on AWS (i3en.2xlarge, RAID0, xfs), been running it for ~half a year, never had the issues you're describing. A few days ago I set up another server on Vultr (also RAID0, also xfs) - it fast-synced to the latest block from scratch in under a day.

If you're constantly having sync issues and can't catch up to the network - review your hardware setup, maybe spin up a different instance in a different region, maybe you just got a busy host, who knows.

Which Vultr server did you deploy? It looks like the largest NVMe they have is 768 GB, which is not enough for a node.

psdlt commented 2 years ago

@Marijus their server offerings differ based on region and selected OS. Check Bare Metal > Debian 11 > Silicon Valley. Some other regions too, not all though.

Marijus commented 2 years ago

@psdlt That's great! Thanks! Did you simply run the fast sync command (without the snapshot)? Something similar to:

geth --config ./config.toml --datadir ./node --cache 8000 --rpc.allow-unprotected-txs --txlookuplimit 0

psdlt commented 2 years ago

@Marijus yes, just a basic start-up without a snapshot. Regarding Vultr, you should be aware that their 2xNVMe means just that - 2xNVMe (as opposed to AWS, where you get a small EBS volume for the system and then 2xNVMe). Furthermore, their UI doesn't let you set up an instance with RAID0, so you have to do the whole iPXE setup dance, install your OS of choice manually, configure RAID0 during install, etc. Not the end of the world, but it took me some googling to figure out.

acswap commented 2 years ago

@guagualvcha Is the pruning done?

INFO [11-04|14:01:41.317] Pruning state data                       nodes=6,918,707,201 size=1.94TiB    elapsed=8h11m1.703s  eta=55.513s
INFO [11-04|14:01:49.318] Pruning state data                       nodes=6,920,547,567 size=1.94TiB    elapsed=8h11m9.704s  eta=47.676s
INFO [11-04|14:01:57.318] Pruning state data                       nodes=6,922,390,198 size=1.94TiB    elapsed=8h11m17.704s eta=39.822s
INFO [11-04|14:02:05.320] Pruning state data                       nodes=6,924,202,706 size=1.95TiB    elapsed=8h11m25.706s eta=32.105s
INFO [11-04|14:02:13.320] Pruning state data                       nodes=6,926,075,421 size=1.95TiB    elapsed=8h11m33.706s eta=24.125s
INFO [11-04|14:02:21.324] Pruning state data                       nodes=6,927,973,074 size=1.95TiB    elapsed=8h11m41.710s eta=16.051s
INFO [11-04|14:02:29.324] Pruning state data                       nodes=6,929,789,240 size=1.95TiB    elapsed=8h11m49.710s eta=8.311s
INFO [11-04|14:02:37.327] Pruning state data                       nodes=6,931,501,962 size=1.95TiB    elapsed=8h11m57.713s eta=1.019s
INFO [11-04|14:02:38.439] Pruned state data                        nodes=6,931,741,625 size=1.95TiB    elapsed=8h11m58.825s
INFO [11-04|14:02:41.037] Compacting database                      range=0x00-0x10 elapsed="3.329µs"

yes

nohup geth snapshot prune-state --datadir /root/node/data/geth/chaindata &

This is the pruning command I used, but it didn't work. Is my directory path wrong?

barryz commented 2 years ago

@acswap Try to use this command: nohup geth snapshot prune-state --datadir /root/node/data. Hopefully, it can help you.
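
The distinction being made here (directory names follow the usual geth layout and are shown only as an illustration) is that --datadir should point at the node's data directory, one level above geth/chaindata:

  ls /root/node/data          # expected to contain the geth/ directory (and keystore/, if any)
  ls /root/node/data/geth     # chaindata/ lives in here
  nohup geth snapshot prune-state --datadir /root/node/data &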

kkbao13 commented 2 years ago

@acswap Try to use this command: nohup geth snapshot prune-state --datadir /root/node/data. Hopefully, it can help you.

I pruned, but the node size is still 2 TB. What is the command to start the node? I have to drop the 2 TB node. I downloaded the full snapshot, and my new command is: bsc --config ./config.toml --datadir ./node --http --diffsync -snapshot=false --gcmode=full --cache 8000 --rpc.allow-unprotected-txs --txlookuplimit 0

acswap commented 2 years ago

@acswap Try to use this command: nohup geth snapshot prune-state --datadir /root/node/data. Hopefully, it can help you.

I pruned, but the node size is still 2 TB. What is the command to start the node? I have to drop the 2 TB node. I downloaded the full snapshot, and my new command is: bsc --config ./config.toml --datadir ./node --http --diffsync -snapshot=false --gcmode=full --cache 8000 --rpc.allow-unprotected-txs --txlookuplimit 0

nohup geth \
  --config "./config.toml" \
  --datadir "./data" \
  --cache "36000" \
  --http.corsdomain "" \
  --http.vhosts "" \
  --ws.origins "*" \
  --diffsync \
  &

This is how I start