bnb-chain / bsc

A BNB Smart Chain client based on the go-ethereum fork
GNU Lesser General Public License v3.0

All BSC nodes are OFF SYNC #189

Closed ghost closed 3 years ago

ghost commented 3 years ago

Well, I have tried to sync my own node and failed. It has been syncing for a week already. OK, so I decided to buy access to a node on the internet.

I have tried ankr, getblock and quiknode so far, and they are ALL OFF SYNC!!!

Please don't tell me my hardware is weak or that I did something wrong. Just figure out what is going on, and fix it. A month ago everything was alright.

DefiDebauchery commented 3 years ago

Not sure what there is to fix. The block size and TPS have both increased (exponentially for TPS), and hardware that was sufficient a month ago is no longer able to keep up. Push these other services to improve their resources.

I had sync/lag issues with SSDs on multiple machines. I built a new node using NVMe (PCIe, not SATA mode) and have not had a single hiccup for the several days it's been running. I won't claim that there aren't optimizations that could be done, but the blockchain is IOPS-heavy, and you need hardware to support it.

ghost commented 3 years ago

Well, if you need PCIe NVMe instead of a regular SSD, it should be reflected in the user manual at least. I have seen two different user manuals on the official site, and none of them said anything about NVMe. And I have already bought 3 SSD servers.

DefiDebauchery commented 3 years ago

In their defense, the manual was written long (long) before IOPS became a limiting factor. But I definitely agree that the docs are a little stagnant as a whole.

sjors-lemniscap commented 3 years ago

After experimenting for the last week or so I can confirm that:

  • NVMe is required due to the insane amount of tx's and blocks. For memory / CPU, stick with the recommended specs from the docs
  • A node using the chaindata export provided by Binance has an extremely hard time catching up to the latest block. It goes way faster when doing a fast sync from scratch
  • The BSC geth fork contains a lot of inefficiencies compared to the Ethereum geth client. This might need some fixing by the devs in order to improve sync times and stability going forward
  • Increasing maxpeers to 2000 in the config.toml helps, as does updating the bootstrapnodes and staticnodes. In addition, you can set the --cache flag when starting geth to use the maximum amount of your system's memory

Hope this helps; syncing mainnet here took <48 hours and testnet <4 hours. The server is located in the EU.

EDIT: Scrolled through some other issues and people are curious how large a fast mainnet sync is on disk. Approx 191GB with the new v1.0.7-hf.1 geth version.

stakecube commented 3 years ago

Can confirm above, same results for us so far. Thanks for the summary @sjors-lemniscap. Could you share how many states you have near sync atm?

sjors-lemniscap commented 3 years ago

Can confirm above, same results for us so far. Thanks for the summary @sjors-lemniscap. Could you share how many states you have near sync atm?

eth.syncing is showing false since my node has been fully synced. Is there a way to show the PulledStates / KnownStates once a node is synced? Happy to show the output but I don't know the right command to retrieve this info.

stakecube commented 3 years ago

Ah okay. I don't think there is any command to check after full sync. Maybe someone else knows it. But I know it's "normal" to show false once fully synced.

This is our output/amount of states right now:

{
  currentBlock: 7214868,
  highestBlock: 7214942,
  knownStates: 89457658,
  pulledStates: 89443519,
  startingBlock: 7214379
}
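
(For reference, a minimal sketch of pulling output like the above from a running node's console; the IPC path here is an assumption and depends on your datadir.)

```
# Attach to the running node over IPC and print the sync status once.
# Adjust the IPC path to wherever your node writes geth.ipc (datadir-dependent).
geth attach /data/bsc/geth.ipc --exec 'eth.syncing'

# Or open an interactive console and type eth.syncing there:
geth attach /data/bsc/geth.ipc
```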

ChuckCaplan commented 3 years ago

I'm at 317 million known states with a node size of 187.5 GB syncing in fast mode. Hopefully I will be done soon.

a04512 commented 3 years ago

Can confirm above, same results for us so far. Thanks for the summary @sjors-lemniscap. Could you share how many states you have near sync atm?

Mine went from 562,191,528 to 592,176,122, and is now at 610,547,008.

gituser commented 3 years ago

After experimenting for the last week or so I can confirm that:

  • NVMe is required due to the insane amount of tx's and blocks. For memory / CPU, stick with the recommended specs from the docs
  • A node using the chaindata export provided by Binance has an extremely hard time catching up to the latest block. It goes way faster when doing a fast sync from scratch
  • The BSC geth fork contains a lot of inefficiencies compared to the Ethereum geth client. This might need some fixing by the devs in order to improve sync times and stability going forward
  • Increasing maxpeers to 2000 in the config.toml helps, as does updating the bootstrapnodes and staticnodes. In addition, you can set the --cache flag when starting geth to use the maximum amount of your system's memory

Hope this helps; syncing mainnet here took <48 hours and testnet <4 hours. The server is located in the EU.

EDIT: Scrolled through some other issues and people are curious how large a fast mainnet sync is on disk. Approx 191GB with the new v1.0.7-hf.1 geth version.

This is indeed the correct way to sync at the moment (don't use snapshots!). If your node is stuck syncing from the snapshot, stop the node, remove the node db and then sync from scratch with fast sync.
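
(A minimal sketch of that resync procedure; the service name, datadir and cache size below are assumptions, not taken from this thread.)

```
# Stop the node (however you run it), wipe the chain database, then fast-sync from scratch.
systemctl stop bsc                          # assumed service name

# removedb deletes the chaindata/state databases but keeps the keystore;
# it asks for confirmation before removing anything.
geth removedb --datadir /data/bsc           # assumed datadir

# Restart with fast sync (the default at the time) and a generous cache.
geth --config ./config.toml --datadir /data/bsc --syncmode fast --cache 8192
```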

Also try the upgrade_1.10.2 branch: there is a newer version 1.1.0 (which is based on the newer geth 1.10) and it seems to be working fine; you just need to comment out a few things in the config.toml, like GraphQLPort.

My bsc v1.1.0 instance with a 12GB cache, running on NVMe, got synced this way in about ~9 hrs. The whole chaindata now occupies only 188 GB, instead of the 720 GB I had before, when the sync was always stuck about 5-7K blocks behind the chain.
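
(A sketch of building that branch from source, assuming the usual go-ethereum style build; the repository URL matches the one linked later in this thread.)

```
# Build the bsc geth binary from the upgrade_1.10.2 branch.
git clone https://github.com/binance-chain/bsc.git
cd bsc
git checkout upgrade_1.10.2
make geth                                   # produces ./build/bin/geth
./build/bin/geth version
```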

ghost commented 3 years ago

May I ask how exactly the --cache flag should look if I want to give my node 16GB of cache?

gituser commented 3 years ago

@edgeofthegame

here is the config.toml I've used for the node that got synced in ~ 9hrs:

config.toml:

```
[Eth]
NetworkId = 56
SyncMode = "fast"
NoPruning = false
NoPrefetch = false
LightPeers = 1
UltraLightFraction = 75
TrieCleanCache = 256
TrieDirtyCache = 256
TrieTimeout = 500000000000
#TrieTimeout = 3600000000000
EnablePreimageRecording = false
EWASMInterpreter = ""
EVMInterpreter = ""
DatabaseCache = 12000

[Eth.Miner]
GasFloor = 30000000
GasCeil = 40000000
GasPrice = 1000000000
Recommit = 10000000000
Noverify = false

[Eth.TxPool]
Locals = []
NoLocals = true
Journal = "transactions.rlp"
Rejournal = 3600000000000
PriceLimit = 1000000000
PriceBump = 10
AccountSlots = 512
GlobalSlots = 10000
AccountQueue = 256
GlobalQueue = 5000
Lifetime = 10800000000000

#[Eth.GPO]
#Blocks = 20
#Percentile = 60

[Node]
IPCPath = "geth.ipc"
HTTPHost = "127.0.0.1"
NoUSB = true
InsecureUnlockAllowed = false
HTTPPort = 8575
HTTPVirtualHosts = ["localhost"]
HTTPModules = ["eth", "net", "web3", "txpool", "parlia"]
WSPort = 8546
WSModules = ["net", "web3", "eth"]
#GraphQLPort = 8557
GraphQLVirtualHosts = ["*"]

[Node.P2P]
MaxPeers = 1000
NoDiscovery = false
BootstrapNodes = ["enode://1cc4534b14cfe351ab740a1418ab944a234ca2f702915eadb7e558a02010cb7c5a8c295a3b56bcefa7701c07752acd5539cb13df2aab8ae2d98934d712611443@52.71.43.172:30311","enode://28b1d16562dac280dacaaf45d54516b85bc6c994252a9825c5cc4e080d3e53446d05f63ba495ea7d44d6c316b54cd92b245c5c328c37da24605c4a93a0d099c4@34.246.65.14:30311","enode://5a7b996048d1b0a07683a949662c87c09b55247ce774aeee10bb886892e586e3c604564393292e38ef43c023ee9981e1f8b335766ec4f0f256e57f8640b079d5@35.73.137.11:30311"]
StaticNodes = ["enode://f3cfd69f2808ef64838abd8786342c0b22fdd28268703c8d6812e26e109f9a7cb2b37bd49724ebb46c233289f22da82991c87345eb9a2dadeddb8f37eeb259ac@18.180.28.21:30311","enode://ae74385270d4afeb953561603fcedc4a0e755a241ffdea31c3f751dc8be5bf29c03bf46e3051d1c8d997c45479a92632020c9a84b96dcb63b2259ec09b4fde38@54.178.30.104:30311","enode://d1cabe083d5fc1da9b510889188f06dab891935294e4569df759fc2c4d684b3b4982051b84a9a078512202ad947f9240adc5b6abea5320fb9a736d2f6751c52e@54.238.28.14:30311","enode://f420209bac5324326c116d38d83edfa2256c4101a27cd3e7f9b8287dc8526900f4137e915df6806986b28bc79b1e66679b544a1c515a95ede86f4d809bd65dab@54.178.62.117:30311","enode://c0e8d1abd27c3c13ca879e16f34c12ffee936a7e5d7b7fb6f1af5cc75c6fad704e5667c7bbf7826fcb200d22b9bf86395271b0f76c21e63ad9a388ed548d4c90@54.65.247.12:30311","enode://f1b49b1cf536e36f9a56730f7a0ece899e5efb344eec2fdca3a335465bc4f619b98121f4a5032a1218fa8b69a5488d1ec48afe2abda073280beec296b104db31@13.114.199.41:30311","enode://4924583cfb262b6e333969c86eab8da009b3f7d165cc9ad326914f576c575741e71dc6e64a830e833c25e8c45b906364e58e70cdf043651fd583082ea7db5e3b@18.180.17.171:30311","enode://4d041250eb4f05ab55af184a01aed1a71d241a94a03a5b86f4e32659e1ab1e144be919890682d4afb5e7afd837146ce584d61a38837553d95a7de1f28ea4513a@54.178.99.222:30311","enode://b5772a14fdaeebf4c1924e73c923bdf11c35240a6da7b9e5ec0e6cbb95e78327690b90e8ab0ea5270debc8834454b98eca34cc2a19817f5972498648a6959a3a@54.170.158.102:30311","enode://f329176b187cec87b327f82e78b6ece3102a0f7c89b92a5312e1674062c6e89f785f55fb1b167e369d71c66b0548994c6035c6d85849eccb434d4d9e0c489cdd@34.253.94.130:30311","enode://cbfd1219940d4e312ad94108e7fa3bc34c4c22081d6f334a2e7b36bb28928b56879924cf0353ad85fa5b2f3d5033bbe8ad5371feae9c2088214184be301ed658@54.75.11.3:30311","enode://c64b0a0c619c03c220ea0d7cac754931f967665f9e148b92d2e46761ad9180f5eb5aaef48dfc230d8db8f8c16d2265a3d5407b06bedcd5f0f5a22c2f51c2e69f@54.216.208.163:30311","enode://352a361a9240d4d23bb6fab19cc6dc5a5fc6921abf19de65afe13f1802780aecd67c8c09d8c89043ff86947f171d98ab06906ef616d58e718067e02abea0dda9@79.125.105.65:30311","enode://bb683ef5d03db7d945d6f84b88e5b98920b70aecc22abed8c00d6db621f784e4280e5813d12694c7a091543064456ad9789980766f3f1feb38906cf7255c33d6@54.195.127.237:30311","enode://11dc6fea50630b68a9289055d6b0fb0e22fb5048a3f4e4efd741a7ab09dd79e78d383efc052089e516f0a0f3eacdd5d3ffbe5279b36ecc42ad7cd1f2767fdbdb@46.137.182.25:30311","enode://21530e423b42aed17d7eef67882ebb23357db4f8b10c94d4c71191f52955d97dc13eec03cfeff0fe3a1c89c955e81a6970c09689d21ecbec2142b26b7e759c45@54.216.119.18:30311","enode://d61a31410c365e7fcd50e24d56a77d2d9741d4a57b295cc5070189ad90d0ec749d113b4b0432c6d795eb36597efce88d12ca45e645ec51b3a2144e1c1c41b66a@34.204.129.242:30311","enode://bb91215b1d77c892897048dd58f709f02aacb5355aa8f50f00b67c879c3dffd7eef5b5a152ac46cdfb255295bec4d06701a8032456703c6b604a4686d388ea8f@75.101.197.198:30311","enode://786acbdf5a3cf91b99047a0fd8305e11e54d96ea3a72b1527050d3d6f8c9fc0278ff9ef56f3e56b3b70a283d97c309065506ea2fc3eb9b62477fd014a3ec1a96@107.23.90.162:30311","enode://4653bc7c235c3480968e5e81d91123bc67626f35c207ae4acab89347db675a627784c5982431300c02f547a7d33558718f7795e848d547a327abb111eac73636@54.144.170.236:30311","enode://c6ffd994c4ef130f90f8ee2fc08c1b0f02a6e9b12152092bf5a03dd7af9fd33597d4b2e2000a271cc0648d5e55242aeadd6d5061bb2e596372655ba0722cc704@54.147.151.108:30311","enode://99b07e9dc5f204263b87243146743399b2bd60c98f68d1239a3461d09087e6c417e40f1106fa606ccf54159feabdddb4e7f367559b349a6511e66e525de4906e@54.81.225.170:30311","enode://1479af5ea7bda822e8747d0b967309bced22cad5083b93bc6f4e1d7da7be067cd8495dc4c5a71579f2da8d9068f0c43ad6933d2b335a545b4ae49a846122b261@52.7.247.132:30311"]
ListenAddr = ":30311"
EnableMsgEvents = false

[Node.HTTPTimeouts]
ReadTimeout = 30000000000
WriteTimeout = 30000000000
IdleTimeout = 120000000000

[Node.LogConfig]
FilePath = "bsc.log"
MaxBytesSize = 104857600
Level = "info"
FileRoot = ""
```

The config directive for cache is: DatabaseCache = 12000

NOTE: geth actually takes a bit more memory than what you specify in --cache or DatabaseCache, e.g. it takes about 17GB right now with that setting for me, so make sure your VM has more memory or there is additional swap.
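
(To make the earlier --cache question concrete: the flag is specified in megabytes, so 16 GB of cache is roughly the invocation below; the config path is an assumption.)

```
# --cache is given in megabytes, so ~16 GB of cache is 16384.
# Expect the process to use noticeably more RAM than this, as noted above.
geth --config ./config.toml --cache 16384
```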

a04512 commented 3 years ago

here is the config.toml I've used for the node that got synced in ~ 9hrs: […]

Fast sync mode, and it got fully synced in 9 hours?

koen84 commented 3 years ago

Would probably be good indeed, for the overall health of the network, if the docs got updated and in particular clarified the demands on storage. If there's a significant number of nodes with subpar specs, they could affect the nodes they peer with.

gituser commented 3 years ago

@a04512 yes, fast sync from scratch in 9 hours, fully synced.

Here are my HW specs: i9-9900K, 2x NVMe 1TB in RAID1, mem: 24GB.

Although there is some other stuff running on the same machine, bsc is responsible for most of the I/O and CPU load.

zcrypt0 commented 3 years ago

CPU load seems to be a limiter too; i3.xlarge is the smallest AWS instance I've had success with.

Another thing, I noticed much better peering when setting up the AWS time sync service. One of my nodes went from no peers to enough to do a sync:

https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/set-time.html
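
(A sketch of what that time-sync setup typically looks like with chrony; package names and file paths vary by distro and are assumptions here, the linked AWS doc is the authoritative source.)

```
# Point chrony at the Amazon Time Sync Service (link-local 169.254.169.123).
sudo apt-get install -y chrony              # assumed Debian/Ubuntu host
echo 'server 169.254.169.123 prefer iburst minpoll 4 maxpoll 4' | sudo tee -a /etc/chrony/chrony.conf
sudo systemctl restart chrony
chronyc sources                             # verify the 169.254.169.123 source is selected
```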

a04512 commented 3 years ago

@gituser How many states are needed to be fully synced? I have 620M now and don't know how many are really needed.

afanasy commented 3 years ago

@a04512 The number of state entries differs depending on whether you restarted the node or not. @holiman explained why here: https://github.com/ethereum/go-ethereum/issues/14647#issuecomment-682098181

bellsovery commented 3 years ago

I'm using an i3.xlarge with a 1TB NVMe SSD. It's already been 7 days, but it stays 50~100 blocks behind.

Please give me some advice.

afanasy commented 3 years ago

@bellsovery You need more CPU. AWS vCPUs are not real CPUs, they are threads on multicore CPUs. xlarge = 4 vCPU = 4 threads = 2 CPU cores. It worked for me on i3en.2xlarge - synced in about 10 hours (was a week ago).

bellsovery commented 3 years ago

@afanasy Thanks. I will try on i3en.2xlarge

zcrypt0 commented 3 years ago

@bellsovery I have an i3.xlarge and i3.2xlarge synced to the tip. i3.2xlarge is faster and stays more consistently in sync, but the xlarge is working.

Make sure you are not using the ext4 filesystem for the NVMe mount. After I switched to xfs I had better perf.
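
(A sketch of moving the data volume to xfs; the device name and mount point are assumptions, and reformatting destroys whatever is on the device, so plan to copy or resync the chaindata afterwards.)

```
# WARNING: mkfs wipes the device. Back up or plan to resync the chaindata.
sudo mkfs.xfs -f /dev/nvme1n1               # assumed device name
sudo mkdir -p /data
sudo mount -o noatime /dev/nvme1n1 /data
# Make the mount persistent (using the raw device name here for brevity; a UUID is safer):
echo '/dev/nvme1n1 /data xfs noatime 0 0' | sudo tee -a /etc/fstab
```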

bellsovery commented 3 years ago

@zcrypt0 Oh, really? I was using the ext4 filesystem. Thanks for your advice!

gobiyoga commented 3 years ago

What peer counts are you guys getting? I can't seem to get my node above 18, but it seems to be syncing fast (full chain in 3.5 hours using fast sync) onto an NVMe drive.

easeev commented 3 years ago

What peer counts are you guys getting? I can't seem to get my node above 18, but it seems to be syncing fast (full chain in 3.5 hours using fast sync) onto an NVMe drive.

From 17 to 64 on different nodes with --maxpeers 200

bellsovery commented 3 years ago

What peer counts are you guys getting? I can't seem to get my node above 18, but it seems to be syncing fast (full chain in 3.5 hours using fast sync) onto an NVMe drive.

My node has 618 peers

a04512 commented 3 years ago

@afanasy But it seems to never end on BSC; mine is nearly 700M. Ethereum has 12M blocks with 800M states to be fully synced.

afanasy commented 3 years ago

@a04512 It means your node is too slow and can't catch up, so it just keeps downloading state entries. You need better hardware (more CPU power, faster storage). On proper hardware a BSC node syncs in fast sync mode (the default mode) in about 10 hours, taking 170GB of storage space and downloading about 300M state entries (in one continuous run, without node restarts). Also make sure you are using bsc geth v1.0.7-hf.1 or higher, because otherwise it will start consuming a lot of space (after the sync finishes), see https://github.com/binance-chain/bsc/pull/190.
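
(A quick sketch of checking which build a node is actually running; the IPC path is an assumption.)

```
geth version                                                  # version of the binary on disk
geth attach /data/bsc/geth.ipc --exec 'admin.nodeInfo.name'   # version of the running node
```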

afanasy commented 3 years ago

@gobiyoga 3.5 hours is a fantastic result, are you sure it is fully synced, with all state entries downloaded, not just blocks? And with correct genesis block (low peer count may indicate wrong genesis block)?

a04512 commented 3 years ago

@afanasy
[iostat screenshots: Snipaste_2021-05-10_14-42-31, Snipaste_2021-05-10_14-41-36]

This is my iostat data. I don't know if I need better hardware, because it doesn't seem to be using 100% of the hardware resources at present. My geth is 10.10.3 and it now takes 233GB.

afanasy commented 3 years ago

@a04512 Yes, it definitely makes no sense to upgrade hardware if you can't clearly see the bottleneck. iostat looks good; what does top say about load average and the CPUs (press 1 for the CPU list)? There is some swap in use; maybe it is hitting disk swap too much, which slows things down?
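
(A sketch of the checks being suggested here, using standard Linux tools only.)

```
top -bn1 | head -n 20                           # load average and top processes (interactively, press 1 for per-CPU view)
free -h                                         # RAM vs. swap actually in use
swapon --show                                   # which swap devices exist and how full they are
dmesg -T | grep -iE 'out of memory|oom-kill'    # whether the kernel has been killing geth
```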

a04512 commented 3 years ago

@afanasy The swap is there to prevent the geth process from being killed by the OOM killer, and my average CPU load is 40% (Google Cloud c2-standard-4: 4 vCPU, 16 GB memory).

gobiyoga commented 3 years ago

@gobiyoga 3.5 hours is a fantastic result, are you sure it is fully synced, with all state entries downloaded, not just blocks? And with correct genesis block (low peer count may indicate wrong genesis block)?

Looking into it, I realized that not all state entries were downloaded and the blockchain was constantly behind.

Now I've compiled upgrade_1.10.2 with Go 1.15.12 and passed the NVMe drive (formatted with xfs) directly through to the VM, avoiding VirtIO as a potential bottleneck. It's been running for 4 hours without interruptions, and here is where my sync is at:

eth.syncing
{
  currentBlock: 6883878,
  highestBlock: 7262583,
  knownStates: 53416862,
  pulledStates: 53354484,
  startingBlock: 0
}

afanasy commented 3 years ago

@a04512 Swap may be the problem. If the OS is using disk instead of memory, it will slow things down. So you may want to reduce the geth cache (the --cache=4096 command line option or DatabaseCache = 4096) until oom stops killing it. Also I had issues with a similar cloud config on AWS (i3.xlarge with 4x 2.3 GHz vCPUs), so I ended up using i3en.2xlarge with 8x 3.1 GHz vCPUs, and it worked. But @zcrypt0 said above that it works for him on i3.xlarge with 4 vCPUs (maybe because he is using xfs instead of the default ext4).

afanasy commented 3 years ago

@gobiyoga

Looking into it, I realized that not all state entries were downloaded and the blockchain was constantly behind.

Yes. When the geth log says the top block is 3m ago you might think the sync is done, but in reality the state entries are not downloaded yet, and it will still take hours to complete. It is very confusing when you are doing it for the first time.

farmer69420 commented 3 years ago

What peer counts are you guys getting? I can't seem to get my node above 18, but it seems to be syncing fast (full chain in 3.5 hours using fast sync) onto an NVMe drive.

From 17 to 64 on different nodes with --maxpeers 200

you need to allow incoming traffic to your node on port 30311 to get more peers
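
(A sketch of opening that port with ufw; substitute your own firewall or security-group tooling. Port 30311 matches the ListenAddr in the config above.)

```
# Allow inbound BSC p2p traffic on both TCP and UDP.
sudo ufw allow 30311/tcp
sudo ufw allow 30311/udp
sudo ufw status verbose
```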

bellsovery commented 3 years ago

I'm using i3.2xlarge with 8 vCPUs/1.9TB NVMe/64GB RAM. It's been 6 hours and current CPU load is 700%, peers count is 998. Here is the syncing status.

{
  currentBlock: 6887723,
  highestBlock: 7259268,
  knownStates: 259352216,
  pulledStates: 259352216,
  startingBlock: 0
}

billyriantono commented 3 years ago

I'm using i3.2xlarge with 8 vCPUs/1.9TB NVMe/64GB RAM. It's been 6 hours and current CPU load is 700%, peers count is 998. Here is the syncing status.

{
  currentBlock: 6887723,
  highestBlock: 7259268,
  knownStates: 259352216,
  pulledStates: 259352216,
  startingBlock: 0
}

How much will it cost monthly 😃 @bellsovery

ghost commented 3 years ago

[screenshot] BscScan is OFF Sync 😂😂😂

ghost commented 3 years ago

On the bright side, NVMe seems to fix the problem. My home PC with a 1TB NVMe was able to sync from genesis in 48 hours.

bellsovery commented 3 years ago

I'm using i3.2xlarge with 8 vCPUs/1.9TB NVMe/64GB RAM. It's been 6 hours and current CPU load is 700%, peers count is 998. Here is the syncing status.

{
  currentBlock: 6887723,
  highestBlock: 7259268,
  knownStates: 259352216,
  pulledStates: 259352216,
  startingBlock: 0
}

My node is importing 1~2 blocks per second, and it's 30,000 blocks behind. Not sure whether it will be able to get fully synced or not. Very slow speed.

afanasy commented 3 years ago

@bellsovery It should be able to: it needs to finish downloading the state entries, and then it will catch up on blocks. Once it stops mentioning state entries in the log and automatically switches to full sync mode, the fast sync is done. But if, after the fast sync is done, it still can't catch up with the incoming stream of blocks + txs, then it's a problem.
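
(A sketch of watching for that transition in the log; the bsc.log file name comes from the config above, and the "Imported new state entries" message is the upstream geth fast-sync state download line, assumed to be unchanged in this fork.)

```
# While these lines keep appearing, the fast-sync state download is still running:
tail -f bsc.log | grep --line-buffered 'Imported new state entries'

# Once they stop and only "Imported new chain segment" lines remain,
# eth.syncing should report false shortly after the node reaches the chain head.
```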

bellsovery commented 3 years ago

@afanasy There is no state downloading now, only block downloading. It seems the fast sync is done.

Here is the bsc.log

t=2021-05-10T14:16:43+0000 lvl=info msg="Imported new chain segment"          blocks=8   txs=3458 mgas=425.777 elapsed=8.010s    mgasps=53.149 number=7265786 hash=0x88b38a14af4e90f1928d1d4a04ab763119844b8d8450227efed887dc22a6a73b age=1d5h6m  dirty="508.72 MiB"
t=2021-05-10T14:16:51+0000 lvl=info msg="Imported new chain segment"          blocks=11  txs=3242 mgas=450.416 elapsed=8.514s    mgasps=52.900 number=7265797 hash=0xd5f1078415125c772c1c4917446ecd02e470863b09c6dedbefe424aa869ec37e age=1d5h5m  dirty="528.40 MiB"
t=2021-05-10T14:17:00+0000 lvl=info msg="Imported new chain segment"          blocks=10  txs=4500 mgas=475.948 elapsed=8.410s    mgasps=56.590 number=7265807 hash=0x91630b892663514d6cf9fe0e31941683d8d1e0d38aa5800ba416a0400f624154 age=1d5h5m  dirty="538.82 MiB"
t=2021-05-10T14:17:00+0000 lvl=info msg="Deep froze chain segment"            blocks=69  elapsed=74.792ms  number=7175807 hash=0x97e9960018df4d6374522880c8add577d4b0e711e7bf652d16432dc4db2ef73f
t=2021-05-10T14:17:08+0000 lvl=info msg="Imported new chain segment"          blocks=10  txs=3471 mgas=469.414 elapsed=8.289s    mgasps=56.629 number=7265817 hash=0x5276be130bc0179fbd5673a1b9097196c6a90e7097d6b3e724ff9808f6cbf10f age=1d5h5m  dirty="550.63 MiB"
t=2021-05-10T14:17:17+0000 lvl=info msg="Imported new chain segment"          blocks=11  txs=3631 mgas=476.068 elapsed=8.756s    mgasps=54.369 number=7265828 hash=0x348e9f94ea04e55f40883ea7990c652303ae89c24ff2c30b6fd7b493ddd1fb09 age=1d5h4m  dirty="560.36 MiB"
t=2021-05-10T14:17:25+0000 lvl=info msg="Imported new chain segment"          blocks=10  txs=3224 mgas=445.272 elapsed=8.212s    mgasps=54.221 number=7265838 hash=0x97438870103314c16ea5278f339a8c28635da2bf0f13bc35f2c22bd5c281b2e9 age=1d5h4m  dirty="569.40 MiB"
t=2021-05-10T14:17:34+0000 lvl=info msg="Imported new chain segment"          blocks=11  txs=3525 mgas=481.614 elapsed=8.503s    mgasps=56.639 number=7265849 hash=0xf3036d82a2da7149fe444bcc7a78cdb4b98de3187af5be482b1dbf9a8f0c309d age=1d5h3m  dirty="581.16 MiB"
t=2021-05-10T14:17:42+0000 lvl=info msg="Imported new chain segment"          blocks=10  txs=2991 mgas=444.571 elapsed=8.191s    mgasps=54.275 number=7265859 hash=0xaf89f8a76cdfc8af5c319db91b142a4f36d52af773306ca5d247ce5b2d5cca1c age=1d5h3m  dirty="588.26 MiB"
t=2021-05-10T14:17:50+0000 lvl=info msg="Imported new chain segment"          blocks=10  txs=3499 mgas=451.084 elapsed=8.224s    mgasps=54.848 number=7265869 hash=0x28fe78585d9c1c96fd4571965f9436d0e0901ede6c993b660452384c76b4f879 age=1d5h3m  dirty="599.04 MiB"

Here is the eth.syncing output:

{
  currentBlock: 7265974,
  highestBlock: 7294768,
  knownStates: 259605097,
  pulledStates: 259605097,
  startingBlock: 7262720
}

afanasy commented 3 years ago

@bellsovery Yes, it seems it is in full sync mode now. It looks like the fast sync didn't go to the tip of the chain and switched to full sync earlier for some reason. However, from the log you can see that it is processing 10 blocks per 8 sec, i.e. about 30 seconds' worth of chain (at a 3-second block time) per 8 seconds of wall time, so it is syncing faster than the chain grows and should be able to catch up, theoretically. In practice I'd leave this one syncing (in case it is still able to finish) and try all over again on i3en.2xlarge (faster CPU) with xfs, using the AWS time sync service https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/set-time.html, and bsc geth 1.0.7-hf.2 (it worked for me 1 week ago).
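
(A rough back-of-the-envelope version of that catch-up estimate; the 3-second block time and the ~10 blocks per 8 s import rate come from the thread, the block numbers from the eth.syncing output above.)

```
awk 'BEGIN {
  import_rate = 10 / 8                  # blocks imported per second (from the log above)
  chain_rate  = 1 / 3                   # new blocks produced per second (3 s block time)
  behind      = 7294768 - 7265974       # highestBlock - currentBlock from eth.syncing above
  hours       = behind / (import_rate - chain_rate) / 3600
  printf "roughly %.1f hours to catch up at this rate\n", hours
}'
```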

bellsovery commented 3 years ago

@afanasy Great advice! Thanks!

gobiyoga commented 3 years ago

What peer counts are you guys getting? I can't seem to get my node above 18, but it seems to be syncing fast (full chain in 3.5 hours using fast sync) onto an NVMe drive.

From 17 to 64 on different nodes with --maxpeers 200

you need to allow incoming traffic to your node on port 30311 to get more peers

I have port 30311 TCP/UDP opened but my peer count still keeps dropping, currently sitting at 17. NTP is enabled and syncing with my GPS-based NTP server as well as some external NTP sources.

My node is using 93.7GB/96GB RAM and iostat is showing me that nothing much is going on, so I'm guessing the state entries are still being synced. [screenshot]

ghost commented 3 years ago

I finally made it! A full sync from genesis took 21 hours this time. Server: AMD Ryzen 7 3700X 8-Core, 64GB RAM, 1TB NVMe formatted with xfs (as zcrypt0 recommended). I used the config.toml provided by gituser and the upgrade_1.10.2 branch of this repo.

Final stats:

instance: Geth/v1.1.0-stable-d3313443/linux-amd64/go1.15.3
at block: 0 (Mon Apr 20 2020 16:46:54 GMT+0300 (MSK))
modules: eth:1.0 net:1.0 personal:1.0 rpc:1.0 txpool:1.0 web3:1.0

To exit, press ctrl-d

net.peerCount
200

eth.syncing
{
  currentBlock: 7300590,
  highestBlock: 7300591,
  knownStates: 278562927,
  pulledStates: 278562927,
  startingBlock: 7300584
}

eth.blockNumber
7300591

eth.syncing
false

Current size: 187GB

zcrypt0 commented 3 years ago

An update on the i3.xlarge node: it is still working well and consistently staying synced, but its average block lag (measured by the soonest a client can receive a new block vs. the block timestamp) is higher than the average block lag of the i3.2xlarge nodes.

I did some testing with random read/write disk speeds today and noticed this:

i3.2xlarge: 590MiB/s read, 150MiB/s write

i3.4xlarge (w/ software RAID 0 on 2 nvme drives): 690MiB/s read, 175 MiB/s write

i3en.2xlarge: 420MiB/s read, 105MiB/s write

i3en.2xlarge (w/ software RAID 0 on 2 nvme drives): 470MiB/s read, 120MiB/s write

So the i3 instances achieve significantly better random read/write speeds than the i3en instances. The introduction of a software raid0 did not improve the speeds enough to be comparable to the i3 instances.

I haven't tried actually syncing the i3en instance, so the faster CPUs (as reported by AWS) may be helpful even with the lower disk speeds.
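
(A sketch of a random read/write benchmark of the kind described above, using fio; the thread doesn't say which tool was used, so the exact parameters are assumptions. Run it against the NVMe mount, not the root disk.)

```
# 4k random read/write mix against the data volume.
sudo apt-get install -y fio                 # assumed Debian/Ubuntu host
fio --name=randrw-test --directory=/data \
    --rw=randrw --rwmixread=75 --bs=4k \
    --ioengine=libaio --direct=1 --iodepth=64 --numjobs=4 \
    --size=4G --runtime=60 --time_based --group_reporting
```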

koen84 commented 3 years ago

Unless you're on crap CPU, the IOPS of storage matter much more.

@a04512 16GB RAM seems potentially low imho.

zhongfu commented 3 years ago

Adding another data point here:

Intel NUC8i5BEK

Fast sync completed in ~8h 40min. Final disk space usage was ~180GiB. 291,868,354 state trie entries imported, with block height at 7,311,851 when it switched to full sync mode. Right now, it's just generating the state snapshot, but is otherwise keeping up with the head of the chain.

I've had two or three past failed attempts (fast sync, full sync from 2021-05-02 snapshot) on slightly differing hardware (1TB PM981, 2x2TB Micron 5100 ECO SATA in ZFS mirror), but the common denominator was that I tried to use ZFS (ashift=12/13, compression=off, noatime). Switching to XFS for this attempt allowed me to catch up with head really quickly.