[Bug] Delay in node -> farmer communication

bhorvitz commented 1 year ago

What happened?

Normally, full_node will finish a SP and pass it to the farmer in a few ms. On occasion, this interval can be many seconds. Passed down the line to a harvester which might take 3-4 seconds to finish a lookup, it can possibly overlap the next SP, causing spikes in lookup times since that harvester will then be trying to process more than one thing at a time. This is particularly problematic with GPU farming.

In the log output below, you can see normal intervals and then the 07:14:05.336 SP takes 3.2s and 07:14:22.934 takes 5.4s.

This is easily reproducible. It happens every few minutes.

Version

1.8.2rc6.dev115

What platform are you using?

Linux

What ui mode are you using?

CLI

Relevant log output

2023-07-09T07:13:36.739 full_node chia.full_node.full_node: INFO     ⏲️  Finished signage point 55/64: CC: 465dc7bb03c6a540442e2c12106b5f7b99ea890f8cf74d1e67d0d04755177a6e RC: cf7a61f89a1da221b84ef12e2f49fca678f9555410ab24450bf683aedaee5802 
2023-07-09T07:13:36.742 farmer farmer_server              : DEBUG    <- new_signage_point from peer be74a331abf2c88836ea0660f437bff4d4f2236d3039ebd33262e03881b7c94e 127.0.0.1

2023-07-09T07:13:47.768 full_node chia.full_node.full_node: INFO     ⏲️  Finished signage point 56/64: CC: 0f94eb208e109899219b547ed0d98199e854a7d9c2a77ffa965f914a46f66bfc RC: 4c4ddad2c1f527a7b83dcbc1d2d200d8526fb189d4ce2cda4dc18aaf1eacda76 
2023-07-09T07:13:48.088 farmer farmer_server              : DEBUG    <- new_signage_point from peer be74a331abf2c88836ea0660f437bff4d4f2236d3039ebd33262e03881b7c94e 127.0.0.1

2023-07-09T07:13:56.538 full_node chia.full_node.full_node: INFO     ⏲️  Finished signage point 57/64: CC: 964304506803fad3b7763dbff13242fadf41525eb8e39eac7fee71cc93d7ed4a RC: 19c0d72c02bbe4f4e316382cb49a15ce28172502512c0796977ee15bb871a402 
2023-07-09T07:13:56.541 farmer farmer_server              : DEBUG    <- new_signage_point from peer be74a331abf2c88836ea0660f437bff4d4f2236d3039ebd33262e03881b7c94e 127.0.0.1

2023-07-09T07:14:05.336 full_node chia.full_node.full_node: INFO     ⏲️  Finished signage point 58/64: CC: 3d6650b46bf8399a87fe9e1a7bad21ca1b0899f39f7a233610b140f9d2a45cdc RC: eb5ce41c50cf47695e9851eb02efff93f15fee6521e4aa424255a8eb4711795e 
2023-07-09T07:14:08.518 farmer farmer_server              : DEBUG    <- new_signage_point from peer be74a331abf2c88836ea0660f437bff4d4f2236d3039ebd33262e03881b7c94e 127.0.0.1

2023-07-09T07:14:14.170 full_node chia.full_node.full_node: INFO     ⏲️  Finished signage point 59/64: CC: 5549b7ce4ea54daafffb831514e49ce90aa930f5f3381edfcd6b007eef79d376 RC: 42162b8fd62115131882f77b7a17026219e696bed0d70a15c6f98fbb24f70cc5 
2023-07-09T07:14:14.173 farmer farmer_server              : DEBUG    <- new_signage_point from peer be74a331abf2c88836ea0660f437bff4d4f2236d3039ebd33262e03881b7c94e 127.0.0.1

2023-07-09T07:14:22.934 full_node chia.full_node.full_node: INFO     ⏲️  Finished signage point 60/64: CC: 71fbe9e1815680ee7466c0c7dde378f802fadf23c2c86787cb91f1c632fb9a53 RC: 0513c678483f1bc030ad7ac174541aef10d847587f5c3ceb0935f8e98ae84e21 
2023-07-09T07:14:28.395 farmer farmer_server              : DEBUG    <- new_signage_point from peer be74a331abf2c88836ea0660f437bff4d4f2236d3039ebd33262e03881b7c94e 127.0.0.1

2023-07-09T07:14:32.030 full_node chia.full_node.full_node: INFO     ⏲️  Finished signage point 61/64: CC: 0052772e288ddcf9aa832bb8d8c64f53f2991a557809f7b070a3e1cda7f16f7b RC: 6482d75cb561a40f5f56c072c09612bb83b4e3303535df3cb7951a73b8b2a220 
2023-07-09T07:14:32.033 farmer farmer_server              : DEBUG    <- new_signage_point from peer be74a331abf2c88836ea0660f437bff4d4f2236d3039ebd33262e03881b7c94e 127.0.0.1

bhorvitz commented 1 year ago

Confirmed the same exists in 2.0.0b5.dev76 (alpha 4.6) 5.3 second internal delay:

2023-07-12T15:04:00.186 full_node chia.full_node.full_node: INFO     ⏲️  Finished signage point 3/64: CC: bba42e749020ad525d999701c8b5eb5ab0ccc97305367d0da5422896457f967c RC: 5b7d27b9a2af00f08ce6b3b04c73a84d0a823bd9e7c723e1f769dfe11847d3c0
2023-07-12T15:04:05.416 farmer farmer_server              : DEBUG    <- new_signage_point from peer be74a331abf2c88836ea0660f437bff4d4f2236d3039ebd33262e03881b7c94e 127.0.0.1

bhorvitz commented 1 year ago

Per wallentx, adding some context to the environment: This node is a single Xeon 6134 with 128G RAM running Cent9. OS and DB are on a RAID 10 of 12G SAS SSD. The tasks this system is performing are running a full node, wallet, and farmer. There is some stuff going on to the side like Prometheus and Philip Norman's chia-monitor for monitoring, syslog, and the usual OS stuff.

All harvesting is done remotely on a single harvester of dual Xeon 8272 with 256G RAM and a 3080Ti also running Cent9 doing GPU harvesting of about 6.5PiB of C5s, however as mentioned, this problem seems to occur well before it reaches out to the harvester.

bhorvitz commented 1 year ago

I was able to determine that this delay was being caused by chia-monitor. Disabling it caused all of the internal node->farmer delay to go away.

I'll leave this open for someone else to decide to close or not. chia-monitor is no longer maintained however it is not doing anything other than making rpc calls, so should that really impact node operations?

wjblanke commented 1 year ago

hmmm... no idea but maybe someone else will see this and know about chia-monitor

github-actions[bot] commented 1 year ago

This issue has not been updated in 14 days and is now flagged as stale. If this issue is still affecting you and in need of further review, please comment on it with an update to keep it from auto closing in 7 days.

Chia-Network / chia-blockchain