Chia-Network / chia-blockchain

Chia blockchain python implementation (full node, farmer, harvester, timelord, and wallet)
Apache License 2.0

[Bug] Increased CPU usage and Higher Partial Times #15836

Open PhillWalker opened 1 year ago

PhillWalker commented 1 year ago

What happened?

Since the Chia Compression 4.3 alpha through to the current 2.0 RC1, partial times have increased noticeably, causing lots of lates on lower-end equipment.

LebanonJon has been able to verify that CPU usage has increased since then, and WallentX has also acknowledged this increased resource usage for non-compressed plots and wonders whether it has also impacted compression.

Version

2.0 RC1

What platform are you using?

Linux

What ui mode are you using?

CLI

Relevant log output

No response

PhillWalker commented 1 year ago
Screenshot 2023-07-22 at 12 24 22

You can see the sudden rise in partial times.

The previous version running in the screenshot was the last compression_fc branch build; the version where the jump happened was 2.0 Beta5 from the main branch.

The issue still persists, based on early data from 2.0 RC1.

Coyote-UK commented 1 year ago

I can confirm I have seen this same issue using my Windows GUI harvester: Windows 10, 2,453 C5 K32 plots, i7-4500U processor, 16GB DDR3 memory, SH920 mSATA drive for the mainnet db. Logfile for 2.0.0b5 attached, covering startup of the node/farmer/harvester and 6 hours of farming, including several lates. debug.log

PhillWalker commented 1 year ago

It is worth noting that this is not the config.yaml spelling issue.

These are freshly generated config.yaml files for RC1, using the same values as before.

PhillWalker commented 1 year ago
Screenshot 2023-07-24 at 12 36 28

Lates have been occurring since the upgrade. These lookups were all 25-30 seconds previously, with no lates.

Coyote-UK commented 1 year ago

I verified, in the settings tab of the GUI, that the correct decompressor settings were being read.

Coyote-UK commented 1 year ago

My setup had fully functional decompressor (previously decompresser) settings.

I do not believe this issue is resolved.

PhillWalker commented 1 year ago

I concur; my settings are correct too, and the issue is far from resolved.

harold-b commented 1 year ago

Apologies I did not see this sooner; I have my GitHub notifications messed up due to a mail filter.

Thanks for the report. Any chance you can run some tests directly with the bladebit executable in simulate --power n mode, using the same --seed on both runs, from the tags where you first saw the regression happen?

PhillWalker commented 1 year ago

Unfortunately, the hardware seeing the worst of the issues (with the lates) is a 2014 Synology, and it does not support the newer versions of simulate:

Fatal Error: getrandom syscall failed with error 38.
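
For what it's worth, error 38 on Linux is ENOSYS ("Function not implemented"): the getrandom() syscall was only added in kernel 3.17, so an older Synology kernel simply does not provide it. A quick way to confirm what that errno means:

```python
import errno
import os

# errno 38 on Linux is ENOSYS: the running kernel does not implement the
# requested syscall. getrandom() only exists in Linux 3.17+, so older NAS
# kernels report exactly this error.
print(errno.ENOSYS)               # 38
print(os.strerror(errno.ENOSYS))  # "Function not implemented"
```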

Coyote-UK commented 1 year ago

I plotted on a much more powerful Ubuntu machine than my power-efficient Windows box with its mobile CPU.

I will have to attach a suitable SSD to house a plot for a meaningful simulate, so I will get back to you as soon as I can.

Coyote-UK commented 1 year ago

Sorry for the delay. Here are the stats.

Let me know if you need anything else.

I am using 2 threads and 2 parallel decompressors, as that was found to give the fewest stale partials on my system, so it's what I normally use for farming.

Choosing size 200TB to roughly match my farm

bladebit_cuda alpha 4.3

bladebit_cuda-a4-3 -t 2 simulate --no-cuda -n 100 -p 2 --power 360 --size 200TB --seed e11c75d4ef981ed097bf40452de79eec8c4ce836236f0474573aae143f2fd7fd plot-k32-c05-Test.plot

[Simulator for harvester farm capacity for K32 C5 plots]
Random seed: 0xe11c75d4ef981ed097bf40452de79eec8c4ce836236f0474573aae143f2fd7fd
Simulating...

Context count                  : 2
Thread per context instance    : 2
Memory used                    : 167.0MiB ( 0.2GiB )
Proofs / Challenges            : 146 / 76 ( 192.11% )
Fetches / Challenges           : 92 / 76
Filter bits                    : 512
Effective partials             : 1 ( 1.09% )
Total fetch time elapsed       : 30.000 seconds
Average plot lookup time       : 0.326 seconds
Worst plot lookup lookup time  : 6.282 seconds
Average full proof lookup time : 6.282 seconds
Fastest full proof lookup time : 6.282 seconds

compression | plot count | size TB | size PB

C5 | 12561 | 1096 | 1.10

bladebit_cuda alpha 4.4

bladebit_cuda -t 2 simulate --no-cuda -n 100 -p 2 --power 360 --size 200TB --seed e11c75d4ef981ed097bf40452de79eec8c4ce836236f0474573aae143f2fd7fd plot-k32-c05-Test.plot

[Simulator for harvester farm capacity for K32 C5 plots]
Random seed: 0xe11c75d4ef981ed097bf40452de79eec8c4ce836236f0474573aae143f2fd7fd
Simulating...

Context count                  : 2
Thread per context instance    : 2
Memory used                    : 167.0MiB ( 0.2GiB )
Proofs / Challenges            : 146 / 76 ( 192.11% )
Fetches / Challenges           : 92 / 76
Filter bits                    : 512
Effective partials             : 1 ( 1.09% )
Total fetch time elapsed       : 30.865 seconds
Average plot lookup time       : 0.335 seconds
Worst plot lookup lookup time  : 5.477 seconds
Average full proof lookup time : 5.477 seconds
Fastest full proof lookup time : 5.477 seconds

compression | plot count | size TB | size PB

C5 | 12208 | 1065 | 1.07

PhillWalker commented 1 year ago

I have tried working with empower to do some troubleshooting...

The issue persists when changing the version of Chiapos used, on a cloned repository of the 2.0 release branch.

Versions tried

chiapos 2.0.0rc1, 2.0.0b5, 2.0.0b4, 2.0.0b3

2.0.0b1 and 2.0.0b2 will not run, as they display this error:

2023-08-04T08:04:07.159 harvester chia.plotting.manager : ERROR refresh callback raised: 'chiapos.DiskProver' object has no attribute 'get_compression_level' with the traceback: Traceback (most recent call last):
  File "/chia-blockchain/chia/plotting/manager.py", line 241, in refresh_task
    self._refresh_callback(PlotRefreshEvents.batch_processed, batch_result)
  File "/chia-blockchain/chia/harvester/harvester.py", line 162, in _plot_refresh_callback
    self.plot_sync_sender.process_batch(update_result.loaded, update_result.remaining)
  File "/chia-blockchain/chia/plot_sync/sender.py", line 285, in process_batch
    converted = _convert_plot_info_list(loaded)
  File "/chia-blockchain/chia/plot_sync/sender.py", line 48, in _convert_plot_info_list
    compression_level=plot_info.prover.get_compression_level(),
AttributeError: 'chiapos.DiskProver' object has no attribute 'get_compression_level'

The same happens with 1.0.12b16, which was the last known working version as far as we know, used in the Alpha 4.3 build.

Alpha 4.3: chiapos=1.0.12b16
Alpha 4.4 (broken release): chiapos=2.0.0b1
Alpha 4.5: chiapos=2.0.0b2

It could be worth comparing what changed between chiapos 1.0.12b16 and 2.0.0b1, assuming the bug was introduced in the change up to the new major version, and if chiapos is even at fault.

We also ran some tests switching to blspy==1.0.16, but the results were the same.
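
For anyone comparing those chiapos versions, the traceback above is just an API mismatch: the 2.0 plot-sync code calls prover.get_compression_level() unconditionally, and the older DiskProver builds do not expose that method. A minimal sketch of how such a call could be guarded (a hypothetical illustration, not chia-blockchain's actual code; the fallback value of 0 is an assumption):

```python
# Sketch only: treat a DiskProver that predates the compression API as an
# uncompressed (level 0) plot instead of raising AttributeError.
def plot_compression_level(prover) -> int:
    getter = getattr(prover, "get_compression_level", None)
    return getter() if callable(getter) else 0
```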

Coyote-UK commented 1 year ago

Update (image attached):

After updating my rig to 2.0.0rc3, these lines have appeared in my log file (as they have done for several versions):

2023-08-05T16:05:49.900 harvester chia.harvester.harvester: WARNING Looking up qualities on N:\C5\plot-k32-c05-2023-06-03-05-06-3fb805aaed23fcbd6629df024ba4268426f3aab4053142dfd8ace9bda4827702.plot took: 60.26634669303894. This should be below 5 seconds to minimize risk of losing rewards.

2023-08-05T16:05:57.574 harvester chia.harvester.harvester: WARNING Looking up qualities on S:\C5\plot-k32-c05-2023-06-21-04-12-648f1cf9f7deb68b697ce5e7616b357a3ae6720a27b1755d412ed631658caef0.plot took: 59.11319851875305. This should be below 5 seconds to minimize risk of losing rewards.

2023-08-05T16:21:46.164 harvester chia.harvester.harvester: WARNING Looking up qualities on M:\C5\plot-k32-c05-2023-05-31-11-20-aaf6dd611b6ed4c9b603d1169e27e736301a7bc0c548b8427750a694b1b2c483.plot took: 55.353450536727905. This should be below 5 seconds to minimize risk of losing rewards.

2023-08-05T16:23:32.977 harvester chia.harvester.harvester: WARNING Looking up qualities on U:\C5\plot-k32-c05-2023-06-23-16-14-880df202463b5b3fb4f95e5825912ab7d98d206c3a1e6e015dc8500a3b5a1412.plot took: 105.28022837638855. This should be below 5 seconds to minimize risk of losing rewards.

2023-08-05T16:23:54.091 harvester chia.harvester.harvester: WARNING Looking up qualities on R:\C5\plot-k32-c05-2023-06-15-11-03-b383db6587781130d747fea28beeae30818fad89c8b2520840b5adaa69337458.plot took: 79.3622350692749. This should be below 5 seconds to minimize risk of losing rewards.

2023-08-05T16:25:08.232 harvester chia.harvester.harvester: WARNING Looking up qualities on J:\C5\plot-k32-c05-2023-07-28-14-14-09edcffacc140e67e7410808d33299100a2388041c69903d07b2657d67ae193d.plot took: 43.46801948547363. This should be below 5 seconds to minimize risk of losing rewards.

2023-08-05T16:27:15.948 harvester chia.harvester.harvester: WARNING Looking up qualities on K:\C5\plot-k32-c05-2023-05-26-06-18-2103493ca705752950c6f83b2f40d5897ecd1338b49f8c88a152774290f4c20c.plot took: 41.67232918739319. This should be below 5 seconds to minimize risk of losing rewards.

2023-08-05T16:32:36.251 harvester chia.harvester.harvester: WARNING Looking up qualities on O:\C5\plot-k32-c05-2023-06-06-07-06-63316b75708e944ced5e578ba028ec763c338c6a8378cef7139004f16fe7df48.plot took: 44.853880643844604. This should be below 5 seconds to minimize risk of losing rewards.

I was watching the Windows Task Manager for most of this time, and at no point did the CPU reach 100% utilisation for more than 5 seconds.

More pertinent to the overall performance impact: I have not managed to submit a single partial in <5s since the introduction of alpha 4.5. I was hitting about 50% of partials in <5s for compressed plots before that release.

PhillWalker commented 1 year ago

The issue persists in RC3; I still see the late full proofs, and lots of random spikes in the quality-lookup warnings for non-full-proof lookups.

Voodoo-Chia commented 1 year ago

I tried rolling back to version 4.3 (from 4.5) and then upgraded to RC3.

I have seen no noticeable timing differences between those three builds. The average is consistently around 1.5 seconds, with peaks in the 5-10 second range.

Farming 4900 C4 + 50 C5 plots with CPU

System:
CPU: i3-6100
MB: Supermicro C7Z270-CG
RAM: 1x 8GB DDR4

OS: Windows 10, 22H2, 19045.3208

Thought I'd add this for comparison, maybe it can help to root cause.

Coyote-UK commented 1 year ago

The main difference between Voodoo-Chia's environment and both mine and PhillWalker's is that Voodoo-Chia has only 50 C5 plots, so there is a statistically small chance of multiple plots passing the filter for each challenge, whilst the two of us have > 2,000 C5 plots, so the chance is statistically high.

This MAY suggest the problem lies in parallel/sequential lookups for C5+ compression.
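
A rough sanity check of that hypothesis (a sketch, assuming the 1-in-512 plot filter shown in the simulator output above and treating filter passes as independent):

```python
import math

# Probability that 2 or more C5 plots pass the plot filter for the same
# signage point, modelling passes as Poisson with rate n_plots / 512.
def p_two_or_more(n_plots: int, filter_size: int = 512) -> float:
    lam = n_plots / filter_size
    return 1 - math.exp(-lam) * (1 + lam)

print(f"{p_two_or_more(2453):.2f}")  # ~0.95 for ~2,453 C5 plots
print(f"{p_two_or_more(50):.3f}")    # ~0.004 for 50 C5 plots
```

So with 2,000+ C5 plots, nearly every signage point has more than one C5 plot needing a decompression at the same time, whereas with 50 C5 plots that almost never happens, which is consistent with the parallel-lookup theory.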

Voodoo-Chia commented 1 year ago

> only 50 compressed plots,

4950 actually ;)

Coyote-UK commented 1 year ago

> only 50 compressed plots,
>
> 4950 actually ;)

Sorry, corrected the original post. Only 50 C5 plots.

PhillWalker commented 1 year ago

My Intel i3-2100 and i3-4130 seem OK and don't deliver lates.

It's the Intel Atom D2700s which have been affected worst by these new releases.

I can confirm these lates aren't caused by duplicate proofs at the same time or back-to-back ones.

PhillWalker commented 1 year ago

Issue persists in RC4

PhillWalker commented 1 year ago
Screenshot 2023-08-10 at 14 21 57 Screenshot 2023-08-10 at 14 21 42

Proof of lates over the last week attached, running RC3 and RC4

Coyote-UK commented 1 year ago

As there has been no activity on this for a while, I bit the bullet and rolled my farm back to alpha 4.3. Same hardware, same plots, same pool, same config.yaml (with the 2 decompressor entries duplicated with the decompresser spelling), and this is the result (image attached). If you compare it with the one above, it clearly demonstrates a significant performance degradation as the versions progressed.

Average latency: 2.6s on Alpha 4.1 & 3.4s on RC3 (an increase of 30%)
Highest latency: 26s on Alpha 4.1 & 105s (but I have seen over 500s recorded) on RC3 (an increase of 300%)

PhillWalker commented 1 year ago

I have rolled all 4 harvesters back to the fc.compression branch (Alpha 1, I believe). All lates have gone away, and best-case plot lookup times are 5-7 seconds quicker compared to RC4. These harvesters were getting 20% lates on RC4, evenly across them. Now back to no lates...

Screenshot 2023-08-11 at 17 33 00
Coyote-UK commented 1 year ago

There appear to be 2 issues at play here, increased average lookup times and sporadic excessive lookups. Both impact harvester performance, but I am not sure if one is a consequence of the other.

However, one issue is clearly the occasional unworkable lookup time. I have seen (in RC4) a lookup of over 500 seconds reported in the log. One instance of an excessively long lookup appears to have a lasting effect on the harvester, with output reduced for a period of time (estimated at up to an hour) after it occurs.

Any lookup is worthless if it cannot produce a partial in 30s or less, since most pools enforce 32s limits. Why not just drop any proof lookup if it doesn't complete in, say, 20s and report the failure to the log?

This would at least minimise the impact of that element.
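
To illustrate the suggestion (a sketch only, with placeholder names such as do_lookup and LOOKUP_BUDGET_SECONDS; this is not chia-blockchain's harvester code): run each lookup under a deadline and log a failure rather than letting a 100s+ lookup tie up the harvester.

```python
import concurrent.futures
import logging

log = logging.getLogger(__name__)
LOOKUP_BUDGET_SECONDS = 20  # placeholder budget, per the suggestion above

def lookup_with_deadline(executor, do_lookup, plot_path, challenge):
    """Run do_lookup(plot_path, challenge) but give up after the budget."""
    future = executor.submit(do_lookup, plot_path, challenge)
    try:
        return future.result(timeout=LOOKUP_BUDGET_SECONDS)
    except concurrent.futures.TimeoutError:
        log.warning("Dropping lookup for %s: exceeded %ss", plot_path, LOOKUP_BUDGET_SECONDS)
        return None
```

Note the worker thread keeps running after the timeout is hit; actually cancelling an in-flight plot lookup is presumably part of what makes a real fix harder than it looks.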

emlowe commented 1 year ago

Unfortunately, that is a rather long standing issue (see https://github.com/Chia-Network/chia-blockchain/issues/7788) and despite seeming relatively straightforward is more complicated than it appears.

But yes, this could be a part, maybe a large part, of the problem.

Coyote-UK commented 1 year ago

I've been following my hunch that it is linked to bad plots and am making progress.

I have aggressively removed plots that meet the following criteria from a chia plots check with the default 30 iterations (remember, ALL of these plots pass the basic plot check):

- Plots with fewer than 16 proofs found
- Plots with "Proof dropped due to line point compression" more than once
- Plots with "Quality doesn't match with proof" more than once
- Plots that report "Looking up qualities took:" longer than 5 seconds more than once
- Plots that report "Finding proof took:" longer than 30 seconds more than once

Replacing these plots (approximately 8% of my total farm after several iterations) has massively reduced the issue.
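
For anyone wanting to repeat that pruning pass, here is a rough sketch. It assumes the chia plots check output has been saved to a file and that each warning line contains the plot filename; the phrases, the log filename, and the thresholds are taken from the criteria above and may need adjusting to your version's exact wording (the "fewer than 16 proofs found" criterion would additionally need the per-plot summary lines).

```python
import re
from collections import Counter

# Warning phrases quoted from the criteria above.
SUSPECT_PHRASES = (
    "Proof dropped due to line point compression",
    "Quality doesn't match with proof",
    "Looking up qualities took",
    "Finding proof took",
)
PLOT_RE = re.compile(r"(plot-k32[\w-]+\.plot)")

def suspect_plots(check_log: str, max_hits: int = 1) -> list[str]:
    """Return plots whose warning count exceeds max_hits in a saved plots-check log."""
    hits: Counter = Counter()
    with open(check_log) as f:
        for line in f:
            if any(phrase in line for phrase in SUSPECT_PHRASES):
                match = PLOT_RE.search(line)
                if match:
                    hits[match.group(1)] += 1
    return sorted(plot for plot, count in hits.items() if count > max_hits)

print(suspect_plots("plots_check.log"))  # hypothetical saved log path
```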

I am having to use the bladebit_cuda plotter provided as part of alpha 4.3, as plots from any later one cannot be checked using the chia plots check command in alpha 4.3.

I want to be able to switch back to alpha 4.3 to validate that the performance difference still exists.

What has changed in the plot format since the alpha 4.3 plotter that would cause the check to fail? Does this suggest that I may need to completely replot to this newer plot format?

Is it possible that you are unable to reproduce the error in the farming software because the problem originates in imperfect plots and the farmer application is not handling the problem?

Plotting in Ubuntu 23.04 with 256GB RAM, Intel CPU, Nvidia Quadro M5000 GPU (CUDA 5.2, 8GB vRAM)

PhillWalker commented 1 year ago

The issue persists with 2.0 RC5, which uses chiapos 2.0.1. Approximately 20% lates on the lower-end hardware; no lates on the fc.compression build for comparison.

PhillWalker commented 1 year ago

The issue persists with 2.0 RC6, which uses chiapos 2.0.2. Approximately 20% lates on the lower-end hardware; no lates on the fc.compression build for comparison.

PhillWalker commented 1 year ago

The issue persists with the final 2.0 public release. This is using the official Chia docker image built by Chia rather than a self-built one, now that it is available.

Coyote-UK commented 1 year ago

Strangely, I didn't see a single late partial submission whilst using RC6, but have already seen 2 late partial submissions on 2.0.0 (image attached).

Did the 'timeout' introduced in RC5/6 not make it into the final release?

PhillWalker commented 1 year ago

Issue persists in 2.0.1

PhillWalker commented 1 year ago

Issue persists in 2.1.0

wjblanke commented 11 months ago

What CPU are you running on the Synology? Have you actually farmed a block successfully with the alpha 1 version? We are wondering whether it is fully correct, or flawed but fast.

PhillWalker commented 11 months ago

Intel Atom D2700

2 cores, 4 threads

wjblanke commented 11 months ago

Intel® SSE2, Intel® SSE3, Intel® SSSE3

No AVX. Maybe it's an instruction-set issue at compile time. Did you have any luck making blocks with the alpha version? I will look at the green reaper compiler settings.
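
For reference, a quick way to check which SIMD extensions a Linux harvester CPU actually exposes (the D2700 should list sse2 and ssse3 but no avx):

```python
# Read the CPU feature flags the kernel reports, to confirm whether the
# harvester box has AVX at all.
with open("/proc/cpuinfo") as f:
    flags = next(line for line in f if line.startswith("flags")).split()

print({ext: ext in flags for ext in ("sse2", "ssse3", "sse4_1", "avx", "avx2")})
```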

PhillWalker commented 11 months ago

I can confirm I won multiple blocks on this version across a 1.5PiB farm, on multiple harvesters.

PhillWalker commented 11 months ago

@Coyote-UK's CPU is an i7-4500U, for reference, with the same issue.

Coyote-UK commented 11 months ago

Confirmed: i7-4500U, 16GB RAM, SSD for the DB, Win 10, 2472 C5 plots (created with alpha 4), extensive pruning to remove bad plots. No blocks found between heights 3,798,055 and 4,155,045 (equivalent to 2.91 x ETW), which covers ALL compressed-plot farming up to the release of 2.0.0. I accept that statistically this is possible, but it is a big coincidence.
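
As a back-of-the-envelope check of how big that coincidence is (assuming block wins follow a Poisson process, so the expected number of wins over 2.91 x ETW is 2.91):

```python
import math

# Chance of zero wins across a 2.91 x ETW window under a Poisson model.
print(f"{math.exp(-2.91):.3f}")  # ~0.054, i.e. roughly a 5% chance
```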

wjblanke commented 11 months ago

It certainly sounds like there is a performance problem if you were able to win blocks with the old version. It will be difficult for us to regression test this, since the merge was so long ago. If there is any other information you can give us, please let us know. I do appreciate the testing you have done so far.