Chia-Network / chia-blockchain

Chia blockchain python implementation (full node, farmer, harvester, timelord, and wallet)
Apache License 2.0

[Bug] plots beyond ~4400 = harvester 100.0 load, cache_hit: false, plots check hangs before challenges #13084

Closed: malventano closed this issue 2 years ago

malventano commented 2 years ago

What happened?

I noticed that for the last few releases, chia_harvester has been pegging a thread continuously while farming.

Info:

Elaboration:

Possible causes:

Version

1.5.0

What platform are you using?

Linux

What ui mode are you using?

CLI

Relevant log output

No response

malventano commented 2 years ago

Updated to 1.5.1 and cleared all settings, starting clean.

wallentx commented 2 years ago

Can you try to delete ~/.chia/mainnet/cache/plot_manager.dat, and restart the harvester? This will repopulate the plot location cache, but note that on the initial discovery of all your plots, this may take a while to complete. All following startups will use the cache and should not experience the same load.
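
For anyone scripting that reset, a minimal sketch (the path is the default mainnet location mentioned above; adjust it if your CHIA_ROOT is customized):

```python
# Sketch: remove the plot manager cache so the harvester rebuilds it
# on the next startup. Path assumes a default mainnet install.
from pathlib import Path

cache_file = Path.home() / ".chia" / "mainnet" / "cache" / "plot_manager.dat"
if cache_file.exists():
    cache_file.unlink()
    print(f"Removed {cache_file}; restart the harvester to repopulate it.")
else:
    print(f"No cache file at {cache_file}; nothing to do.")
```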

malventano commented 2 years ago

I've tried that multiple times while troubleshooting here, including during the update to 1.5.1 where I cleared the ~/.chia/mainnet folder (only imported blockchain and blockchain_wallet db's and re-added plot_directories to config). The subsequent startup does take a bit longer, and it recreates the cache file as expected, but all other symptoms remain identical.

Regardless of how long I farm, or even after repeated chia plots check operations, affected plots (most of them) still show cache_hit: False in the logs, and chia_harvester keeps running at 100-110 load while farming (never dropping below 100.0).

Re: running chia plots check: even if I use -g to filter down to a single plot, then after the plots finish loading, if that plot showed cache_hit: False the check operation hangs for several minutes before issuing challenges. If I filter only to plots that return cache_hit: True, challenges begin immediately after the plots are loaded.

zhangyongchang88 commented 2 years ago

I have a similar problem. Since moving to the new 1.5.0 release (and some 1.5 forks), the harvester's CPU has gone to 100%; it was previously under 20%.

(two screenshots) These are not the same machine; they are different machines.

zhangyongchang88 commented 2 years ago

(screenshot) You can see that Chia and two forks all have very high usage. All three are on version 1.5.0; the older versions used very little.

zhangyongchang88 commented 2 years ago

(screenshot) Each of my machines has 4000+ plots.

zhangyongchang88 commented 2 years ago

My system automatically deletes the C:\Users\Administrator.* folders on every startup, so the cache issue mentioned above does not exist for me.

malventano commented 2 years ago

My system automatically deletes the C:\Users\Administrator.* folders on every startup, so the cache issue mentioned above does not exist for me.

It could still be a caching-related issue since it would create a new cache on the next startup (and the cache is then used while the harvester runs). Either way, we won't know unless we can figure out a way to tell what those pegged harvester threads are doing.
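
One way to get that visibility (a sketch only, not something built into chia-blockchain): register a faulthandler signal in the harvester process so it dumps every Python thread's stack on demand.

```python
# Illustration: if added early in the harvester process (Unix-only),
# `kill -USR1 <harvester_pid>` will print all Python thread stacks to
# stderr, showing what a pegged thread is busy doing.
import faulthandler
import signal

faulthandler.register(signal.SIGUSR1, all_threads=True)
```

Alternatively, a sampling profiler such as py-spy (py-spy dump --pid <harvester_pid>) can take the same stack snapshot from an unmodified running process.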

zhangyongchang88 commented 2 years ago

Why does the new version use so much CPU?

xpuska777 commented 2 years ago

Same problem, harvester CPU 100%, Debian, GUI. This happens not only when the number of plots is over 4k. I have several hard drives of plots, and the problem appears when certain plot drives are attached to the harvester. Since I mixed plotters ("Chia Proof of space v1.0", an early version of Bladebit, and the MadMAX plotter), I have a hunch that this comes down to one particular plotter, although I may be wrong.

malventano commented 2 years ago

All of my plots should be BladeBit replots (for pooling), so that shouldn't be the issue.

maxcom99 commented 2 years ago

4078 bladebit plots, same hang when trying to check plots, even with -n 5. I waited several hours and nothing happened. The plots are broken up into directories of ~100 each, if that's relevant. Version 1.5.1.dev.

xdustinface commented 2 years ago

@malventano Can you please provide some logs with DEBUG log level?

Also:

still see cache_hit: false on a large portion of plots.

Not sure I understand what you mean by this. Do you see it on each refresh event for the same plots while the harvester is running, or only after startup and for random plots, or what exactly is going on there?

malventano commented 2 years ago

@malventano Can you please provide some logs with DEBUG log level?

I poked around DEBUG level logs but nothing obvious stood out. Which activities would you like DEBUG logs for? Harvester startup with/without plot_manager.dat present? While attempting a plots check?

Not sure I understand what you mean by this. Do you see it on each refresh event for the same plots while the harvester is running, or only after startup and for random plots, or what exactly is going on there?

I see these events in the log during harvester start and when attempting a plots check. Example:

2022-09-14T17:55:08.360  chia.plotting.cache              : INFO     Loaded 28925466 bytes of cached data
2022-09-14T17:55:08.669  chia.plotting.check_plots        : INFO     event: started, loaded 0 plots, 24286 remaining
2022-09-14T17:55:08.772  chia.plotting.manager            : INFO     Only loading plots that contain "/mnt/e" in the file or directory name
2022-09-14T17:55:08.784  chia.plotting.check_plots        : INFO     event: batch_processed, loaded 0 plots, 23986 remaining
2022-09-14T17:55:08.784  chia.plotting.manager            : INFO     Only loading plots that contain "/mnt/e" in the file or directory name
(...)
2022-09-14T17:55:09.104  chia.plotting.check_plots        : INFO     event: batch_processed, loaded 0 plots, 13486 remaining
2022-09-14T17:55:09.104  chia.plotting.manager            : INFO     Only loading plots that contain "/mnt/e" in the file or directory name
2022-09-14T17:55:09.201  chia.plotting.manager            : INFO     Found plot /mnt/e01/plot-k32-2021-08-20-06-17-008935afe5dc515551913458dc6690be94f7b2976db134928a55872252213646.plot of size 32, cache_hit: False
2022-09-14T17:55:09.220  chia.plotting.manager            : INFO     Found plot /mnt/e01/plot-k32-2021-08-26-19-39-592674758f32cef423d4d689d6130601ba9b4dda876addc57b6aa8d38851c611.plot of size 32, cache_hit: False
2022-09-14T17:55:09.225  chia.plotting.manager            : INFO     Found plot /mnt/e01/plot-k32-2021-08-26-20-51-f2ffde52daa51eaa90925aec44ad80280b4e6072549788b778d6eda35fda08bb.plot of size 32, cache_hit: False
2022-09-14T17:55:09.230  chia.plotting.manager            : INFO     Found plot /mnt/e01/plot-k32-2021-08-20-07-10-f0e01eb38bbbace2205dbf92e6d3bc02b06d9038c1009219363ffc654c472eca.plot of size 32, cache_hit: False
2022-09-14T17:55:09.247  chia.plotting.manager            : INFO     Found plot /mnt/e01/plot-k32-2021-08-26-18-28-2fdd05f2542dad6ab1e431e9f75634ac626356b397c3ed967f6920b230c419a6.plot of size 32, cache_hit: False
2022-09-14T17:55:09.277  chia.plotting.manager            : INFO     Found plot /mnt/e01/plot-k32-2021-08-26-10-24-97b3a3db56174d60a62083d39f4be3450f1e6ecfeeb131e51b4cc4a5576c2078.plot of size 32, cache_hit: False

...as compared with a check against a range that was cached properly:

2022-09-14T17:52:46.974  chia.plotting.cache              : INFO     Loaded 3597591420 bytes of cached data
2022-09-14T17:52:58.113  chia.plotting.check_plots        : INFO     event: started, loaded 0 plots, 24286 remaining
2022-09-14T17:52:58.222  chia.plotting.manager            : INFO     Only loading plots that contain "/mnt/d27" in the file or directory name
2022-09-14T17:52:58.233  chia.plotting.check_plots        : INFO     event: batch_processed, loaded 0 plots, 23986 remaining
2022-09-14T17:52:58.234  chia.plotting.manager            : INFO     Only loading plots that contain "/mnt/d27" in the file or directory name
(...)
2022-09-14T17:52:58.344  chia.plotting.check_plots        : INFO     event: batch_processed, loaded 0 plots, 19786 remaining
2022-09-14T17:52:58.344  chia.plotting.manager            : INFO     Only loading plots that contain "/mnt/d27" in the file or directory name
2022-09-14T17:52:58.349  chia.plotting.manager            : INFO     Found plot /mnt/d27/plot-k32-2021-08-08-05-46-ef9240da867fd01e48d817f6e1ddf0517b703754863a302f67422d8242edcfea.plot of size 32, cache_hit: True
2022-09-14T17:52:58.349  chia.plotting.manager            : INFO     Found plot /mnt/d27/plot-k32-2021-08-08-01-16-78a37f51777b07f9d5f0d395cc0c987e79ce0581d6b14fdf5e4ffd4f6abf9d02.plot of size 32, cache_hit: True
2022-09-14T17:52:58.350  chia.plotting.manager            : INFO     Found plot /mnt/d27/plot-k32-2021-08-08-05-56-8c22aa8c3118e76d1a6e4bba3cecb961b6872601eb40b1e0bd4f938dead70072.plot of size 32, cache_hit: True
2022-09-14T17:52:58.350  chia.plotting.manager            : INFO     Found plot /mnt/d27/plot-k32-2021-08-08-06-16-0cdab955ad3f372d27dd261f47c38f33e6a473e63c4b12c164b367d199bc4d59.plot of size 32, cache_hit: True
2022-09-14T17:52:58.350  chia.plotting.manager            : INFO     Found plot /mnt/d27/plot-k32-2021-08-08-06-26-647c3e28d82478748889b0484455f004cc04f97d52ed776156e4197c61c17688.plot of size 32, cache_hit: True
2022-09-14T17:52:58.350  chia.plotting.manager            : INFO     Found plot /mnt/d27/plot-k32-2021-08-08-06-26-978e931b10552fba6285214ae7de3142e9f43d1551e27cbd1b78d171145df96d.plot of size 32, cache_hit: True
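
As an aside, a quick illustrative way to tally hits vs. misses across a whole log (the file path here is an assumption; point it at your debug.log):

```python
# Count cache_hit: True vs. False lines in a harvester log.
import re
from collections import Counter

counts = Counter()
with open("debug.log") as log:  # e.g. ~/.chia/mainnet/log/debug.log
    for line in log:
        match = re.search(r"cache_hit: (True|False)", line)
        if match:
            counts[match.group(1)] += 1
print(counts)
```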

Those "Found plot" lines do not repeat during refresh events, but I did note these other errors appearing in INFO-level logs at regular intervals; they could be related:

2022-09-14T09:16:30.118 harvester chia.plotting.cache     : ERROR    Failed to save cache: Value 5794656522 does not fit into uint32, Traceback (most recent call last):
2022-09-14T09:18:32.225 harvester chia.harvester.harvester: INFO     _plot_refresh_callback: event done, loaded 0, removed 0, processed 24286, remaining 0, duration: 1.02 seconds, total plots: 24286
2022-09-14T10:58:45.556 harvester chia.plotting.cache     : ERROR    Failed to save cache: Value 5794656522 does not fit into uint32, Traceback (most recent call last):
2022-09-14T11:00:47.660 harvester chia.harvester.harvester: INFO     _plot_refresh_callback: event done, loaded 0, removed 0, processed 24286, remaining 0, duration: 0.88 seconds, total plots: 24286
2022-09-14T12:42:38.391 harvester chia.plotting.cache     : ERROR    Failed to save cache: Value 5794656522 does not fit into uint32, Traceback (most recent call last):
2022-09-14T12:44:40.528 harvester chia.harvester.harvester: INFO     _plot_refresh_callback: event done, loaded 0, removed 0, processed 24286, remaining 0, duration: 0.99 seconds, total plots: 24286
2022-09-14T14:25:30.126 harvester chia.plotting.cache     : ERROR    Failed to save cache: Value 5794656522 does not fit into uint32, Traceback (most recent call last):
2022-09-14T14:27:32.312 harvester chia.harvester.harvester: INFO     _plot_refresh_callback: event done, loaded 0, removed 0, processed 24286, remaining 0, duration: 0.98 seconds, total plots: 24286
xdustinface commented 2 years ago

2022-09-14T09:16:30.118 harvester chia.plotting.cache     : ERROR    Failed to save cache: Value 5794656522 does not fit into uint32, Traceback (most recent call last):

Can you also post the traceback which comes after this line ^

malventano commented 2 years ago

Can you also post the traceback which comes after this line ^

Oh, sorry, I was tailing the log through a grep for "harvester", which filtered out the traceback lines. Here's the most recent one in my log:

2022-09-14T17:55:46.871 harvester chia.plotting.cache     : ERROR    Failed to save cache: Value 5794656522 does not fit into uint32, Traceback (most recent call last):
  File "/root/chia-blockchain/chia/plotting/cache.py", line 121, in save
    serialized: bytes = bytes(disk_cache)
  File "/root/chia-blockchain/chia/util/streamable.py", line 616, in __bytes__
    self.stream(f)
  File "/root/chia-blockchain/chia/util/streamable.py", line 602, in stream
    field.stream_function(getattr(self, field.name), f)
  File "/root/chia-blockchain/chia/util/streamable.py", line 434, in stream_bytes
    write_uint32(f, uint32(len(item)))
  File "/root/chia-blockchain/chia/util/struct_stream.py", line 67, in __init__
    raise ValueError(f"Value {self} does not fit into {type(self).__name__}")
ValueError: Value 5794656522 does not fit into uint32
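
To make the failure mode concrete: the streamable serializer length-prefixes the cache bytes with a uint32 (the write_uint32(f, uint32(len(item))) frame above), so a serialized cache larger than 2**32 - 1 bytes can never be saved. A minimal illustration:

```python
# The length prefix is a uint32; the serialized cache reported in the
# error (5,794,656,522 bytes) exceeds its maximum, so saving fails.
UINT32_MAX = 2**32 - 1          # 4,294,967,295

serialized_len = 5_794_656_522  # value from the ValueError above
assert serialized_len > UINT32_MAX  # hence "does not fit into uint32"
```
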
xdustinface commented 2 years ago

Thanks, this helps 👍

malventano commented 2 years ago

Thanks, this helps 👍

Of course. Just remember I'm not sure if that error is related to the CPU use issue, but here's hoping! :)

xdustinface commented 2 years ago

Can you post the full debug.log here, or send it to me on Keybase (http://keybase.io/dustinface)?

xdustinface commented 2 years ago

Okay so.. it turned out that the cause of all this is plots created via the bladebit RAM plotter, where the DiskProver serializes into 524,659 bytes per plot, enough to push the combined serialized cache past the uint32 length limit (hence the recurring "Failed to save cache" errors and the persistent cache_hit: False).

The reason the DiskProver serializes into such a huge blob is that those plots seem to have 65,536 C2 entries (see the quick arithmetic check after the two pointer dumps below).

Table pointers from a plot in question, with table_begin_pointers[10] - table_begin_pointers[9] -> 262,144:

table_begin_pointers = {std::vector<unsigned long long>} size=11
 [0] = {unsigned long long} 0
 [1] = {unsigned long long} 262144
 [2] = {unsigned long long} 14839185408
 [3] = {unsigned long long} 28822208512
 [4] = {unsigned long long} 42911924224
 [5] = {unsigned long long} 57272958976
 [6] = {unsigned long long} 72367734784
 [7] = {unsigned long long} 89824165888
 [8] = {unsigned long long} 107538284544
 [9] = {unsigned long long} 107540119552
 [10] = {unsigned long long} 107540381696

Table pointers from a normally working plot with table_begin_pointers[10] - table_begin_pointers[9] -> 176:

table_begin_pointers = {std::vector<unsigned long long>} size=11
 [0] = {unsigned long long} 0
 [1] = {unsigned long long} 252
 [2] = {unsigned long long} 14839436976
 [3] = {unsigned long long} 28822365051
 [4] = {unsigned long long} 42911861451
 [5] = {unsigned long long} 57273202401
 [6] = {unsigned long long} 72368924901
 [7] = {unsigned long long} 89827257426
 [8] = {unsigned long long} 107543532882
 [9] = {unsigned long long} 107545250830
 [10] = {unsigned long long} 107545251006
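
A quick arithmetic check on those dumps (assuming 4 bytes per C2 entry at k=32; that per-entry size is my assumption, not something stated above):

```python
# The affected plot's C2 region is exactly 65,536 entries * 4 bytes,
# versus 176 bytes (a few dozen entries) for a normal plot.
print(65_536 * 4)                    # 262144
print(107540381696 - 107540119552)   # affected plot:  262144
print(107545251006 - 107545250830)   # normal plot:    176
```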

I'm going to talk with @harold-b about this and will post an update once we've figured it out.

maxcom99 commented 2 years ago

thank you! can't wait to test!