Chia-Network / chia-blockchain

Chia blockchain python implementation (full node, farmer, harvester, timelord, and wallet)
Apache License 2.0

[Bug] Issues with harvester with hundreds of disks #11225

Closed Ntohpn closed 2 years ago

Ntohpn commented 2 years ago

What happened?

I hope the official team will pay attention. Many large-scale farmers are now becoming nodeless farmers because the harvester is too inefficient. For example, I have 100-200 hard disks on one computer. If I run only one harvester, it does not work: it keeps restarting with errors, it is slow, and sometimes it produces no computing power at all. So I chose hpool and became a nodeless miner using the disk-scanning tool they provide. This goes against my blockchain beliefs, but the reality is that the harvester is too difficult to use. Would it be possible to allow multiple harvesters on one computer, so that each harvester is responsible for only part of the disks?

Version

1.3.4

What platform are you using?

Windows

What ui mode are you using?

GUI

Relevant log output

No response

jmhands commented 2 years ago

We have profiled 300+ drives on a single harvester instance with consistent lookup times of 1.2 seconds. There are multiple improvements coming, with commits already in the main branch, but the workaround for now is to set the plot loading interval very high if you are not plotting anymore.

haun2022 commented 2 years ago

Where do I set the plot loading frequency to be very high?

Ntohpn commented 2 years ago

Where do I set the plot loading frequency to be very high?

+1

Ntohpn commented 2 years ago

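We have profiled 300+ drives on a single harvester instance with consistent lookup times of 1.2 seconds. There are multiple improvements coming with commits already in the main branch, but the workaround for now is to set plot loading frequency very high if you are not plotting anymore.

What exactly do I need to do? Will installing 1.3.4 solve it?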

Ntohpn commented 2 years ago

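I suggest designing the harvester so that it makes full use of CPU resources when reading the disks. Plenty of system resources sit unused, yet harvesting becomes inefficient at around 100 disks.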

emlowe commented 2 years ago

We have some more improvements coming in a later release after 1.3.4 - what JM was suggesting was to set the following in config.yaml:

harvester:
  plots_refresh_parameter:
    interval_seconds: 120 # The interval in seconds to refresh the plot file manager
    retry_invalid_seconds: 1200 # How long to wait before re-trying plots which failed to load
    batch_size: 300 # How many plot files the harvester processes before it waits batch_sleep_milliseconds
    batch_sleep_milliseconds: 1 # Milliseconds the harvester sleeps between batch processing

You can set interval_seconds to a very large number like 86400, which will refresh the plot list once a day. Obviously, if you are actively plotting, this means new plots will show up only after a day. You probably don't need to adjust the other parameters.
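
As a sketch of that suggestion, the block might look like this on a farm that is no longer plotting. Only interval_seconds differs from the values quoted above. 86400 is simply 24 * 60 * 60 seconds, and whether it suits a particular farm is an assumption to check against your own setup.

```yaml
harvester:
  plots_refresh_parameter:
    interval_seconds: 86400 # assumption: refresh once a day (24 * 60 * 60 s), raised from 120
    retry_invalid_seconds: 1200 # unchanged from the values quoted above
    batch_size: 300 # unchanged
    batch_sleep_milliseconds: 1 # unchanged
```

With a value this large, newly added plots would only appear after the next refresh, which can be up to a day later.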

Ntohpn commented 2 years ago

We have some more improvements coming in a later release after 1.3.4 - what JM was suggesting was to set the following in config.yaml:

harvester:
  plots_refresh_parameter:
    interval_seconds: 120 # The interval in seconds to refresh the plot file manager
    retry_invalid_seconds: 1200 # How long to wait before re-trying plots which failed to load
    batch_size: 300 # How many plot files the harvester processes before it waits batch_sleep_milliseconds
    batch_sleep_milliseconds: 1 # Milliseconds the harvester sleeps between batch processing

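You can set interval_seconds to a very large number like 86400, which will refresh the plot list once a day. You probably don't need to adjust the other parameters.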

I hope harvester performance can be optimized further.

emlowe commented 2 years ago

Closing issue, we believe this is resolved by proper tweaking of interval_seconds. However, further improvements are already in the pipeline (see #11204 and #9903)

Ntohpn commented 2 years ago

I tried this, and it really does not solve the harvester problem.

Ntohpn commented 2 years ago

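Closing issue, we believe this is resolved by proper tweaking of interval_seconds. However, further improvements are already in the pipeline (see #11204 and #9903)

However, it does not solve the problem. With 170 drives and more than 20,000 plots, the harvester stops working.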

grobalt commented 2 years ago

Currently it is horrible. I have a node with 7 JBODs of 106 disks each and one remote harvester with 106 disks. Currently the farmer is only using the remote 106 disks, as the local disks get kicked out after a few hours. Restarting the service works without any problems, but some hours later chia is using only the remote plots and drops all local plots. Response times are good initially, but after a few hours the response time is around a minute. The idea was to be energy efficient: one farmer (still a 64-core EPYC) with multiple JBODs ....

Ntohpn commented 2 years ago

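Currently it is horrible. I have a node with 7 JBODs of 106 disks each and one remote harvester with 106 disks. Currently the farmer is only using the remote 106 disks, as the local disks get kicked out after a few hours. Restarting the service works without any problems, but some hours later chia is using only the remote plots and drops all local plots. Response times are good initially, but after a few hours the response time is around a minute. The idea was to be energy efficient: one farmer (still a 64-core EPYC) with multiple JBODs ....

It is exactly because of all these problems that I was forced to become a nodeless miner.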

grobalt commented 2 years ago

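Currently it is horrible. I have a node with 7 JBODs of 106 disks each and one remote harvester with 106 disks. Currently the farmer is only using the remote 106 disks, as the local disks get kicked out after a few hours. Restarting the service works without any problems, but some hours later chia is using only the remote plots and drops all local plots. Response times are good initially, but after a few hours the response time is around a minute. The idea was to be energy efficient: one farmer (still a 64-core EPYC) with multiple JBODs ....

It is exactly because of all these problems that I was forced to become a nodeless miner.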

sorry, even google translate does not help .. please answer in english (or german :-p )

Ntohpn commented 2 years ago

I have seen the follow-up optimizations and am currently using the 1.3.6 beta.

grobalt commented 2 years ago

Still happening .... 2-3 days stable, better than before, but still losing local plots (ca. 700 18 TB disks)

Ntohpn commented 2 years ago

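Still happening .... 2-3 days stable, better than before, but still losing local plots (ca. 700 18 TB disks)

We still need to wait for the official team to keep improving this.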