Chia-Network / chia-blockchain

Chia blockchain python implementation (full node, farmer, harvester, timelord, and wallet)
Apache License 2.0
10.82k stars 2.03k forks source link

[Bug] Harvester is blocked while scanning for plots #9700

Closed cryptosven76 closed 2 years ago

cryptosven76 commented 2 years ago

What happened?

i have a fairly large setup, ~25000 plots on a single system and i try to get this working with a single chia instance. i have tried all kind of tuning of batch sizes and times in between batches , but either it takes hours to scan all the files or the farmer hangs when the scan takes more than 90 seconds between batches and stops working completely.

here is an example of where the scan of 500 files took a few seconds for >400 of them but then the scan got blocked for >200 seconds and that caused the harvester to fail :

2021-12-30T19:09:14.067 harvester chia.harvester.harvester: WARNING Looking up qualities on /plats/plat-0/plot-k32-2021-07-30-16-17-9546068df8d6f59109fc3f12d14de071b25864b074f4bc478a7ecd24bb2185c1.plot took: 373.83388352394104. This should be below 5 seconds to minimize risk of losing rewards. 2021-12-30T19:09:14.067 harvester chia.harvester.harvester: WARNING Looking up qualities on /plats/plat-2/plot-k32-2021-12-22-10-33-2fad6f02abbc9ce5481751ea4638c7dbd55643bf8dba1bc19b5bb5152b3c442a.plot took: 373.8340108394623. This should be below 5 seconds to minimize risk of losing rewards. 2021-12-30T19:09:14.067 harvester chia.harvester.harvester: WARNING Looking up qualities on /plats/plat-1/plot-k32-2021-12-21-12-13-700f200782b56de6b33d42beb4694f7ff280cb203a159ebc474a61de889ed3bc.plot took: 373.8341381549835. This should be below 5 seconds to minimize risk of losing rewards. 2021-12-30T19:09:14.067 harvester chia.harvester.harvester: WARNING Looking up qualities on /plats/plat-2/plot-k32-2021-12-21-05-57-b69c3692f64c002f382e4362055ecb3083a0fc9acbda2c0be07b16629f96b5e9.plot took: 373.8342800140381. This should be below 5 seconds to minimize risk of losing rewards. 2021-12-30T19:09:14.067 harvester chia.harvester.harvester: WARNING Looking up qualities on /plats/plat-1/plot-k32-2021-08-10-02-26-12b20c9482099b6259205ee8d419350c5f87495317764c2ab54cde308e67f9e8.plot took: 373.83441138267517. This should be below 5 seconds to minimize risk of losing rewards. 2021-12-30T19:09:14.067 harvester chia.harvester.harvester: WARNING Looking up qualities on /plats/plat-3/plot-k32-2021-08-01-14-39-dfb47a29d3fd09b5bc10bf03760131041311efbed0d4e988f3dcafbf89f1dfc7.plot took: 373.83453941345215. This should be below 5 seconds to minimize risk of losing rewards. 2021-12-30T19:09:14.068 harvester chia.harvester.harvester: INFO 21 plots were eligible for farming ab736b4c76... Found 0 proofs. Time: 373.83475 s. Total 10750 plots 2021-12-30T19:09:14.439 harvester chia.rpc.rpc_server : WARNING Sending data failed. Exception Traceback (most recent call last): File "/root/chia-blockchain/chia/rpc/rpc_server.py", line 72, in _state_changed await self.websocket.send_str(dict_to_json_str(payload)) File "/root/chia-blockchain/venv/lib/python3.7/site-packages/aiohttp/client_ws.py", line 150, in send_str await self._writer.send(data, binary=False, compress=compress) File "/root/chia-blockchain/venv/lib/python3.7/site-packages/aiohttp/http_websocket.py", line 687, in send await self._send_frame(message, WSMsgType.TEXT, compress) File "/root/chia-blockchain/venv/lib/python3.7/site-packages/aiohttp/http_websocket.py", line 643, in _send_frame self._write(header + mask + message) File "/root/chia-blockchain/venv/lib/python3.7/site-packages/aiohttp/http_websocket.py", line 660, in _write raise ConnectionResetError("Cannot write to closing transport") ConnectionResetError: Cannot write to closing transport . 2021-12-30T19:09:14.844 harvester chia.rpc.rpc_server : WARNING Sending data failed. Exception Traceback (most recent call last): File "/root/chia-blockchain/chia/rpc/rpc_server.py", line 72, in _state_changed await self.websocket.send_str(dict_to_json_str(payload)) File "/root/chia-blockchain/venv/lib/python3.7/site-packages/aiohttp/client_ws.py", line 150, in send_str await self._writer.send(data, binary=False, compress=compress) File "/root/chia-blockchain/venv/lib/python3.7/site-packages/aiohttp/http_websocket.py", line 687, in send await self._send_frame(message, WSMsgType.TEXT, compress) File "/root/chia-blockchain/venv/lib/python3.7/site-packages/aiohttp/http_websocket.py", line 643, in _send_frame self._write(header + mask + message) File "/root/chia-blockchain/venv/lib/python3.7/site-packages/aiohttp/http_websocket.py", line 660, in _write raise ConnectionResetError("Cannot write to closing transport") ConnectionResetError: Cannot write to closing transport . 2021-12-30T19:09:15.742 full_node chia.full_node.full_node: INFO ⏲️ Finished signage point 38/64: CC: a94a62e698002d66ae26168685862d3cbb8e308a091dc739e5d616f8146c5e4c RC: 9a8f893844530163e85dff3000cb83457a4f49385beaff76ed6c19ca102cf007

i thought there was a new thread pool now available for the harvester, can't this be changed that the scan is done from a completely separate set of threads which are not processing proofs ?

on this example here you can see at the end of one of the batches all messages on the eligable plots come back at the same time, while there should be 10 seconds in between each :

2021-12-30T19:23:45.533 harvester chia.plotting.manager : INFO Found plot /plat-0/plot-k32-2021-12-21-06-04-9f15db987ad6a30ffc2f4fb1dcfb99816967cf9f0b3d6227c20614bc5c0241fb.plot of size 32 2021-12-30T19:23:46.053 harvester chia.plotting.manager : INFO Found plot /plats/plat-0/plot-k32-2021-12-22-19-41-9a113a62f93ff95d45e75ee34dba82fb7d59b2de0b509e98d7541e6588038964.plot of size 32 2021-12-30T19:23:46.574 harvester chia.plotting.manager : INFO Found plot /plats/plat-0/plot-k32-2021-12-23-12-38-a72e3e0831c9261bb8a6138cf79152e581f0fdd9435117950bdfe394cb0cff25.plot of size 32 2021-12-30T19:23:48.252 full_node chia.full_node.full_node: INFO Added unfinished_block b85590c79283390730cec4e035149d7f1bbfef8df755add3bab8d265c9166d11, not farmed by us, SP: 5 farmer response time: 2.6099, Pool pk xch1g7k3dtsr6ehv0z39za3e942z3fhredhmanq4fjw57squyp56aqvq642h0f, validation time: 0.0200 seconds, pre_validation time 0.0123, cost: 87324044, percent full: 0.794% 2021-12-30T19:23:48.719 harvester chia.harvester.harvester: WARNING Looking up qualities on /plats/plat-0/plot-k32-2021-12-20-23-57-092591211a01ae7beb82b02dd9970a646db8c228e926de6919925937de1a0c8c.plot took: 148.5926070213318. This should be below 5 seconds to minimize risk of losing rewards. 2021-12-30T19:23:48.732 harvester chia.harvester.harvester: WARNING Looking up qualities on /plats/plat-0/plot-k32-2021-12-21-05-40-5dc42fe9da2318e997df13a5fa2c1ec5b535b196ac7d4500505573412ee5129a.plot took: 148.60557007789612. This should be below 5 seconds to minimize risk of losing rewards. 2021-12-30T19:23:48.732 harvester chia.harvester.harvester: INFO 2 plots were eligible for farming 27aab2d0eb... Found 0 proofs. Time: 148.60623 s. Total 2000 plots 2021-12-30T19:23:48.772 harvester chia.harvester.harvester: INFO 5 plots were eligible for farming 38b1a038bc... Found 0 proofs. Time: 0.47334 s. Total 2000 plots 2021-12-30T19:23:48.777 harvester chia.harvester.harvester: INFO 4 plots were eligible for farming 38b1a038bc... Found 0 proofs. Time: 0.73036 s. Total 2000 plots 2021-12-30T19:23:48.783 harvester chia.harvester.harvester: INFO 3 plots were eligible for farming 38b1a038bc... Found 0 proofs. Time: 0.18529 s. Total 2000 plots 2021-12-30T19:23:48.795 harvester chia.harvester.harvester: INFO 5 plots were eligible for farming 38b1a038bc... Found 0 proofs. Time: 0.29530 s. Total 2000 plots 2021-12-30T19:23:48.796 harvester chia.harvester.harvester: INFO 7 plots were eligible for farming 38b1a038bc... Found 0 proofs. Time: 0.62313 s. Total 2000 plots 2021-12-30T19:23:48.806 harvester chia.harvester.harvester: INFO 4 plots were eligible for farming 38b1a038bc... Found 0 proofs. Time: 0.40742 s. Total 2000 plots 2021-12-30T19:23:48.811 harvester chia.harvester.harvester: INFO 7 plots were eligible for farming 27aab2d0eb... Found 0 proofs. Time: 0.88853 s. Total 2000 plots 2021-12-30T19:23:48.815 harvester chia.harvester.harvester: INFO 7 plots were eligible for farming 27aab2d0eb... Found 0 proofs. Time: 1.14715 s. Total 2000 plots 2021-12-30T19:23:48.857 harvester chia.harvester.harvester: INFO 2 plots were eligible for farming 27aab2d0eb... Found 0 proofs. Time: 1.06050 s. Total 2000 plots 2021-12-30T19:23:48.867 harvester chia.harvester.harvester: INFO 3 plots were eligible for farming 27aab2d0eb... Found 0 proofs. Time: 1.55475 s. Total 2000 plots 2021-12-30T19:23:48.895 harvester chia.harvester.harvester: INFO 6 plots were eligible for farming 27aab2d0eb... Found 0 proofs. Time: 2.11637 s. Total 2000 plots 2021-12-30T19:23:48.922 harvester chia.harvester.harvester: INFO 4 plots were eligible for farming 27aab2d0eb... Found 0 proofs. Time: 1.70785 s. Total 2000 plots 2021-12-30T19:23:48.948 harvester chia.harvester.harvester: INFO 3 plots were eligible for farming 27aab2d0eb... Found 0 proofs. Time: 1.40531 s. Total 2000 plots 2021-12-30T19:23:48.956 harvester chia.harvester.harvester: INFO 4 plots were eligible for farming 27aab2d0eb... Found 0 proofs. Time: 1.93793 s. Total 2000 plots

Version

1.2.11 but same with latest master

What platform are you using?

Linux

What ui mode are you using?

CLI

Relevant log output

No response

stefan-lange commented 2 years ago

No real solution, but a workaround. If you are familar with docker you could create multiple harvesters on the same machine. Each of it with different plot folders assigned. This should speed up initial start.

It should also work without docker and different app folders.

xdustinface commented 2 years ago

or the farmer hangs when the scan takes more than 90 seconds between batches and stops working completely.

@cryptosven76 This ^ together with the thrown exception in rpc server's _state_changed sounds like this is related to the harvester cache updates. We are aware that this can lead to trouble on large farms (besides other stuff i guess) but we are on it and i have a PR open to solve at least the caching related problem, see https://github.com/Chia-Network/chia-blockchain/pull/9695. You could try this PR and play with the refresh parameters to get it working better. But keep in mind that this needs to be installed on the farmer and the harvester if your harvester is remote. I would also be interested to get your debug logs with DEBUG log level enabled. If you don't mind you can send them on keybase to https://keybase.io/dustinface.

cryptosven76 commented 2 years ago

No real solution, but a workaround. If you are familar with docker you could create multiple harvesters on the same machine. Each of it with different plot folders assigned. This should speed up initial start.

It should also work without docker and different app folders.

Hi Stefan-lange, i am very aware of this setup, thats what i am running on my other systems, but this comes with other scaling challenges. just out of curiosity what is the number of plots you have per harvester ? it seems that everything up to ~5k plots works , as soon as you exceed that number you run in all kind of scaling issues per instance.

cryptosven76 commented 2 years ago

@xdustinface i am happy to try this out, is there a particular branch i can checkout and try ? not sure if applying 11 patches manually will work well. its all on one node, so as long as i get a commit or branch i can checkout i cn try this.

xdustinface commented 2 years ago

@xdustinface i am happy to try this out, is there a particular branch i can checkout and try ? not sure if applying 11 patches manually will work well. its all on one node, so as long as i get a commit or branch i can checkout i cn try this.

Yeah sure just the PR's branch (https://github.com/Chia-Network/chia-blockchain/tree/pr-protocol-plot-sync). FYI, you can see it always at the top of the PR site like here xdustinface wants to merge 11 commits into main from pr-protocol-plot-sync.

stefan-lange commented 2 years ago

No real solution, but a workaround. If you are familar with docker you could create multiple harvesters on the same machine. Each of it with different plot folders assigned. This should speed up initial start. It should also work without docker and different app folders.

Hi Stefan-lange, i am very aware of this setup, thats what i am running on my other systems, but this comes with other scaling challenges. just out of curiosity what is the number of plots you have per harvester ? it seems that everything up to ~5k plots works , as soon as you exceed that number you run in all kind of scaling issues per instance.

I have only a tiny farm at this moment with ~2.200 plots. One PI with node, farmer and 1 harvester (separate containers) and 1 harvester on the plotter for new plots until I move them. Everything works in docker swarm. The plan months go was to connect different locations via VPN, all in docker swarm with special health checks, central monitoring and configuration as code. That simplifies maintenance/updates on my side in the long term.

Looking forward to a solution to your problem, which I might encounter in a few months.

github-actions[bot] commented 2 years ago

This issue has not been updated in 14 days and is now flagged as stale. If this issue is still affecting you and in need of further review, please comment on it with an update to keep it from auto closing in 7 days.

github-actions[bot] commented 2 years ago

This issue was automatically closed because it has been flagged as stale, and subsequently passed 7 days with no further activity from the submitter or watchers.