Chia-Network / chia-blockchain

Chia blockchain python implementation (full node, farmer, harvester, timelord, and wallet)
Apache License 2.0

[Bug] 2.0.0-rc3 silently stops sending partials to a pool after a few minutes if a compressed plot is present #15949

Closed TheSpearman closed 1 year ago

TheSpearman commented 1 year ago

What happened?

Running 2.0.0-rc3 on Debian 12. The client starts normally (though it does complain that it can't initialise GPU compression) and begins sending partials to the pool as usual. If any compressed plots are present in any of the plot locations, it silently stops sending partials to the pool after about 10 minutes and doesn't recover. Externally the processes seem fine: all of them respond, and the blockchain and wallet stay synced. If I remove the compressed plots and restart, it carries on without issue. There are no hardware issues and no kernel log errors, and Flexfarmer continues without issue, albeit by ignoring the compressed plots.

The compressed plots were made with Bladebit Cuda v3-rc1, also on Debian 12.

This can be reproduced without fail every time. If I start it with only uncompressed plots and then add the location of the compressed ones, it silently fails, again after about 10 minutes.

This occurred on rc2 as well.

The logs don't show anything obviously useful.

I've done completely fresh installs of the client from source and .deb package, both show the same, as well as starting with a completely fresh config.yaml and freshly added keys.

Version

2.0.0-rc3

What platform are you using?

Linux

What ui mode are you using?

CLI

Relevant log output

The logs don't show anything obviously useful.

AtomicInternet2 commented 1 year ago

Same problem on Windows 11 CPU Harvesting 2.0.0-rc3 - Clean OS Install

I can verify this happens on a clean Windows 11 install as well, using CPU harvesting with C5 compression. This is on a fresh OS reload and a fresh install of the 2.0.0-rc3 client (no previous Chia client instances ever existed).

Ran for 7 hours with 4410 uncompressed (standard) plots perfectly fine. Added a drive of 205 compressed C5 plots and the harvester stopped reporting 10 minutes after the drive was added.
No errors in the log. Restarted the harvester and it continued to report for another 15 minutes, then went silent again.
I now monitor the logs for the absence of "plots were eligible for farming" and restart the harvester every 10-15 minutes.

Full System Details: http://atomicinternet.homeip.net/crypto/
Motherboard: MSI B450-A PRO MAX
Platform: Custom Built Windows 11 x64
Processor name: Ryzen 7 5700G
Memory configuration: 32GB DDR4 3200 (2GB reserved for iGPU)
NVIDIA driver version: N/A
Full GPU details: Radeon Vega 8 iGPU (integrated graphics)

Magicwalker commented 1 year ago

Same problem on Windows 11 with a CPU harvester on 2.0.0-rc3 and a full node on the same version. My other harvesters, which have a mix of solo and NFT plots, don't have the issue. It only happens on the harvester with a mix of NFT plots, solo plots, and C7 compressed plots. I suspect the issue is in the 2.0.0-rc3 full node, because I downgraded the problematic harvester to 2.0.0-rc2 and got the same result.

AtomicInternet2 commented 1 year ago

I created a PowerShell script that runs every minute and restarts the harvester if there have been no partials, for anyone who needs it. Happy farming.

    # Read the last 50 lines of the harvester log
    $partialsText = Get-Content -Path "~\.chia\mainnet\log\debug.log" -Tail 50
    # -like needs wildcards to match a substring; it returns the matching lines
    if (-Not ($partialsText -like "*plots were eligible for farming*")) {
        # restart harvester
        cd ~\AppData\Local\Programs\Chia\resources\app.asar.unpacked\daemon
        .\chia.exe start harvester -r
    }
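
For anyone not on Windows, the same watchdog idea can be sketched in Python. This is only an illustration of the check described above, not part of the Chia client: the function names `harvester_is_reporting` and `restart_harvester_if_silent` are made up for this example, and it assumes the `chia` executable is on the PATH and that the log line text is the one quoted in this thread.

```python
import subprocess
from collections import deque

# Log text the harvester writes on each signage point check (as quoted in this thread).
MARKER = "plots were eligible for farming"


def harvester_is_reporting(log_path, tail_lines=50, marker=MARKER):
    """Return True if `marker` appears in the last `tail_lines` lines of the log."""
    with open(log_path, "r", encoding="utf-8", errors="replace") as fh:
        # deque with maxlen keeps only the final lines while streaming the file
        tail = deque(fh, maxlen=tail_lines)
    return any(marker in line for line in tail)


def restart_harvester_if_silent(log_path, chia_cmd="chia"):
    """Restart the harvester when the marker is missing; return True if restarted.

    `chia_cmd` is an assumption: adjust it to the full path of your chia binary.
    """
    if harvester_is_reporting(log_path):
        return False
    subprocess.run([chia_cmd, "start", "harvester", "-r"], check=True)
    return True
```

Calling `restart_harvester_if_silent` once a minute from cron or a systemd timer, with the path to `~/.chia/mainnet/log/debug.log`, would mirror the scheduled task above. Like the PowerShell version, it is a workaround and not a fix.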

bramv-chia commented 1 year ago

@TheSpearman as an experiment could you try the following:

Does the same partial stoppage still occur after some time?

AtomicInternet2 commented 1 year ago

I am CPU farming only, I have no CUDA devices on this machine.
However, a new exciting development: I set the following config settings and restarted my harvester over 3 hours ago. Appears to be stable now. My script has initiated no harvester restarts since 10:30am EST and still going strong. No lates to the pool either. However, only 205 of my 4205 plots on this farm are C5 compressed. The rest are uncompressed, so I'll see how this develops.

decompressor_thread_count: 8
parallel_decompressor_count: 1

Anyone else following this, try the config above and restart your harvester.

emlowe commented 1 year ago

For CUDA issues, you could try using the bladebit_cuda simulate option to test this and help isolate the problem:

The command is bladebit_cuda simulate -s <farm_size> --power <simulation_seconds> <path_to_plot>

Example: bladebit_cuda simulate -s 1PiB --power 30 ~/plots/my.plot

emlowe commented 1 year ago

@AtomicInternet2 Were you previously just running without those in config.yaml at all - you didn't have any settings related to decompressor in the config?

The defaults I believe actually don't do any decompression at all, so your compressed plots would be read as uncompressed and this might cause all kinds of side effects

AtomicInternet2 commented 1 year ago

> @AtomicInternet2 Were you previously just running without those in config.yaml at all - you didn't have any settings related to decompressor in the config?
>
> The defaults I believe actually don't do any decompression at all, so your compressed plots would be read as uncompressed and this might cause all kinds of side effects

No. Before the change I had the default values of 0.

decompressor_thread_count: 0
parallel_decompressor_count: 0

I read that by default they take half your cores for the thread count, and half of that for the parallel decompressor count. I'm going to try upping the parallel decompressor count to 2 and watch to see what happens.

The harvester UI was reporting 205 C5 compressed plots with both settings, but maybe that was just the UI and not what was actually happening. The chia farm summary did report all 4205 plots though.

AtomicInternet2 commented 1 year ago

Added another drive of C5 plots and upped parallel_decompressor_count to 2. I'll see what happens in an hour.

According to UI:
Total plots: 4466
Total OG: 0
Total plotNFT: 4466
Plot Sizes
K32 4466 100%
Compression
C0 4058 91%
C5 408 9%

According to CLI:
.\chia.exe farm summary
Farming status: Farming
Total chia farmed: 11.750427916508
User transaction fees: 0.000427916508
Block rewards: 11.75
Last height farmed: 4055267
Local Harvester
   4466 plots of size: 433.939 TiB on-disk, 442.239 TiBe (effective)
Plot count for all harvesters: 4466
Total size of plots: 433.939 TiB, 442.239 TiBe (effective)
Estimated network space: 26.922 EiB
Expected time to win: 2 weeks
Note: log into your key using 'chia wallet show' to see rewards for each key

TheSpearman commented 1 year ago

@bramv-chia

Running bladebit_cuda simulation shows no issue. So the machine is fine with driver and CUDA.

With some experimentation, with and without parallel_decompressor_count set:

Leaving it at the default of 0 causes GPU initialization to fail. The client states that it is falling back to CPU decompression instead.

Setting it to 1, I can see GPU decompression working.

I've just tested at 0 again, and as before partials stop after about 10 minutes or so.

Even if the default is 0, and compression can't be used, it shouldn't stop partials from the uncompressed plots.

AtomicInternet2 commented 1 year ago

CPU: Ryzen 7 5700G with CPU harvesting only (No CUDA GPU present in system)

Added more compressed plots and changed parallel_decompressor_count to 2, still solid as a rock for 1.5 hours. I'm guessing there's a memory issue with whatever the default 0 values use. I haven't had a single restart since switching these values off the default of 0.

C0 4058 91%
C5 408 9%

decompressor_thread_count: 8
parallel_decompressor_count: 2

TheSpearman commented 1 year ago

I've been running for the last two hours now with GPU decompression working, actively adding more compressed plots during that time, all without issue.

emlowe commented 1 year ago

To clarify, setting decompressor_thread_count: 0 is basically like saying "I have no compressed plots and I don't want any".

If you DO have a compressed plot and have decompressor_thread_count: 0 that plot will NEVER be read properly for proofs and this probably puts the harvester into a bad state.

We expect a fix for this issue shortly.
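
Until that fix is available, a pre-flight check along these lines could flag the risky combination before the harvester starts. It is a rough sketch, not client code: `decompressor_warnings` is a hypothetical helper, it assumes you have already loaded the `harvester` section of config.yaml into a dict by whatever means you prefer, and it treats both settings discussed in this thread as suspect when left at 0.

```python
def decompressor_warnings(harvester_cfg, compressed_plot_count):
    """Return warning strings for decompressor settings left at 0.

    harvester_cfg: dict holding the harvester settings (assumed to be the
        `harvester` section of config.yaml, already parsed).
    compressed_plot_count: number of compressed plots (C1 and up) on this harvester.
    """
    warnings = []
    if compressed_plot_count <= 0:
        # No compressed plots, so a value of 0 is harmless.
        return warnings
    for key in ("decompressor_thread_count", "parallel_decompressor_count"):
        # A missing key is treated like the default of 0 reported in this thread.
        if int(harvester_cfg.get(key, 0)) == 0:
            warnings.append(
                f"{key} is 0 but {compressed_plot_count} compressed plots are present"
            )
    return warnings
```

For example, `decompressor_warnings({"decompressor_thread_count": 8, "parallel_decompressor_count": 1}, 205)` returns an empty list, matching the settings reported as stable above.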

bramv-chia commented 1 year ago

The fix is expected to land in the RC5 build.