Chia-Network / chia-blockchain

Chia blockchain python implementation (full node, farmer, harvester, timelord, and wallet)
Apache License 2.0

[Bug] bladebit 3.0.0 ramplot crashes plotting c3 plots #16219

Open jayhohoho2019 opened 1 year ago

jayhohoho2019 commented 1 year ago

What happened?

Please see https://github.com/Chia-Network/bladebit/issues/389 for details.

This is with ramplot only (no cudaplot, no diskplot). It happens for c3 as well as c5, and possibly other compression levels.

Version

2.0.0

What platform are you using?

Linux

What ui mode are you using?

CLI

Relevant log output

~ Half the time:

Finished Phase 2 in 26.67 seconds.
Running Phase 3
  Compressing tables 2 and 3...
STDERR:

STDERR: Fatal Error:

STDERR: Overran park buffer: 6358 / 6352

----
The other half of the time:
Running Phase 3
Compressing tables 2 and 3...
STDERR: *** Crashed! ***

STDERR: /home/jh/chia-blockchain/venv/bin/bladebit_cuda(_Z12CrashHandleri+0xaa)[0x55a6e54e3eda]

STDERR: /lib/x86_64-linux-gnu/libc.so.6(+0x42520)[0x7f425d036520]

STDERR: /lib/x86_64-linux-gnu/libc.so.6(+0x1afbba)[0x7f425d1a3bba]

STDERR: /home/jh/chia-blockchain/venv/bin/bladebit_cuda(_Z15WriteParkThreadP12WriteParkJob+0x218)[0x55a6e564f1f8]

STDERR: /home/jh/chia-blockchain/venv/bin/bladebit_cuda(_ZN10ThreadPool17FixedThreadRunnerEPv+0x52)[0x55a6e5659822]

STDERR: /home/jh/chia-blockchain/venv/bin/bladebit_cuda(_ZN6Thread17ThreadStarterUnixEPS_+0x80)[0x55a6e54e4f90]

STDERR: /lib/x86_64-linux-gnu/libc.so.6(+0x94b43)[0x7f425d088b43]

STDERR: /lib/x86_64-linux-gnu/libc.so.6(+0x126a00)[0x7f425d11aa00]

jayhohoho2019 commented 1 year ago

What should I do to get some movement on this bug? To plot a 14TB drive of c3 plots I've had to restart it dozens of times...

jayhohoho2019 commented 1 year ago

Please note I'm running this with CLI. No GUI involved.

jayhohoho2019 commented 1 year ago

Tried using bladebit directly rather than chia plotters:

bladebit -t '"$r"' -c '"$c"' -f '"$f"' -n '"$n"' -v -z '"$compress"' ramplot '"$dst"'

with -t 85 and -n 20.

It worked until n=12, and again crashed here: Running Phase 3 Compressing tables 2 and 3...

jayhohoho2019 commented 1 year ago

Tried the binary from bb 3.11-beta1 with the same arguments. Same result. Best case crashed when n=19, worst case n=3.

jayhohoho2019 commented 1 year ago

With 3.11-beta1, had an instance where it crashed working on n=1. Always at this stage: Running Phase 3 Compressing tables 2 and 3...

wjblanke commented 1 year ago

Harold this seems bad

STDERR: Overran park buffer: 6358 / 6352

harold-b commented 1 year ago

Park overrun can happen in some instances. We increased the size from the minimum and created tons of plots until it wasn't happening, but there's no guarantee that it cannot happen. Some park sizes chosen for certain levels might trigger it more than others. We crash on purpose when this happens since we don't know what memory might have been touched that shouldn't have been.

We might be able to increase the buffer size used for park writing. Then, if an overrun stayed within the bounds of the buffer allocated for the parks, we could discard the plot instead of crashing.
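
The idea in the paragraph above can be illustrated with a toy model. This is only a sketch and not bladebit's actual code: `write_park`, `ParkWriteResult`, and the 64 bytes of slack are hypothetical names and values chosen for illustration. The 6352-byte park size and 6358-byte write are taken from the error message in the log. The buffer is allocated larger than the park, so a small overrun is detected and reported (the plot would be discarded), while a write that exceeds the allocation itself is still treated as fatal.

```python
from dataclasses import dataclass


class ParkOverrunError(RuntimeError):
    """Raised when data would not even fit in the allocated buffer."""


@dataclass
class ParkWriteResult:
    fits: bool    # True when the data fit within the on-disk park size
    overrun: int  # bytes written past the park size (0 when it fits)


def write_park(buffer: bytearray, park_size: int, data: bytes) -> ParkWriteResult:
    """Copy data into buffer and report whether it exceeded park_size.

    buffer is assumed to be allocated with some slack beyond park_size
    (hypothetical design). Writes inside the slack are safe in memory but
    mean the plot must be thrown away.
    """
    if len(data) > len(buffer):
        # Beyond the allocation: nothing safe can be done, so fail hard.
        raise ParkOverrunError(
            f"Overran park buffer allocation: {len(data)} / {len(buffer)}"
        )
    buffer[: len(data)] = data
    overrun = max(0, len(data) - park_size)
    return ParkWriteResult(fits=(overrun == 0), overrun=overrun)


PARK_SIZE = 6352  # from the log line "Overran park buffer: 6358 / 6352"
SLACK = 64        # hypothetical extra bytes allocated per park

result = write_park(bytearray(PARK_SIZE + SLACK), PARK_SIZE, bytes(6358))
if not result.fits:
    print(f"park overran by {result.overrun} bytes; discard this plot and retry")
```

With this shape, the plotter could drop the affected plot and move on to the next one, rather than terminating the whole run.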

jayhohoho2019 commented 1 year ago

Is this specific to compressed plots? I have never had this issue with bb1 or bb2, on this very same plotting computer/harvester. But now I'm hitting it every few plots.

harold-b commented 1 year ago

Yes, each compression level has new park sizes which are different than the park sizes for uncompressed plots.

Even though it could happen with classic (uncompressed) plots, their park sizes are much more generous, so it almost never happens.

jayhohoho2019 commented 1 year ago

I've run this for c3 plots so far. It crashes way too often, sometimes on the very first plot, and at the latest on the 19th plot. It would take me years to finish replotting my farm unless this gets fixed :-)

jayhohoho2019 commented 1 year ago

It has nothing to do with the number of threads (-t) value, correct?

harold-b commented 1 year ago

Threads won't affect anything; park sizes are fixed. But as a workaround you might try a different compression level that might not trigger overruns.

jayhohoho2019 commented 1 year ago

Most of my remote harvesters are RPi4s, and C3 is what I was told is the "right" c level for them. Each RPi4 is hooked up to 600TB. I'm not sure if its CPU can handle a higher C level at this size.

jayhohoho2019 commented 1 year ago

I did a run for c5 n=21 and it completed without a crash. But I need c3 for most of my harvesters, which are RPi4s (about 6PiB). Actually, I need to plot about 600TB of c3 plots for one RPi4 first to make sure it can handle this many c3 plots, before I replot any more.

jayhohoho2019 commented 1 year ago

Doing another run for c5 n=23 (to fill an internal NVMe SSD), and it died at n=4. This time, however, there is no crash message and no tmp file in the destination directory. This is now on chia 2.0.1, but bladebit --version still shows 3.0.0.

Finished forward propagating table 4 in 38.77 seconds.
Forward propagating to table 5...
Pairing L/R groups...
Finished pairing L/R groups in 10.3440 seconds. Created 4294233685 pairs.
Average of 236.1003 pairs per group.
Computing Fx...

jayhohoho2019 commented 1 year ago

I've been plotting c5 with this and it crashes much less often than c3 but still does from time to time, much more so than plotting uncompressed plots using bb v1 or v2. Is this related to ramplot (cpu plot) only? Does it exist for cudaplot as well?

jayhohoho2019 commented 1 year ago

I don't have a gpu in my plotter. With bb3 ramplot, plotting time for c3 is about the same as with v1/v2, and about 30 seconds faster for c5. So for ramplot, bb3 doesn't really offer any performance improvements, but it offers the ability to plot c plots. Is that the correct understanding? And then with cudaplot, gpu plotting?

jayhohoho2019 commented 1 year ago

Here is some system info again. Let me know please what additional info you need.
Ubuntu 22.04.3 LTS (Server)
5.15.0-83-generic #92-Ubuntu SMP Mon Aug 14 09:30:42 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux
2x Xeon(R) Gold 6230R
512GB system RAM
Write buffer is 2x7 INTEL SSDPE2KE076T8 raided 0 with mdadm
NO GPU

jayhohoho2019 commented 1 year ago

Command to run bb (3.0.0):

cd ~/chia-blockchain && . ./activate && bladebit -t '"$r"' -c '"$c"' -f '"$f"' -n '"$n"' -v -w -z '"$compress"' ramplot '"$dst"'

r=90, z=5 (or 3). $dst is the INTEL SSDs, which are nowhere near full when the crashes happen.

jayhohoho2019 commented 1 year ago

Sometimes crash.log would contain the following:

bladebit(_Z12CrashHandleri+0xaa)[0x56368afbd91a]
/lib/x86_64-linux-gnu/libc.so.6(+0x42520)[0x7fc8e7999520]
/lib/x86_64-linux-gnu/libc.so.6(+0x1afbba)[0x7fc8e7b06bba]
bladebit(_Z15WriteParkThreadP12WriteParkJob+0x218)[0x56368b1290c8]
bladebit(_ZN10ThreadPool17FixedThreadRunnerEPv+0x52)[0x56368b1336f2]
bladebit(_ZN6Thread17ThreadStarterUnixEPS_+0x80)[0x56368afbe9d0]
/lib/x86_64-linux-gnu/libc.so.6(+0x94b43)[0x7fc8e79ebb43]
/lib/x86_64-linux-gnu/libc.so.6(+0x126a00)[0x7fc8e7a7da00]

Other times it would say park buffer overrun

harold-b commented 1 year ago

> I don't have a gpu in my plotter. With bb3 ramplot, plotting time for c3 is about the same as with v1/v2, and about 30 seconds faster for c5. So for ramplot, bb3 doesn't really offer any performance improvements, but it offers the ability to plot c plots. Is that the correct understanding? And then with cudaplot, gpu plotting?

That's correct. Ramplot is exactly the same, the only difference is compressed plot support.

jayhohoho2019 commented 1 year ago

Thanks. ramplot c3 or c5 crashes way more often, though, than uncompressed plots in v1/v2. Is there anything that could be done about it?

wjblanke commented 1 year ago

Harold, didn't we increase some of these buffers?

harold-b commented 1 year ago

Those were the slice buffers (which hold temporary data during plotting). The park buffers are fixed per plot file version. We'd have to generate new estimates and bump the file version to support new park sizes. Or we could allow the park size to be defined by the plot file itself, as long as it does not exceed the default uncompressed park sizes.

harold-b commented 1 year ago

@jayhohoho2019 I think the best workaround here is to run the bladebit CLI directly from a shell script that automatically retries and cleans up any unfinished plots when it exits with an error exit code. If you need help with this I can set you up w/ something for Linux.
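
A minimal sketch of such a retry wrapper follows, written in Python rather than as a shell script. It is not the script offered in this thread. The function names are hypothetical, and it assumes that unfinished plots are left as `*.tmp` files in the destination directory (as described in later comments) and that each invocation is asked for a single plot with `-n 1`, so a crash never forces a whole batch to restart.

```python
from __future__ import annotations

import subprocess
import sys
from pathlib import Path


def remove_unfinished_plots(dst: Path) -> int:
    """Delete leftover *.tmp files in dst; return how many were removed."""
    removed = 0
    for tmp in dst.glob("*.tmp"):
        if tmp.is_file():
            tmp.unlink()
            removed += 1
    return removed


def plot_until_done(cmd: list[str], dst: Path, wanted: int,
                    max_failures: int = 100) -> int:
    """Run cmd (expected to create ONE plot) until `wanted` runs succeed.

    After every non-zero exit, unfinished .tmp files in dst are removed
    and the command is run again. Returns the number of failed runs.
    """
    done = 0
    failures = 0
    while done < wanted:
        result = subprocess.run(cmd)
        if result.returncode == 0:
            done += 1
            continue
        failures += 1
        removed = remove_unfinished_plots(dst)
        print(f"run failed (exit {result.returncode}); removed {removed} "
              f".tmp file(s); {done}/{wanted} plots done", file=sys.stderr)
        if failures >= max_failures:
            raise RuntimeError(f"giving up after {failures} failed runs")
    return failures


# Example call (values in angle brackets and the path are placeholders):
# plot_until_done(
#     ["bladebit", "-t", "90", "-c", "<contract>", "-f", "<farmer key>",
#      "-n", "1", "-z", "3", "ramplot", "/mnt/plots"],
#     dst=Path("/mnt/plots"), wanted=20)
```

Note that the cleanup deletes every `.tmp` file in the destination directory, so this is only safe when no other plotter is writing to the same directory at the same time.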

jayhohoho2019 commented 1 year ago

> @jayhohoho2019 I think the best workaround here is to run bladebit CLI directly from a shell script to automatically retry and cleanup any unfinished plots when it exits with an error exit code. If you need help with this I can set you up w/ something for Linux

Yes that'd be nice. Thank you.

How long do you think it will take to get the park buffer sizes increased? If not long I can wait too. Thanks.

jayhohoho2019 commented 1 year ago

FYI, I have been replotting c5 using ramplot since Sept 24th with a little script that resumes automatically. Here is a list of timestamps when bb crashed (and left an empty .tmp file); the later ones were with bb 3.1.0:

Sep 24 00:18
Sep 24 21:17
Sep 25 07:41
Sep 25 18:00
Sep 25 18:09
Sep 26 14:06
Sep 27 14:13
Sep 28 07:38
Sep 28 10:43
Sep 28 22:34
Sep 28 22:53
Sep 29 03:50
Sep 30 06:51
Oct 1 04:00
Oct 1 05:34
Oct 1 16:56
Oct 1 23:44
Oct 2 06:10
Oct 2 09:50
Oct 2 22:59
Oct 3 22:24
Oct 7 10:09
Oct 7 11:21
Oct 8 00:17
Oct 8 03:24
Oct 8 06:37
Oct 9 06:35
Oct 9 21:44
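
To put a number on how often this happens, the timestamps above can be turned into gaps between consecutive crashes. The sketch below only does arithmetic on the listed values. The year 2023 is an assumption (the list gives no year), and `crash_gaps_hours` is a hypothetical helper name.

```python
from datetime import datetime

# Crash timestamps copied from the comment above, in the order given.
CRASH_TIMES = """
Sep 24 00:18
Sep 24 21:17
Sep 25 07:41
Sep 25 18:00
Sep 25 18:09
Sep 26 14:06
Sep 27 14:13
Sep 28 07:38
Sep 28 10:43
Sep 28 22:34
Sep 28 22:53
Sep 29 03:50
Sep 30 06:51
Oct 1 04:00
Oct 1 05:34
Oct 1 16:56
Oct 1 23:44
Oct 2 06:10
Oct 2 09:50
Oct 2 22:59
Oct 3 22:24
Oct 7 10:09
Oct 7 11:21
Oct 8 00:17
Oct 8 03:24
Oct 8 06:37
Oct 9 06:35
Oct 9 21:44
""".strip().splitlines()


def crash_gaps_hours(stamps, year=2023):
    """Return the gaps, in hours, between consecutive timestamps.

    year is an assumption: the original list does not include one.
    """
    times = [datetime.strptime(f"{year} {s}", "%Y %b %d %H:%M") for s in stamps]
    return [(b - a).total_seconds() / 3600 for a, b in zip(times, times[1:])]


gaps = crash_gaps_hours(CRASH_TIMES)
mean_gap = sum(gaps) / len(gaps)
print(f"{len(CRASH_TIMES)} crashes, mean gap {mean_gap:.1f} h, "
      f"shortest {min(gaps) * 60:.0f} min, longest {max(gaps):.1f} h")
```

Gaps between crashes are not the same as plots per crash, since the plotting time per plot is not part of this list.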

jayhohoho2019 commented 1 year ago

I connected a gpu to my plotter and am doing cudaplot now instead. It hasn't crashed yet. Do ramplot and cudaplot handle the park buffer differently?

wjblanke commented 11 months ago

Harold, are these handled differently?

harold-b commented 11 months ago

No, they shouldn't be. But I can have a look.