Chia-Network / bladebit

A high-performance k32-only, Chia (XCH) plotter supporting in-RAM and disk-based plotting
Apache License 2.0

Bladebit v2.0.0 failing at Sorting F7 & Writing C Tables #241

Closed yang1782 closed 1 year ago

yang1782 commented 1 year ago

.\bladebit.exe -t 31 -n 81 -f -c diskplot -t1 D:\Plotter -b 64 --cache 42G F:\Plots

Windows 10. AMD 5950X. 64 GB RAM. 2x Samsung 980 PRO 500GB RAID 0. Running BB in PowerShell. This is happening non-stop.

harold-b commented 1 year ago

Please check your Windows Event Viewer for errors. Windows tends to silently kill the app when it runs out of memory or disk space.

yang1782 commented 1 year ago

"Please check your Windows Event Viewer for errors. Windows tends to silently kill the app when it runs out of memory or disk space."

I lowered -b to 32G and the app still stopped. I have 1 TB of disk space from the RAID 0 array, so that should be enough. I saw the error below in Event Viewer. Not sure what it means, though.

Faulting application name: bladebit.exe, version: 0.0.0.0, time stamp: 0x6364350c
Faulting module name: bladebit.exe, version: 0.0.0.0, time stamp: 0x6364350c
Exception code: 0xc0000005
Fault offset: 0x0000000000008053
Faulting process id: 0x4108
Faulting application start time: 0x01d8efe6edd84acd
Faulting application path: C:\Users\yang1\OneDrive\Desktop\bladebit-v2.0.0-windows-x86-64\bladebit.exe
Faulting module path: C:\Users\yang1\OneDrive\Desktop\bladebit-v2.0.0-windows-x86-64\bladebit.exe
Report Id: 41d1bb06-21f7-4ac0-9e35-c94abe1081df
Faulting package full name:
Faulting package-relative application ID:

ageorge95 commented 1 year ago

I have the exact same issue :(

yang1782 commented 1 year ago

"I have the exact same issue :("

I'm almost certain there's something wrong w/ the Windows version.

FuzeGuy commented 1 year ago

Threadripper 3955WX. Since the rc for bladebit 2.0, this exact error happens no matter what parameters I use in the PowerShell CLI to run it. If I go back to an earlier build, the error occurs rarely, seldom enough that I have been able to easily do 100+ TB with that earlier version.

Bottom line... it's a continuing and constant fail. Doesn't software get tested with Windows ... at all ???!!! Frustrating!

yang1782 commented 1 year ago

I was able to get past Sorting F7 & Writing C Tables successfully using v2.0.0-beta1 while having the exact same command. Something is likely wrong w/ the windows version of v2.0.0.

parmenx commented 1 year ago

I am having the same problem. What should I do? Why does the problem occur? I used bladebit in the Chia GUI.

ageorge95 commented 1 year ago

They were saying in their last AMA that their whole software release mechanism was overhauled and is much different than in the beginning, yet they release a broken plotter on Windows ....

This issue should have been easily identified by a really simple test: had someone tried to create one plot on Windows, the issue would have popped up.

Yeah, I am sticking with MadMax until they get their shit together.

(they = Chia & Co)

FuzeGuy commented 1 year ago

"Yeah, I am sticking with MadMax until they get their shit together."

Nah...I'm still going to use the beta release..... it even worked (most) of the time! And 33% plot time cut is too hard to give up.

yang1782 commented 1 year ago

"Yeah, I am sticking with MadMax until they get their shit together."

"Nah...I'm still going to use the beta release..... it even worked (most) of the time! And 33% plot time cut is too hard to give up."

I've been testing the beta release since yesterday and I'm not getting much of an improvement oddly. Results are about the same as madmax. I'm assuming my 64G RAM isn't enough.

harold-b commented 1 year ago

Apologies for the issue. I am looking into it and will update here with the fix.

otterslide commented 1 year ago

"Apologies for the issue. I am looking into it and will update here with the fix."

Same problem here, switched to old version, works ok.

Tim572 commented 1 year ago

https://github.com/Chia-Network/bladebit/issues/231

Drhicom commented 1 year ago

Right now no matter how I change the setting it dies after Sorting F7, but switch back to beta 1, still cooking.

Maybe there should have been a RC2 and RC3... You can't rush a good painting...

harold-b commented 1 year ago

Thank you for your patience guys. I've identified the issue and committed a fix, pending more testing for a minor release. If you'd like to try or test it you can use the latest CI artifacts here: https://github.com/Chia-Network/bladebit/actions/runs/3398805060

Or if you'd like to build it yourself, use the develop branch.

Jacek-ghub commented 1 year ago

"Faulting application name: bladebit.exe, version: 0.0.0.0, time stamp: 0x6364350c Faulting module name: bladebit.exe, version: 0.0.0.0, time stamp: 0x6364350c Exception code: 0xc0000005"

Well, that actually tells us quite a bit. "code: 0xc0000005" means that the program is trying to read from or write to memory that it doesn't own. This is bad code at play. "Exception" means that the program raised an exception (bad things happened), and the fact that it quietly aborted means that the code doesn't bother to catch any exceptions, if for nothing else than to print out some debugging info. (There is not a single try/catch in the codebase.)

"Faulting application name: bladebit.exe, version: 0.0.0.0": that version number means that the code doesn't carry much versioning info, or at least the app version is not there. Looking at the binary (right click / details) shows that actually nothing is set: no version, no app name, no company name. (The binary doesn't even have the Chia icon.) Looking at the code, the versioning part is just missing. This is not a big issue, rather just a sloppy way of producing binaries, especially for an established company (it takes 10 mins to fix).

Looking at what was suggested above, "Windows tends to silently kill the app when it runs out of memory or disk space", there are two problems with that statement. The first is that, as explained, an exception was triggered and there is no code to catch it, thus the silent exit. The second ("it runs out of memory or disk space") is just a false statement, as the program was not short of memory; it just didn't bother to either properly allocate the required memory or stay inside the allocated region, or didn't bother to check whether after asking for memory it really got it. The code was given an opportunity to react to that and print some helpful info (by capturing that exception), but it didn't do that. Even if we assume that the system was low on those two resources (which is obviously not the case, as all of us see this exact problem in exactly the same place, and all our setups are different), it is the programmer's responsibility to check whether resources are available and, if not, to exit gracefully (if the program cannot recover) stating what was missing, to help the end user potentially resolve such problems.

By the way, that exception (05) is about any system resources (e.g., disk, registry, ...). However, it is safe to assume that the code was trying to access memory that it didn't own.
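
For reference, the exception codes in these Event Viewer records are NTSTATUS values. Below is a minimal sketch of decoding one, assuming the standard NTSTATUS bit layout (severity in the top two bits, a 12-bit facility, a 16-bit code). The lookup table is deliberately limited to the two codes that appear in this thread, and the function name is illustrative only. Note that the code identifies the kind of fault, not its root cause.

```python
# Names for the two exception codes quoted in this thread (as defined in ntstatus.h).
KNOWN_NTSTATUS = {
    0xC0000005: "STATUS_ACCESS_VIOLATION",
    0xC0000374: "STATUS_HEAP_CORRUPTION",
}

# The top two bits of an NTSTATUS value encode its severity.
SEVERITY = {0: "success", 1: "informational", 2: "warning", 3: "error"}


def decode_ntstatus(code: int) -> dict:
    """Split a 32-bit NTSTATUS value into its fields (hypothetical helper)."""
    return {
        "name": KNOWN_NTSTATUS.get(code, "unknown"),
        "severity": SEVERITY[(code >> 30) & 0x3],
        "customer": bool((code >> 29) & 0x1),  # set for non-Microsoft codes
        "facility": (code >> 16) & 0xFFF,
        "code": code & 0xFFFF,
    }


if __name__ == "__main__":
    for value in (0xC0000005, 0xC0000374):
        print(hex(value), decode_ntstatus(value))
```

The low 16 bits of 0xc0000005 are the "05" mentioned above; the leading 0xC marks it as an error-severity status.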

harold-b commented 1 year ago

@Jacek-ghub I think you mean well, but it would be great if you were to take a less antagonistic approach, and instead make arguments in good faith and without unfounded presuppositions.

"Well, that actually tells us quite a bit. 'code: 0xc0000005' means that the program is trying to read from or write to memory that it doesn't own. This is bad code at play."

This is not always the case; it can happen when there are issues with the RAM itself, or with a target drive, which has already happened several times w/ users. Since it's been asked several times, naturally I leaned towards that side.

Indeed in this case, however, it was a segfault. You can have a look at the latest commits in develop to see if you can spot the cause. I left a hint in the commit message.

Indeed, I disable exceptions at the compiler level; you'll find this same method used in many high-performance code bases, especially those of game engines. But I won't get into the reasoning in this thread.

In lieu of that I use a crash handler, as the Linux build has. There is one planned for Windows, but I've not yet had time to add it.

"or didn't bother to check whether after asking for memory it really got it."

I'll have to playfully quote you here, "just a false statement". You can have a look at the memory allocation code yourself and you'll see why that is.

I'd like to respectfully ask you to refrain from further using these threads to start such antagonistic arguments. If you'd like to help, then please submit a PR instead of taking this approach. Or if you'd like to yell at me, feel free to do so on Keybase

Jacek-ghub commented 1 year ago

I really respect what you did with BB, so that was not meant to discredit that part; sorry if it sounded like that.

However, I cannot say the same when a binary is released and it looks like not a single Windows run was done to test it. That could be understood for rc2, ..., but this was the production version for v2.0.

harold-b commented 1 year ago

Indeed it was a mistake on my part when I back-ported the in-ram plotter into the v2 code. I missed a file.

We had just done very thorough testing across all platforms on the diskplotter side just before that merge. So when I merged in the ramplot command, since it did not touch any diskplotter code, I incorrectly assumed we could get away with testing mainly the ramplot command and its output plots, and just ensure diskplotter got in through a few tables. That was my miscalculation. I appreciate, and share your high-bar expectancy for releases, so I'll certainly endeavor toward things like this not happening again.

Jacek-ghub commented 1 year ago

That is the reason to have QA not reporting to Eng, as this way, they don't care what we say. They are held responsible for bad releases, so the test coverage is always broad and consistent (by running their test scripts).

So, I do feel sorry for you being responsible for releases, as that is a headache that no dev wants to have on his/her head (and, in my opinion, should not have).

ageorge95 commented 1 year ago

"Thank you for your patience guys. I've identified the issue and committed a fix, pending more testing for a minor release. If you'd like to try or test it you can use the latest CI artifacts here: https://github.com/Chia-Network/bladebit/actions/runs/3398805060"

"Or if you'd like to build it yourself, use the develop branch."

Hi,

I confirm that I can successfully create a plot on windows with develop @ 7b5cb4ac5d73526996595120cd9a25e17a03bed9

Thanks for the quick fix, I was not expecting it so soon 👍

spleen911 commented 1 year ago

I have not been able to run any build successfully on Windows Server 2019. Just running "memtest -s 1GB" on alpha1, alpha2, beta1, v2 and the above artifact generates a fault in ntdll.dll either exception 0xc0000005 or 0xc0000374. The system is a DELL R620 (dual Xeon E5-2697 v2 with 256GB ECC memory). Tried run as admin, Windows Defender off, Compatibility mode enabled but no luck.

Faulting application name: bladebit.exe, version: 0.0.0.0, time stamp: 0x6364350c
Faulting module name: ntdll.dll, version: 10.0.17763.3532, time stamp: 0xbe72b56e
Exception code: 0xc0000374
Fault offset: 0x00000000000fc1b9
Faulting process id: 0x23fc
Faulting application start time: 0x01d8f14f72c014cc
Faulting application path: D:\Plotter\bladebit.exe
Faulting module path: C:\windows\SYSTEM32\ntdll.dll

Faulting application name: bladebit.exe, version: 0.0.0.0, time stamp: 0x6364350c
Faulting module name: ntdll.dll, version: 10.0.17763.3532, time stamp: 0xbe72b56e
Exception code: 0xc0000005
Fault offset: 0x0000000000012656
Faulting process id: 0x28a8
Faulting application start time: 0x01d8f14f5c2c5e90
Faulting application path: D:\Plotter\bladebit.exe
Faulting module path: C:\windows\SYSTEM32\ntdll.dll

I'm still plotting with madmax and a 100GiB RAM disk 👎
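
To compare crash records like the two above side by side, a small parser can help. This is a sketch only: it assumes the records are pasted as plain `Key: value` lines separated by blank lines, as in this comment, and the function names are made up for illustration.

```python
# Two abbreviated records copied from the Event Viewer output above.
SAMPLE = r"""
Faulting application name: bladebit.exe, version: 0.0.0.0, time stamp: 0x6364350c
Faulting module name: ntdll.dll, version: 10.0.17763.3532, time stamp: 0xbe72b56e
Exception code: 0xc0000374
Fault offset: 0x00000000000fc1b9

Faulting application name: bladebit.exe, version: 0.0.0.0, time stamp: 0x6364350c
Faulting module name: ntdll.dll, version: 10.0.17763.3532, time stamp: 0xbe72b56e
Exception code: 0xc0000005
Fault offset: 0x0000000000012656
"""


def parse_fault_records(text):
    """Split blank-line-separated 'Key: value' blocks into a list of dicts."""
    records, current = [], {}
    for line in text.splitlines():
        line = line.strip()
        if not line:
            # A blank line closes the record being built, if any.
            if current:
                records.append(current)
                current = {}
            continue
        key, sep, value = line.partition(": ")
        if sep:
            current[key] = value
    if current:
        records.append(current)
    return records


def summarize_fault(record):
    """Return (faulting module, exception code as int) for one record."""
    module = record["Faulting module name"].split(",")[0]
    return module, int(record["Exception code"], 16)


if __name__ == "__main__":
    for rec in parse_fault_records(SAMPLE):
        module, code = summarize_fault(rec)
        print(f"{module}: {code:#010x}")
```

Here both records name ntdll.dll as the faulting module, whereas the record at the top of the thread names bladebit.exe itself, which is one hint that the two reports may not share a cause.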

harold-b commented 1 year ago

@spleen911 Try doing a regular diskplot run and paste the bladebit output (regardless of the crash)

spleen911 commented 1 year ago

@harold-b I was on the Digital Spaceport live stream and your comments helped.

Prior to your reply, I tried disabling hyper threading in the BIOS (DELL lists it as “logical processors”) and that bypassed the ntdll.dll faults. Very odd as I’m able to run BB on other single CPU, multi-core with HT (i7-4790 and i7-11700 both WIN10) without issue.

Using beta1, I was able to complete some plots. 1 of 4 fail during memory allocation which is what happened every time with hyper threading enabled in the BIOS:

[Bladebit Disk Plotter]
 Heap size      : 3.37 GiB ( 3447.82 MiB )
 Cache size     : 100.00 GiB ( 102400.00 MiB )
 Bucket count   : 256
 Alternating I/O: true
 F1  threads    : 24
 FP  threads    : 24
 C   threads    : 24
 P2  threads    : 24
 P3  threads    : 24
 I/O threads    : 1
 Temp1 block sz : 4096
 Temp2 block sz : 4096
 Temp1 path     : F:\ChiaTemp\
 Temp2 path     : F:\ChiaTemp\
 I/O metrices enabled.
 Allocating memory

Faulting application name: bad_module_info, version: 0.0.0.0, time stamp: 0x00000000
Faulting module name: unknown, version: 0.0.0.0, time stamp: 0x00000000
Exception code: 0xc0000005
Fault offset: 0x0000001700000016
Faulting process id: 0x2218
Faulting application start time: 0x01d8f17bccd4de5d
Faulting application path: bad_module_info

At this point I went back into the BIOS to re-enable logical processors, in order to reproduce the faults and provide console output. While in the BIOS I also enabled the Node Interleaving memory setting (for NUMA systems running matched memory). I left the memory mode at "Optimized"; however, I plan to test "Advanced ECC" later.

Oddly, but thankfully, I can now run memtest using all builds and a full plot with v2 "fix" (artifact from https://github.com/Chia-Network/bladebit/actions/runs/3398805060). I am trying out different diskplot settings. Last one was -t 40 --no-cpu-affinity -f -c -w -v diskplot -b 64 --cache 32G --no-t2-direct -t1 NVMe and -t2 RAM (192GB), but I/O wait is really high after table 1; however, it still finished under 47 minutes.

Bladebit Chia Plotter
Version      : 2.0.0
Git Commit   : 7b5cb4ac5d73526996595120cd9a25e17a03bed9
Compiled With: msvc 19.29.30146

[Global Plotting Config]
 Will create 1 plots.
 Thread count          : 40
 Warm start enabled    : true
 NUMA disabled         : false
 CPU affinity disabled : true
 Farmer public key     : <redacted>
 Pool contract address : <redacted>
 Output path           : D:\MoveToOtherMiner\

[Bladebit Disk Plotter]
 Heap size      : 11.56 GiB ( 11840.27 MiB )
 Cache size     : 32.00 GiB ( 32768.00 MiB )
 Bucket count   : 64
 Alternating I/O: false
 F1  threads    : 40
 FP  threads    : 40
 C   threads    : 40
 P2  threads    : 40
 P3  threads    : 40
 I/O threads    : 1
 Temp1 block sz : 4096
 Temp2 block sz : 4096
 Temp1 path     : F:\ChiaTemp\
 Temp2 path     : R:\
 I/O metrices enabled.
 Allocating memory
Warm start: Pre-faulting memory pages...
Memory initialized.

Generating plot 1 / 1: 16984624c3759770606d3daccd56d6e8bdce9417b27adb172c09210058b6a249

Started plot.
Running Phase 1
Table 1: F1 generation
Generating f1...
Finished f1 generation in 39.40 seconds.
Table 1 I/O wait time: 0.00 seconds.
 Table 1 Disk Write Metrics:
  Average write throughput 833.79 MiB ( 874.29 MB ) or 0.81 GiB ( 0.87 GB ).
  Total size written: 32799.77 MiB ( 34393.05 MB ) or 32.03 GiB ( 34.39 GB ).
  Total write commands: 129.

Table 2
 Sorting      : Completed in 30.03 seconds.
 Distribution : Completed in 88.56 seconds.
 Matching     : Completed in 15.72 seconds.
 Fx           : Completed in 19.39 seconds.
Completed table 2 in 162.08 seconds with 4294891266 entries.
Table 2 I/O wait time: 90.61 seconds.
 Table 2 I/O Metrics:
  Average read throughput 1686.13 MiB ( 1768.04 MB ) or 1.65 GiB ( 1.77 GB ).
  Total size read: 32799.77 MiB ( 34393.05 MB ) or 32.03 GiB ( 34.39 GB ).
  Total read commands: 8192.
  Average write throughput 704.79 MiB ( 739.03 MB ) or 0.69 GiB ( 0.74 GB ).
  Total size written: 100398.10 MiB ( 105275.04 MB ) or 98.05 GiB ( 105.28 GB ).
  Total write commands: 321.

Table 3
 Sorting      : Completed in 41.59 seconds.
 Distribution : Completed in 129.74 seconds.
 Matching     : Completed in 15.81 seconds.
 Fx           : Completed in 20.82 seconds.
Completed table 3 in 281.37 seconds with 4294698884 entries.
Table 3 I/O wait time: 127.70 seconds.
 Table 3 I/O Metrics:
  Average read throughput 1315.53 MiB ( 1379.44 MB ) or 1.28 GiB ( 1.38 GB ).
  Total size read: 65582.42 MiB ( 68768.15 MB ) or 64.05 GiB ( 68.77 GB ).
  Total read commands: 12288.
  Average write throughput 632.75 MiB ( 663.49 MB ) or 0.62 GiB ( 0.66 GB ).
  Total size written: 146471.66 MiB ( 153586.67 MB ) or 143.04 GiB ( 153.59 GB ).
  Total write commands: 4354.

Table 4
 Sorting      : Completed in 41.78 seconds.
 Distribution : Completed in 98.66 seconds.
 Matching     : Completed in 16.12 seconds.
 Fx           : Completed in 45.28 seconds.
Completed table 4 in 312.00 seconds with 4294408082 entries.
Table 4 I/O wait time: 97.54 seconds.
 Table 4 I/O Metrics:
  Average read throughput 1131.82 MiB ( 1186.80 MB ) or 1.11 GiB ( 1.19 GB ).
  Total size read: 98345.34 MiB ( 103122.56 MB ) or 96.04 GiB ( 103.12 GB ).
  Total read commands: 12288.
  Average write throughput 650.74 MiB ( 682.35 MB ) or 0.64 GiB ( 0.68 GB ).
  Total size written: 146462.47 MiB ( 153577.03 MB ) or 143.03 GiB ( 153.58 GB ).
  Total write commands: 4354.

Table 5
 Sorting      : Completed in 41.29 seconds.
 Distribution : Completed in 98.78 seconds.
 Matching     : Completed in 16.09 seconds.
 Fx           : Completed in 45.32 seconds.
Completed table 5 in 311.99 seconds with 4293824452 entries.
Table 5 I/O wait time: 97.66 seconds.
 Table 5 I/O Metrics:
  Average read throughput 1131.80 MiB ( 1186.78 MB ) or 1.11 GiB ( 1.19 GB ).
  Total size read: 98338.73 MiB ( 103115.63 MB ) or 96.03 GiB ( 103.12 GB ).
  Total read commands: 12288.
  Average write throughput 650.67 MiB ( 682.28 MB ) or 0.64 GiB ( 0.68 GB ).
  Total size written: 146444.65 MiB ( 153558.35 MB ) or 143.01 GiB ( 153.56 GB ).
  Total write commands: 4354.

Table 6
 Sorting      : Completed in 38.52 seconds.
 Distribution : Completed in 67.60 seconds.
 Matching     : Completed in 16.29 seconds.
 Fx           : Completed in 20.72 seconds.
Completed table 6 in 258.88 seconds with 4292668351 entries.
Table 6 I/O wait time: 71.12 seconds.
 Table 6 I/O Metrics:
  Average read throughput 1114.68 MiB ( 1168.83 MB ) or 1.09 GiB ( 1.17 GB ).
  Total size read: 98325.42 MiB ( 103101.67 MB ) or 96.02 GiB ( 103.10 GB ).
  Total read commands: 12288.
  Average write throughput 666.08 MiB ( 698.43 MB ) or 0.65 GiB ( 0.70 GB ).
  Total size written: 113658.75 MiB ( 119179.84 MB ) or 110.99 GiB ( 119.18 GB ).
  Total write commands: 4354.

Table 7
 Sorting      : Completed in 32.83 seconds.
 Distribution : Completed in 26.80 seconds.
 Matching     : Completed in 15.92 seconds.
 Fx           : Completed in 20.38 seconds.
Completed table 7 in 180.81 seconds with 4290327812 entries.
Table 7 I/O wait time: 32.15 seconds.
 Table 7 I/O Metrics:
  Average read throughput 1297.78 MiB ( 1360.82 MB ) or 1.27 GiB ( 1.36 GB ).
  Total size read: 65548.52 MiB ( 68732.60 MB ) or 64.01 GiB ( 68.73 GB ).
  Total read commands: 12288.
  Average write throughput 627.00 MiB ( 657.46 MB ) or 0.61 GiB ( 0.66 GB ).
  Total size written: 80856.51 MiB ( 84784.19 MB ) or 78.96 GiB ( 84.78 GB ).
  Total write commands: 4290.

Sorting F7 & Writing C Tables
Completed F7 tables in 75.71 seconds.
F7/C Tables I/O wait time: 53.94 seconds.
Finished Phase 1 in 1623.24 seconds ( 27.1 minutes ).
Running Phase 2
Finished marking table 6 in 9.29 seconds.
Table 6 I/O wait time: 0.00 seconds.
Finished marking table 5 in 20.73 seconds.
Table 5 I/O wait time: 0.00 seconds.
Finished marking table 4 in 20.84 seconds.
Table 4 I/O wait time: 0.00 seconds.
Finished marking table 3 in 20.93 seconds.
Table 3 I/O wait time: 0.00 seconds.
Finished marking table 2 in 20.84 seconds.
Table 2 I/O wait time: 0.00 seconds.
 Phase 2 Total I/O wait time: 0.00 seconds.
Finished Phase 2 in 93.34 seconds ( 1.6 minutes ).
Running Phase 3
Compressing tables 1 and 2.
Step 1 Allocated 9830.69 / 11840.27 MiB
Step 2 using 6.80 / 11.56 GiB.
Table 1 now has 3429168656 / 4294891266 ( 79.84% ) entries.
Table 1 I/O wait time: 64.84 seconds.
Finished compressing tables 1 and 2 in 140.49 seconds.
Compressing tables 2 and 3.
Step 1 Allocated 11840.27 / 11840.27 MiB
Step 2 using 6.75 / 11.56 GiB.
Table 2 now has 3439432739 / 4294698884 ( 80.09% ) entries.
Table 2 I/O wait time: 81.96 seconds.
Finished compressing tables 2 and 3 in 165.91 seconds.
Compressing tables 3 and 4.
Step 1 Allocated 11840.27 / 11840.27 MiB
Step 2 using 6.75 / 11.56 GiB.
Table 3 now has 3465269713 / 4294408082 ( 80.69% ) entries.
Table 3 I/O wait time: 83.09 seconds.
Finished compressing tables 3 and 4 in 167.32 seconds.
Compressing tables 4 and 5.
Step 1 Allocated 11840.27 / 11840.27 MiB
Step 2 using 6.75 / 11.56 GiB.
Table 4 now has 3531471871 / 4293824452 ( 82.25% ) entries.
Table 4 I/O wait time: 82.87 seconds.
Finished compressing tables 4 and 5 in 168.14 seconds.
Compressing tables 5 and 6.
Step 1 Allocated 11840.27 / 11840.27 MiB
Step 2 using 6.75 / 11.56 GiB.
Table 5 now has 3711060384 / 4292668351 ( 86.45% ) entries.
Table 5 I/O wait time: 87.84 seconds.
Finished compressing tables 5 and 6 in 176.00 seconds.
Compressing tables 6 and 7.
Step 1 Allocated 11840.27 / 11840.27 MiB
Step 2 using 6.77 / 11.56 GiB.
Table 6 now has 4290327812 / 4290327812 ( 100.00% ) entries.
Table 6 I/O wait time: 113.05 seconds.
Finished compressing tables 6 and 7 in 198.77 seconds.
Writing P7 parks.
Finished writing P7 parks in 57.29 seconds.
P7 I/O wait time: 45.30 seconds
Finished Phase 3 in 1075.23 seconds ( 17.9 minutes ).
Total plot I/O wait time: 1165.87 seconds.
Waiting for plot file to complete pending writes...
Completed pending writes in 0.03 seconds.
Finished writing plot plot-k32-2022-11-05-23-03-16984624c3759770606d3daccd56d6e8bdce9417b27adb172c09210058b6a249.plot.tmp.
Final plot table pointers:
 Table 1:       1288815564 ( 0x000000004cd1c3cc )
 Table 2:       3242437614 ( 0x00000000c143abee )
 Table 3:         43665005 ( 0x00000000029a466d )
 Table 4:       1244887892 ( 0x000000004a337b54 )
 Table 5:       2715216404 ( 0x00000000a1d6ea14 )
 Table 6:        620596870 ( 0x0000000024fd8e86 )
 Table 7:        880661961 ( 0x00000000347dd5c9 )
 C 1    :              252 ( 0x00000000000000fc )
 C 2    :          1716388 ( 0x00000000001a30a4 )
 C 3    :          1716564 ( 0x00000000001a3154 )

Finished plotting in 2791.87 seconds ( 46.5 minutes ).
Renaming plot to 'D:\MoveToOtherMiner\plot-k32-2022-11-05-23-03-16984624c3759770606d3daccd56d6e8bdce9417b27adb172c09210058b6a249.plot'

For reference, I run MM with -r 24 -K 4 -u 512 -v 256 -t NVMe -2 RAM (110GB) and finish in 47-48 minutes.
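
To put the "I/O wait is really high" observation into numbers, the summary lines of the log above can be tabulated. A rough sketch follows; the regular expressions are written against the exact line shapes in this log, the function name is hypothetical, and it assumes the reported total I/O wait is comparable to wall-clock time, which the log itself does not state.

```python
import re

# Summary lines copied from the log above.
SAMPLE_LOG = """\
Finished Phase 1 in 1623.24 seconds ( 27.1 minutes ).
Finished Phase 2 in 93.34 seconds ( 1.6 minutes ).
Finished Phase 3 in 1075.23 seconds ( 17.9 minutes ).
Total plot I/O wait time: 1165.87 seconds.
Finished plotting in 2791.87 seconds ( 46.5 minutes ).
"""

PHASE_RE = re.compile(r"^Finished Phase (\d+) in ([\d.]+) seconds", re.M)
IO_WAIT_RE = re.compile(r"^Total plot I/O wait time: ([\d.]+) seconds", re.M)
TOTAL_RE = re.compile(r"^Finished plotting in ([\d.]+) seconds", re.M)


def summarize_plot_log(log):
    """Collect per-phase seconds and, if present, the I/O wait share."""
    summary = {"phases": {int(n): float(s) for n, s in PHASE_RE.findall(log)}}
    io_wait = IO_WAIT_RE.search(log)
    total = TOTAL_RE.search(log)
    if io_wait and total:
        io_seconds = float(io_wait.group(1))
        total_seconds = float(total.group(1))
        summary["io_wait_seconds"] = io_seconds
        summary["total_seconds"] = total_seconds
        # Only a rough indicator: see the assumption stated in the lead-in.
        summary["io_wait_fraction"] = io_seconds / total_seconds
    return summary


if __name__ == "__main__":
    result = summarize_plot_log(SAMPLE_LOG)
    for phase, seconds in sorted(result["phases"].items()):
        print(f"Phase {phase}: {seconds:.2f} s")
    print(f"I/O wait share: {result['io_wait_fraction']:.1%}")
```

Running the same function over logs from different `--cache` or thread settings gives a quick way to compare configurations.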

spleen911 commented 1 year ago

I am able to get ~43.3 minute (259X seconds) plot times with the following configuration 👍

--no-cpu-affinity -w -v diskplot -b 64 --no-t2-direct --cache 32G --f1-threads 24 --fp-threads 16 --c-threads 32 --p2-threads 48 --p3-threads 16

T1 = Samsung 970 EVO Plus
T2 = 192GB RAM PC3L-12800R
OUT = Perc H710P RAID0 8x6Gbs-SAS

harold-b commented 1 year ago

I am very glad you got it working. I am curious as to why the fails are happening with the other BIOS settings and if they are only happening on windows.

I will try to look into that when I have a chance as it is possible it is the same issue other users reported in windows. I appreciate the report and logs.

AlexGuo1998 commented 1 year ago

Hi. Same issue here, but on Linux (Ubuntu) with NTFS partitions.

Increasing the file limit from 1024 to 524288
Warning: Lowering thread count from 6 to 4, the native maximum.
Bladebit Chia Plotter
Version      : 2.0.0
Git Commit   : d64791880af89edebb6f1126c953d4d98b8007db
Compiled With: gcc 9.4.0

[Global Plotting Config]
 Will create 1 plots.
 Thread count          : 4
 Warm start enabled    : false
 NUMA disabled         : false
 CPU affinity disabled : false
 Farmer public key     : ...
 Pool contract address : ...
 Output path           : /media/alex/.../chia/plots/

[Bladebit Disk Plotter]
 Heap size      : 2.02 GiB ( 2070.62 MiB )
 Cache size     : 3.00 GiB ( 3072.00 MiB )
 Bucket count   : 512
 Alternating I/O: false
 F1  threads    : 4
 FP  threads    : 4
 C   threads    : 4
 P2  threads    : 4
 P3  threads    : 4
 I/O threads    : 1
 Temp1 block sz : 4096
 Temp2 block sz : 4096
 Temp1 path     : /media/alex/.../chia/temp/
 Temp2 path     : /media/alex/.../chia/temp/
 I/O metrices enabled.
 Allocating memory
WARNING: Forcing warm start for testing.
Warm start: Pre-faulting memory pages...
Memory initialized.

Generating plot 1 / 1: ...
Plot Memo: ...

Started plot.

...

Table 7
 Sorting      : Completed in 139.69 seconds.
 Distribution : Completed in 179.24 seconds.
 Matching     : Completed in 204.72 seconds.
 Fx           : Completed in 278.45 seconds.
Completed table 7 in 10537.88 seconds with 4284513739 entries.
Table 7 I/O wait time: 4688.68 seconds.
 Table 7 I/O Metrics:
  Average read throughput 9.34 MiB ( 9.80 MB ) or 0.01 GiB ( 0.01 GB ).
  Total size read: 68522.46 MiB ( 71851.01 MB ) or 66.92 GiB ( 71.85 GB ).
  Total read commands: 786432.
  Average write throughput 24.94 MiB ( 26.15 MB ) or 0.02 GiB ( 0.03 GB ).
  Total size written: 79713.53 MiB ( 83585.69 MB ) or 77.85 GiB ( 83.59 GB ).
  Total write commands: 263682.

Sorting F7 & Writing C Tables
*** Crashed! ***
./bladebit(_Z12CrashHandleri+0xaa)[0x55b9d64c011a]
/lib/x86_64-linux-gnu/libc.so.6(+0x42520)[0x7f3d81e42520]
./bladebit(_ZN16K32BoundedPhase114RunWithBucketsILj512EEEvv+0x9c0)[0x55b9d647ffd0]
./bladebit(_ZN11DiskPlotter4PlotERKNS_11PlotRequestE+0x1cc)[0x55b9d644cd6c]
./bladebit(main+0xdef)[0x55b9d644bc2f]
/lib/x86_64-linux-gnu/libc.so.6(+0x29d90)[0x7f3d81e29d90]
/lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0x80)[0x7f3d81e29e40]
./bladebit(_start+0x2e)[0x55b9d644c92e]
Dumping crash to crash.log

c++filt result:

$ c++filt < crash.log
./bladebit(CrashHandler(int)+0xaa)[0x55b9d64c011a]
/lib/x86_64-linux-gnu/libc.so.6(+0x42520)[0x7f3d81e42520]
./bladebit(void K32BoundedPhase1::RunWithBuckets<512u>()+0x9c0)[0x55b9d647ffd0]
./bladebit(DiskPlotter::Plot(DiskPlotter::PlotRequest const&)+0x1cc)[0x55b9d644cd6c]
./bladebit(main+0xdef)[0x55b9d644bc2f]
/lib/x86_64-linux-gnu/libc.so.6(+0x29d90)[0x7f3d81e29d90]
/lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0x80)[0x7f3d81e29e40]
./bladebit(_start+0x2e)[0x55b9d644c92e]
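
The raw (mangled) backtrace lines can also be picked apart programmatically, for example to find the first application frame below the crash handler. This is a sketch that assumes the glibc-style `module(symbol+offset)[address]` frame format seen in the crash output above; the function names and the default arguments are illustrative. It is meant for the mangled lines, since demangled names contain parentheses of their own.

```python
import re

# module(symbol+0xoffset)[0xaddress]; the symbol part may be empty.
FRAME_RE = re.compile(
    r"^(?P<module>[^(\[]+)"
    r"\((?P<symbol>[^+)]*)\+(?P<offset>0x[0-9a-f]+)\)"
    r"\[(?P<address>0x[0-9a-f]+)\]$"
)

# First five frames copied from the crash output above.
BACKTRACE = """\
./bladebit(_Z12CrashHandleri+0xaa)[0x55b9d64c011a]
/lib/x86_64-linux-gnu/libc.so.6(+0x42520)[0x7f3d81e42520]
./bladebit(_ZN16K32BoundedPhase114RunWithBucketsILj512EEEvv+0x9c0)[0x55b9d647ffd0]
./bladebit(_ZN11DiskPlotter4PlotERKNS_11PlotRequestE+0x1cc)[0x55b9d644cd6c]
./bladebit(main+0xdef)[0x55b9d644bc2f]
"""


def parse_frames(text):
    """Turn backtrace lines into dicts; lines that do not match are skipped."""
    frames = []
    for line in text.splitlines():
        match = FRAME_RE.match(line.strip())
        if match:
            frames.append({
                "module": match["module"],
                "symbol": match["symbol"] or None,
                "offset": int(match["offset"], 16),
                "address": int(match["address"], 16),
            })
    return frames


def first_app_frame(frames, app="./bladebit", handler="_Z12CrashHandleri"):
    """Return the first frame inside the application, ignoring the crash handler."""
    for frame in frames:
        if frame["module"] == app and frame["symbol"] != handler:
            return frame
    return None


if __name__ == "__main__":
    frame = first_app_frame(parse_frames(BACKTRACE))
    print(f"{frame['symbol']}+{frame['offset']:#x}")
```

The resulting `symbol+offset` pair is the form that gdb's `list *(...)` command accepts for locating the source line.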

"Thank you for your patience guys. I've identified the issue and committed a fix, pending more testing for a minor release. If you'd like to try or test it you can use the latest CI artifacts here: https://github.com/Chia-Network/bladebit/actions/runs/3398805060"

Trying dev builds now...

AlexGuo1998 commented 1 year ago

Some debugging information.

$ gdb bladebit
GNU gdb (Ubuntu 12.1-0ubuntu1~22.04) 12.1
Reading symbols from bladebit...
(gdb) list *(_ZN16K32BoundedPhase114RunWithBucketsILj512EEEvv+0x9c0)
0xc3fd0 is in K32BoundedPhase1::RunWithBuckets<512u>() (/home/runner/work/bladebit/bladebit/src/threading/MTJob.h:533).

https://github.com/Chia-Network/bladebit/blob/d64791880af89edebb6f1126c953d4d98b8007db/src/threading/MTJob.h#L525-L537

(gdb) list *(_ZN11DiskPlotter4PlotERKNS_11PlotRequestE+0x1cc)
0x90d6c is in DiskPlotter::Plot(DiskPlotter::PlotRequest const&) (/home/runner/work/bladebit/bladebit/src/plotdisk/DiskPlotter.cpp:174).

https://github.com/Chia-Network/bladebit/blob/d64791880af89edebb6f1126c953d4d98b8007db/src/plotdisk/DiskPlotter.cpp#L172-L179

wangzhenyue commented 1 year ago

Windows 10. I have the same problem at Sorting F7 & Writing C Tables.

spleen911 commented 1 year ago

@harold-b On a different DELL R620 I toggled HT processor and Node Interleaving memory settings:

HT=on, Node Interleave=off --> ntdll.dll fault
HT=off, Node Interleave=off --> success
HT=on, Node Interleave=on --> success

ageorge95 commented 1 year ago

The new Release fixes this reported issue (https://github.com/Chia-Network/bladebit/releases/tag/v2.0.1) - for me at least.

@yang1782 is the issue fixed for you as well ? If yes, then maybe this issue can be closed, if not, then it should remain open 🙂.

AlexGuo1998 commented 1 year ago

"Thank you for your patience guys. I've identified the issue and committed a fix, pending more testing for a minor release. If you'd like to try or test it you can use the latest CI artifacts here: https://github.com/Chia-Network/bladebit/actions/runs/3398805060"

"Trying dev builds now..."

No luck, exact same crash (_ZN16K32BoundedPhase114RunWithBucketsILj512EEEvv+0x9c0) with commit 9fac46aff0476e829d476412de18497a3a2f7ed8

However I can create plots with -b 256 without issue, while -b 512 will always crash. Is this an unsupported configuration?

Jacek-ghub commented 1 year ago

Same here (Win 11, v2.0.1), -b 512 crashes.

ageorge95 commented 1 year ago

@AlexGuo1998 , @Jacek-ghub , I think that is a different issue and should be created separately 🙂

(I am not using -b 512, I use -b 128, so luckily I am not affected by this)

Jacek-ghub commented 1 year ago

In my case, it failed at that Sorting F7 ..., so looks related.

harold-b commented 1 year ago

This should be a different issue. It would be good if new threads would be started with full logs and system specs and the CLI line used and I'll be happy to have a look.

512 is supported, but it's possible a regression happened somewhere. If you guys could do what I indicated above, it would help me gather info for attempting to reproduce.

Jacek-ghub commented 1 year ago

"512 is supported, but it's possible a regression happened somewhere. If you guys could do the indicated above it would help me gather info for attempting to reproduce."

Could you do just one run on Win instead? You can run instrumented code, and we don't have debug builds.