Chia-Network / bladebit

A high-performance k32-only, Chia (XCH) plotter supporting in-RAM and disk-based plotting
Apache License 2.0

Bladebit v2.0.1 diskplot crashed with -b 512 #249

Closed AlexGuo1998 closed 1 year ago

AlexGuo1998 commented 1 year ago

As mentioned in https://github.com/Chia-Network/bladebit/issues/241#issuecomment-1309212321, creating a new issue about this.


Bladebit v2.0.1 diskplot crashed with -b 512, while -b 256 works fine.

Command used: (keys masked with ...)

./bladebit \
    -v \
    -t 4 \
    -c ... -f ... \
    --show-memo \
    diskplot \
    --temp1 /TEMP/chia/temp/ \
    --cache 5120M \
    --p2-threads 4 \
    --p3-threads 4 \
    --buckets 512 \
    /OUT/chia/plots/

where TEMP and OUT are two separate NTFS partitions (on HDDs).

Logs:

Increasing the file limit from 1024 to 524288

Bladebit Chia Plotter
Version      : 2.0.1
Git Commit   : 9fac46aff0476e829d476412de18497a3a2f7ed8
Compiled With: gcc 9.4.0

[Global Plotting Config]
 Will create 1 plots.
 Thread count          : 4
 Warm start enabled    : false
 NUMA disabled         : false
 CPU affinity disabled : false
 Farmer public key     : ...
 Pool contract address : ...
 Output path           : /OUT/chia/plots/

[Bladebit Disk Plotter]
 Heap size      : 2.02 GiB ( 2070.62 MiB )
 Cache size     : 5.00 GiB ( 5120.00 MiB )
 Bucket count   : 512
 Alternating I/O: false
 F1  threads    : 4
 FP  threads    : 4
 C   threads    : 4
 P2  threads    : 4
 P3  threads    : 4
 I/O threads    : 1
 Temp1 block sz : 4096
 Temp2 block sz : 4096
 Temp1 path     : /TEMP/chia/temp/
 Temp2 path     : /TEMP/chia/temp/
 I/O metrices enabled.
 Allocating memory

Generating plot 1 / 1: ...
Plot Memo: ...

Started plot.
Running Phase 1
Table 1: F1 generation
Generating f1...
Finished f1 generation in 513.51 seconds.
Table 1 I/O wait time: 513.51 seconds.
 Table 1 Disk Write Metrics:
  Average write throughput 67.80 MiB ( 71.10 MB ) or 0.07 GiB ( 0.07 GB ).
  Total size written: 34811.99 MiB ( 36503.02 MB ) or 34.00 GiB ( 36.50 GB ).
  Total write commands: 1025.

Table 2
 Sorting      : Completed in 87.27 seconds.
 Distribution : Completed in 1082.87 seconds.
 Matching     : Completed in 222.28 seconds.
 Fx           : Completed in 269.17 seconds.
Completed table 2 in 4137.27 seconds with 4294917803 entries.
Table 2 I/O wait time: 2291.12 seconds.
 Table 2 I/O Metrics:
  Average read throughput 14.02 MiB ( 14.71 MB ) or 0.01 GiB ( 0.01 GB ).
  Total size read: 34811.99 MiB ( 36503.02 MB ) or 34.00 GiB ( 36.50 GB ).
  Total read commands: 524288.
  Average write throughput 63.55 MiB ( 66.63 MB ) or 0.06 GiB ( 0.07 GB ).
  Total size written: 101880.01 MiB ( 106828.94 MB ) or 99.49 GiB ( 106.83 GB ).
  Total write commands: 2561.

Table 3
 Sorting      : Completed in 147.27 seconds.
 Distribution : Completed in 1418.99 seconds.
 Matching     : Completed in 217.00 seconds.
 Fx           : Completed in 277.70 seconds.
Completed table 3 in 8924.94 seconds with 4294626680 entries.
Table 3 I/O wait time: 4100.47 seconds.
 Table 3 I/O Metrics:
  Average read throughput 13.18 MiB ( 13.82 MB ) or 0.01 GiB ( 0.01 GB ).
  Total size read: 68600.20 MiB ( 71932.53 MB ) or 66.99 GiB ( 71.93 GB ).
  Total read commands: 786432.
  Average write throughput 39.94 MiB ( 41.88 MB ) or 0.04 GiB ( 0.04 GB ).
  Total size written: 146413.54 MiB ( 153525.73 MB ) or 142.98 GiB ( 153.53 GB ).
  Total write commands: 264194.

Table 4
 Sorting      : Completed in 159.24 seconds.
 Distribution : Completed in 1457.74 seconds.
 Matching     : Completed in 213.91 seconds.
 Fx           : Completed in 285.65 seconds.
Completed table 4 in 11388.92 seconds with 4294116807 entries.
Table 4 I/O wait time: 4676.61 seconds.
 Table 4 I/O Metrics:
  Average read throughput 13.82 MiB ( 14.49 MB ) or 0.01 GiB ( 0.01 GB ).
  Total size read: 101359.21 MiB ( 106282.83 MB ) or 98.98 GiB ( 106.28 GB ).
  Total read commands: 786432.
  Average write throughput 36.50 MiB ( 38.27 MB ) or 0.04 GiB ( 0.04 GB ).
  Total size written: 146397.90 MiB ( 153509.32 MB ) or 142.97 GiB ( 153.51 GB ).
  Total write commands: 264194.

Table 5
 Sorting      : Completed in 158.89 seconds.
 Distribution : Completed in 1442.17 seconds.
 Matching     : Completed in 205.81 seconds.
 Fx           : Completed in 271.93 seconds.
Completed table 5 in 11881.09 seconds with 4293069208 entries.
Table 5 I/O wait time: 4695.00 seconds.
 Table 5 I/O Metrics:
  Average read throughput 13.79 MiB ( 14.45 MB ) or 0.01 GiB ( 0.01 GB ).
  Total size read: 101347.47 MiB ( 106270.53 MB ) or 98.97 GiB ( 106.27 GB ).
  Total read commands: 786432.
  Average write throughput 32.54 MiB ( 34.12 MB ) or 0.03 GiB ( 0.03 GB ).
  Total size written: 146366.36 MiB ( 153476.25 MB ) or 142.94 GiB ( 153.48 GB ).
  Total write commands: 264194.

Table 6
 Sorting      : Completed in 156.66 seconds.
 Distribution : Completed in 838.96 seconds.
 Matching     : Completed in 211.19 seconds.
 Fx           : Completed in 273.15 seconds.
Completed table 6 in 11297.47 seconds with 4290998304 entries.
Table 6 I/O wait time: 4005.24 seconds.
 Table 6 I/O Metrics:
  Average read throughput 13.76 MiB ( 14.43 MB ) or 0.01 GiB ( 0.01 GB ).
  Total size read: 101323.40 MiB ( 106245.28 MB ) or 98.95 GiB ( 106.25 GB ).
  Total read commands: 786432.
  Average write throughput 29.06 MiB ( 30.47 MB ) or 0.03 GiB ( 0.03 GB ).
  Total size written: 113567.91 MiB ( 119084.58 MB ) or 110.91 GiB ( 119.08 GB ).
  Total write commands: 264194.

Table 7
 Sorting      : Completed in 144.30 seconds.
 Distribution : Completed in 345.86 seconds.
 Matching     : Completed in 210.97 seconds.
 Fx           : Completed in 270.05 seconds.
Completed table 7 in 8819.13 seconds with 4286916689 entries.
Table 7 I/O wait time: 2989.92 seconds.
 Table 7 I/O Metrics:
  Average read throughput 13.26 MiB ( 13.90 MB ) or 0.01 GiB ( 0.01 GB ).
  Total size read: 68539.96 MiB ( 71869.36 MB ) or 66.93 GiB ( 71.87 GB ).
  Total read commands: 786432.
  Average write throughput 21.97 MiB ( 23.04 MB ) or 0.02 GiB ( 0.02 GB ).
  Total size written: 79749.10 MiB ( 83622.99 MB ) or 77.88 GiB ( 83.62 GB ).
  Total write commands: 263682.

Sorting F7 & Writing C Tables
*** Crashed! ***
./bladebit(_Z12CrashHandleri+0xaa)[0x56333d98a18a]
/lib/x86_64-linux-gnu/libc.so.6(+0x42520)[0x7f18cde42520]
./bladebit(_ZN16K32BoundedPhase114RunWithBucketsILj512EEEvv+0x9c0)[0x56333d94a040]
./bladebit(_ZN11DiskPlotter4PlotERKNS_11PlotRequestE+0x1cc)[0x56333d916d6c]
./bladebit(main+0xdef)[0x56333d915c2f]
/lib/x86_64-linux-gnu/libc.so.6(+0x29d90)[0x7f18cde29d90]
/lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0x80)[0x7f18cde29e40]
./bladebit(_start+0x2e)[0x56333d91692e]
Dumping crash to crash.log
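
The mangled frames in the backtrace above are easier to read once demangled. A minimal sketch, assuming a GCC or Clang toolchain (both provide abi::__cxa_demangle in <cxxabi.h>); the helper name Demangle is hypothetical and not part of bladebit:

```cpp
#include <cxxabi.h>   // abi::__cxa_demangle (Itanium C++ ABI, GCC / Clang)
#include <cstdlib>    // std::free
#include <string>

// Hypothetical helper: returns the demangled form of an Itanium-ABI symbol,
// or the input unchanged if demangling fails.
std::string Demangle(const std::string& mangled)
{
    int status = 0;
    // Passing nullptr for the buffer makes __cxa_demangle malloc() the result.
    char* raw = abi::__cxa_demangle(mangled.c_str(), nullptr, nullptr, &status);
    if (status != 0 || raw == nullptr)
        return mangled;

    std::string result(raw);
    std::free(raw);   // the caller owns the malloc'ed buffer
    return result;
}
```

Passing `_ZN16K32BoundedPhase114RunWithBucketsILj512EEEvv` from the trace to this helper shows that the crashing frame is the 512-bucket instantiation of K32BoundedPhase1::RunWithBuckets. The `c++filt` command-line tool does the same job without writing any code.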

Other information:

(Any other information needed?)

I can make a debug build to collect core dumps if needed; however, my configuration is quite slow.

Walrusbonzo commented 1 year ago

Similar issue here, but with default bucket size.

Bladebit Chia Plotter
Version      : 2.0.0
Git Commit   : d64791880af89edebb6f1126c953d4d98b8007db
Compiled With: msvc 19.29.30146

[Global Plotting Config]
 Will create 1 plots.
 Thread count          : 16
 Warm start enabled    : false
 NUMA disabled         : false
 CPU affinity disabled : false
 Farmer public key     : xxx
 Pool contract address : xxx
 Output path           : C:\Temp\

[Bladebit Disk Plotter]
 Heap size      : 3.37 GiB ( 3447.82 MiB )
 Cache size     : 4.00 GiB ( 4096.00 MiB )
 Bucket count   : 256
 Alternating I/O: false
 F1  threads    : 16
 FP  threads    : 16
 C   threads    : 16
 P2  threads    : 16
 P3  threads    : 16
 I/O threads    : 1
 Temp1 block sz : 4096
 Temp2 block sz : 4096
 Temp1 path     : D:\
 Temp2 path     : D:\
 I/O metrices enabled.
 Allocating memory
WARNING: Forcing warm start for testing.
Warm start: Pre-faulting memory pages...
Memory initialized.

............ ............ ............

Table 7
 Sorting      : Completed in 25.13 seconds.
 Distribution : Completed in 2.45 seconds.
 Matching     : Completed in 22.31 seconds.
 Fx           : Completed in 25.97 seconds.
Completed table 7 in 86.47 seconds with 4290248567 entries.
Table 7 I/O wait time: 0.09 seconds.
 Table 7 I/O Metrics:
  Average read throughput 1298.40 MiB ( 1361.47 MB ) or 1.27 GiB ( 1.36 GB ).
  Total size read: 66265.89 MiB ( 69484.82 MB ) or 64.71 GiB ( 69.48 GB ).
  Total read commands: 196608.
  Average write throughput 2369.90 MiB ( 2485.02 MB ) or 2.31 GiB ( 2.49 GB ).
  Total size written: 79287.82 MiB ( 83139.31 MB ) or 77.43 GiB ( 83.14 GB ).
  Total write commands: 66306.

Sorting F7 & Writing C Tables

pause Press any key to continue . . .

Bladebit just stops during "Sorting F7 & Writing C Tables"

Tried cache sizes of 4G, 16G, 20G and 24G, same issue each time.

Walrusbonzo commented 1 year ago

Fixed now, updated to 2.0.1

nufan1 commented 1 year ago

It is not fixed, you can see from the logs posted, version is 2.0.1

I am facing the same problem with the default bucket count. It is a random problem and does not happen every time.

Walrusbonzo commented 1 year ago

I just meant updating to 2.0.1 fixed the problem I was having.

harold-b commented 1 year ago

> It is not fixed, you can see from the logs posted, version is 2.0.1

It is fixed. Whatever the problem here is, if it is a bug, it is not in any way related to the previous issue in 2.0.0.

The initial question here would be whether the temp directory has enough space, as using 512 buckets requires more because of alignment requirements.

nufan1 commented 1 year ago

My setup is:
Temp1: dedicated NVMe, 1TB
--cache: 50GB (total of 64GB DDR4)
Bucket count: 256

I would say temp has sufficient space in my case.

Running with -n 12, it sometimes stops on the 3rd plot, sometimes on the 7th, and sometimes completes all 12. The problem is totally random for me. In the last 24h it happened 3 times. Version is 2.0.1.

AlexGuo1998 commented 1 year ago

> The initial question here would be if the temp directory has enough space as using 512 buckets will require more because of alignment requirements

I think 580GB (540GiB) must be enough? Will check with more free space later.
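
For reference, the GB-to-GiB figure above is a plain unit conversion (decimal gigabytes to binary gibibytes), the same one the plotter log applies to its MB/MiB pairs. A minimal sketch; the function and constant names are hypothetical:

```cpp
// Decimal (SI) and binary unit sizes, in bytes.
constexpr double kBytesPerGB  = 1000.0 * 1000.0 * 1000.0;  // 10^9
constexpr double kBytesPerGiB = 1024.0 * 1024.0 * 1024.0;  // 2^30

// Hypothetical helper: converts decimal gigabytes to binary gibibytes.
constexpr double GBToGiB(double gb)
{
    return gb * kBytesPerGB / kBytesPerGiB;
}

// 580 GB is a little over 540 GiB, matching the figure quoted above.
static_assert(GBToGiB(580.0) > 540.0 && GBToGiB(580.0) < 541.0,
              "580 GB should be roughly 540 GiB");
```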

nufan1 commented 1 year ago

Faulting application name: bladebit.exe, version: 0.0.0.0, time stamp: 0x636978c6
Faulting module name: bladebit.exe, version: 0.0.0.0, time stamp: 0x636978c6
Exception code: 0xc0000005
Fault offset: 0x0000000000146af0
Faulting process id: 0x35c0
Faulting application start time: 0x01d8f748e64ba8d6
Faulting application path: C:\BB\bladebit.exe
Faulting module path: C:\BB\bladebit.exe
Report Id: 518946e3-0a08-4fc1-8f37-325f9d8e95b7
Faulting package full name:
Faulting package-relative application ID:

AlexGuo1998 commented 1 year ago

After digging into the code I assume this is a bug.

https://github.com/Chia-Network/bladebit/blob/9fac46aff0476e829d476412de18497a3a2f7ed8/src/plotdisk/k32/CTableWriterBounded.h#L305-L306

https://github.com/Chia-Network/bladebit/blob/9fac46aff0476e829d476412de18497a3a2f7ed8/src/plotdisk/MapWriter.h#L192-L198

The length of _mapBitCounts should be _numBuckets + ExtraBucket, which here is just _numBuckets, i.e. 512. However, it is defined as:

https://github.com/Chia-Network/bladebit/blob/9fac46aff0476e829d476412de18497a3a2f7ed8/src/plotdisk/k32/CTableWriterBounded.h#L355-L358

https://github.com/Chia-Network/bladebit/blob/9fac46aff0476e829d476412de18497a3a2f7ed8/src/plotdisk/DiskPlotConfig.h#L4

Oops.

Anything more than 256 buckets would crash, because the uint32 _threadCount gets overwritten with an arbitrarily large number, making jobs[i] point to an invalid address.

https://github.com/Chia-Network/bladebit/blob/9fac46aff0476e829d476412de18497a3a2f7ed8/src/threading/MTJob.h#L530-L534
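
The failure mode described above can be sketched in isolation. This is a simplified illustration, not bladebit's actual code: every type and constant name below is hypothetical, the cap of 256 is assumed from the observation that anything above 256 buckets crashes, and the adjacent threadCount member only stands in for whatever memory happens to sit next to the array.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>

// Assumed fixed cap (hypothetical name), smaller than the largest bucket count in use.
constexpr std::size_t kAssumedMaxBuckets = 256;

// Risky shape: the per-bucket array is sized by the fixed cap rather than by
// the template parameter, so indexing buckets 256..511 would write past the
// end of the array and into neighbouring memory (undefined behaviour).
template <std::size_t NumBuckets>
struct FixedCapCounts
{
    std::uint64_t counts[kAssumedMaxBuckets] = {};
    std::uint32_t threadCount = 4;  // stand-in for a neighbouring field

    // True only when every bucket index fits inside the fixed-size array.
    static constexpr bool IsSafe() { return NumBuckets <= kAssumedMaxBuckets; }
};

// Safer shape: the array length follows the template parameter, and every
// write is bounds-checked in debug builds.
template <std::size_t NumBuckets>
struct SizedCounts
{
    std::array<std::uint64_t, NumBuckets> counts{};
    std::uint32_t threadCount = 4;

    void Add(std::size_t bucket, std::uint64_t n)
    {
        assert(bucket < counts.size());
        counts[bucket] += n;
    }
};

// A compile-time guard like this would flag the mismatch before anything runs.
static_assert(!FixedCapCounts<512>::IsSafe(),
              "512 buckets do not fit in an array capped at 256 entries");
```

Sizing the array from the template parameter (or adding a static_assert against the cap) turns a random runtime crash into a compile-time error, which is the general idea behind this kind of fix.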

Firing a PR...

harold-b commented 1 year ago

Thanks for taking the time to dig into this!

I'll try to have a look by tomorrow

AlexGuo1998 commented 1 year ago

BTW, I can confirm it works with my local build. No more crashes, and chia plots check passed.

Here are the CI artifacts if you want to try yourself: https://github.com/AlexGuo1998/bladebit/actions/runs/3476962769#artifacts (x86-64 only, use at your own risk!)

harold-b commented 1 year ago

Indeed I checked out your code review and you are correct. Well done! Bringing the convo over to the PR.

harold-b commented 1 year ago

Fixed in #251