Chia-Network / bladebit

A high-performance k32-only, Chia (XCH) plotter supporting in-RAM and disk-based plotting
Apache License 2.0

Periodic panic when cuda plotting with 3.1.0-rc2 #418

Open llowrey opened 9 months ago

llowrey commented 9 months ago

When running cudaplot and producing c07 plots I get a panic fairly regularly. Here's what I see:

Seed used: 0xe323c0f230a83863a37cb136b4db4c88d600c1cbff549e5907eaec678b02d71e
Proofs requested/fetched: 35 / 100 ( 35.000% )
Proof fetches failed    : 60 ( 60.000% )

WARNING: Deleting plot '/mnt/plots/plot-k32-c07-2023-09-24-01-03-e451f4bc253ca772d5c941fb7ed71cbad1907c5710c67048c2d880675bf2256d.plot.tmp' as it failed to fetch some proofs. This might indicate corrupt plot file.

Completed writing plot in 72.75 seconds
Generating plot 8: 703a66d5a19a1173b863fe3d2ed1fe562aaf8d8ca74826846ddaa81c60088f6e
Plot temporary file: /mnt/plots/plot-k32-c07-2023-09-24-01-10-703a66d5a19a1173b863fe3d2ed1fe562aaf8d8ca74826846ddaa81c60088f6e.plot.tmp

CUDA error: 700 (0x2bc) cudaErrorIllegalAddress : an illegal memory access was encountered

*** Panic!!! *** Fatal Error:
CUDA error cudaErrorIllegalAddress : an illegal memory access was encountered.
/home/llowrey/bladebit_cuda(_ZN7SysHost14DumpStackTraceEv+0x3b)[0x4c7a4b]
/home/llowrey/bladebit_cuda(_Z9PanicExitv+0x9)[0x6450b9]
/home/llowrey/bladebit_cuda[0x47675b]
/home/llowrey/bladebit_cuda(_ZN14CudaK32Plotter3RunERK11PlotRequest+0x5f3)[0x47b8d3]
/home/llowrey/bladebit_cuda(main+0xa67)[0x473ea7]
/lib64/libc.so.6(+0x27510)[0x7fb72b224510]
/lib64/libc.so.6(__libc_start_main+0x89)[0x7fb72b2245c9]

I also see this in dmesg:

[33389.606189] NVRM: GPU at PCI:0000:06:00: GPU-24106a8f-6cbb-0623-97ed-f00643abf6ac
[33389.606226] NVRM: Xid (PCI:0000:06:00): 31, pid=13181, name=bladebit_cuda, Ch 00000008, intr 10000000. MMU Fault: ENGINE GRAPHICS GPCCLIENT_T1_1 faulted @ 0x7fb7_2bba8000. Fault is of type FAULT_PDE ACCESS_TYPE_READ
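To spot these faults across many plotting runs, the Xid line can be picked apart with a regular expression. The following is a minimal sketch only: the `parse_xid` helper and its field names are illustrative and are not part of bladebit or the NVIDIA driver, and the pattern assumes the line layout shown in the dmesg output above. NVIDIA's Xid documentation lists Xid 31 as a GPU memory page fault, which is consistent with the cudaErrorIllegalAddress reported by the plotter.

```python
import re

# Pattern for the "NVRM: Xid" line layout seen in the dmesg output above.
# The layout is assumed from that single sample line.
XID_RE = re.compile(
    r"NVRM: Xid \(PCI:(?P<pci>[0-9a-fA-F:.]+)\): (?P<xid>\d+), "
    r"pid=(?P<pid>\d+), name=(?P<name>[^,]+),"
)


def parse_xid(line):
    """Return PCI address, Xid code, pid and process name, or None if no match."""
    m = XID_RE.search(line)
    if m is None:
        return None
    return {
        "pci": m.group("pci"),
        "xid": int(m.group("xid")),
        "pid": int(m.group("pid")),
        "name": m.group("name"),
    }


sample = (
    "[33389.606226] NVRM: Xid (PCI:0000:06:00): 31, pid=13181, "
    "name=bladebit_cuda, Ch 00000008, intr 10000000."
)
print(parse_xid(sample))
```

Feeding each line of `dmesg` output through `parse_xid` and keeping the non-None results would give a list of faults that can be matched against plot timestamps.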

I'm running with --check 100 --check-threshold 0.8. I do see plots deleted periodically, but the panic only happens when the line starting with "Proof fetches failed" is output, and then it happens immediately.

I have two identical systems (as described below) and both panic at about the same frequency which is between 10 and 20 plots. When I plotted with the alphas and then 3.0.0 I had about 7.75% of plots turn out to be bad. That's about 1 every 13. That's very consistent with what I'm seeing with panics every 10-20 plots.
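The arithmetic behind that comparison can be sketched as follows. This assumes each plot fails independently with the same probability, which is a simplifying assumption and not something the logs establish; the function names are illustrative.

```python
def mean_plots_per_failure(p_fail):
    """Average number of plots per failure for a fixed per-plot failure rate."""
    return 1.0 / p_fail


def prob_failure_within(p_fail, n_plots):
    """Chance of at least one failure in n_plots, assuming independent plots."""
    return 1.0 - (1.0 - p_fail) ** n_plots


p = 0.0775  # bad-plot rate reported for the alphas and 3.0.0
print(f"about one failure every {mean_plots_per_failure(p):.1f} plots")
for n in (10, 20):
    print(f"chance of at least one failure within {n} plots: "
          f"{prob_failure_within(p, n):.2f}")
```

Under that assumption a 7.75% rate gives roughly one failure per 13 plots, so a panic somewhere between plot 10 and plot 20 is in the range one would expect.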

CPU: Opteron 32c
RAM: 256GB DDR3 ECC
GPU: 1070 (PCIe2 x16 due to old Opteron platform)
OS: Fedora 37
Kernel: 6.4.9-100.fc37.x86_64
Driver: 535.86.10
CUDA: 12.2
Bladebit: 3.1.0-rc2

Coyote-UK commented 9 months ago

Ubuntu Server 23.04 sample of approximately 20 runs: 100% bad plots & 100% Panic trying to launch second plot:

./bladebit_cuda -f b[redacted] -c [redacted] -n 2 -z 5 cudaplot --check 100 --check-threshold 0.8 /mnt/scratch

Bladebit Chia Plotter
Version      : 3.1.0-rc2
Git Commit   : 31eba697164efeb29532805b74df00f4ffadcf60
Compiled With: gcc 9.4.0

[Global Plotting Config]
 Will create 2 plots.
 Thread count          : 28
 Warm start enabled    : false
 NUMA disabled         : false
 CPU affinity disabled : false
 Farmer public key     : [redacted]
 Pool contract address : [redacted]
 Compression Level     : 5
 Benchmark mode        : disabled

[Bladebit CUDA Plotter]
 Host RAM            : 251 GiB
 Plot checks         : enabled ( 100 )
 Plot check threshold: 0.800

Selected cuda device 0 : Quadro M5000
 CUDA Compute Capability   : 5.2
 SM count                  : 16
 Max blocks per SM         : 32
 Max threads per SM        : 2048
 Async Engine Count        : 2
 L2 cache size             : 2.00 MB
 L2 persist cache max size : 0.00 MB
 Stack Size                : 1.00 KB
 Memory:
  Total                    : 7.92 GB
  Free                     : 7.85 GB

Allocating buffers (this may take a few seconds)...
Kernel RAM required       : 88026990288  bytes ( 83949.08  MiB or 81.98  GiB )
Intermediate RAM required : 73728        bytes ( 0.07      MiB or 0.00   GiB )
Host RAM required         : 142270791680 bytes ( 135680.00 MiB or 132.50 GiB )
Total Host RAM required   : 230297781968 bytes ( 219629.08 MiB or 214.48 GiB )
GPU RAM required          : 6158610432   bytes ( 5873.31   MiB or 5.74   GiB )
Allocating buffers...
Done.
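The buffer figures in this log can be cross-checked with a little arithmetic. The sketch below only restates the numbers printed above. In this log the reported total equals kernel RAM plus host RAM, with the intermediate figure apparently not included; that is an observation about these numbers, not a statement about how bladebit computes them.

```python
GIB = 1024 ** 3

# Byte counts copied from the "Allocating buffers" output above.
kernel_ram = 88026990288
intermediate_ram = 73728
host_ram = 142270791680
total_host_ram = 230297781968
gpu_ram = 6158610432

# Observation for this log only: total = kernel + host.
print(kernel_ram + host_ram == total_host_ram)

# Convert to GiB to compare against the machine's 251 GiB of host RAM.
print(f"total host RAM required: {total_host_ram / GIB:.2f} GiB")
print(f"GPU RAM required: {gpu_ram / GIB:.2f} GiB")
```

On this machine the requirement leaves a few tens of GiB of host RAM spare, so plain memory exhaustion does not look like the obvious cause from these figures alone.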

Generating plot 1 / 2: 4b4a032924d44ad8e281d6a2109471f84c5864e602c09f04af1bd4f479d3c78a
Plot temporary file: /mnt/scratch/plot-k32-c05-2023-09-25-13-53-4b4a032924d44ad8e281d6a2109471f84c5864e602c09f04af1bd4f479d3c78a.plot.tmp

Generating F1
Finished F1 in 3.81 seconds.
Table 2 completed in 28.71 seconds with 4294890376 entries.
Table 3 completed in 42.37 seconds with 4294764418 entries.
Table 4 completed in 45.29 seconds with 4294493739 entries.
Table 5 completed in 47.03 seconds with 4293960744 entries.
Table 6 completed in 43.70 seconds with 4293008991 entries.
Table 7 completed in 35.82 seconds with 4290937926 entries.
Finalizing Table 7
Finalized Table 7 in 13.16 seconds.
Completed Phase 1 in 259.89 seconds
Marked Table 6 in 49.97 seconds.
Marked Table 5 in 39.78 seconds.
Marked Table 4 in 37.47 seconds.
Marked Table 3 in 36.62 seconds.
Completed Phase 2 in 163.84 seconds
Compressing Table 2 and 3...
Step 1 completed step in 4.56 seconds.
Step 2 completed step in 18.70 seconds.
Completed table 2 in 23.25 seconds with 3439556808 / 4294764418 entries ( 80.09% ).
Compressing tables 3 and 4...
Step 1 completed step in 4.49 seconds.
Step 2 completed step in 11.66 seconds.
Step 3 completed step in 20.45 seconds.
Completed table 3 in 36.60 seconds with 3465423393 / 4294493739 entries ( 80.69% ).
Compressing tables 4 and 5...
Step 1 completed step in 4.52 seconds.
Step 2 completed step in 11.78 seconds.
Step 3 completed step in 20.86 seconds.
Completed table 4 in 37.16 seconds with 3531720987 / 4293960744 entries ( 82.25% ).
Compressing tables 5 and 6...
Step 1 completed step in 4.61 seconds.
Step 2 completed step in 12.27 seconds.
Step 3 completed step in 21.92 seconds.
Completed table 5 in 38.80 seconds with 3711471700 / 4293008991 entries ( 86.45% ).
Compressing tables 6 and 7...
Step 1 completed step in 5.31 seconds.
Step 2 completed step in 13.84 seconds.
Step 3 completed step in 25.94 seconds.
Completed table 6 in 45.09 seconds with 4290937926 / 4290937926 entries ( 100.00% ).
Serializing P7 entries
Completed serializing P7 entries in 7.72 seconds.
Completed Phase 3 in 188.63 seconds
Completed Plot 1 in 612.36 seconds ( 10.21 minutes )

Checking 100 random proofs with seed 0xa37b13f91aa8c3f9077fb5f248e03eb19fa5cac376d80a09e48b32a414facff9...
Plot compression level: 5
10%... 20%... 30%... 40%... 50%... 60%... 70%... 80%... 90%...
Seed used: 0xa37b13f91aa8c3f9077fb5f248e03eb19fa5cac376d80a09e48b32a414facff9
Proofs requested/fetched: 13 / 100 ( 13.000% )
Proof fetches failed    : 87 ( 87.000% )

WARNING: Deleting plot '/mnt/scratch/plot-k32-c05-2023-09-25-13-53-4b4a032924d44ad8e281d6a2109471f84c5864e602c09f04af1bd4f479d3c78a.plot.tmp' as it failed to fetch some proofs. This might indicate corrupt plot file.

Completed writing plot in 46.84 seconds
Generating plot 2 / 2: eb57a05ddc70fc2190515b4f2393a7b257b59865e01b0717154134949d91057c
Plot temporary file: /mnt/scratch/plot-k32-c05-2023-09-25-14-04-eb57a05ddc70fc2190515b4f2393a7b257b59865e01b0717154134949d91057c.plot.tmp

CUDA error: 700 (0x2bc) cudaErrorIllegalAddress : an illegal memory access was encountered

Panic!!! Fatal Error:
CUDA error cudaErrorIllegalAddress : an illegal memory access was encountered.
./bladebit_cuda(_ZN7SysHost14DumpStackTraceEv+0x5b)[0x5648a360a30b]
./bladebit_cuda(_Z9PanicExitv+0xf)[0x5648a37966df]
./bladebit_cuda(+0x7e55f)[0x5648a35b555f]
./bladebit_cuda(_ZN14CudaK32Plotter3RunERK11PlotRequest+0x62b)[0x5648a35ba9eb]
./bladebit_cuda(main+0xaed)[0x5648a35b286d]
/lib/x86_64-linux-gnu/libc.so.6(+0x23a90)[0x7f3eb8e23a90]
/lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0x89)[0x7f3eb8e23b49]
./bladebit_cuda(_start+0x2e)[0x5648a35b402e]

Coyote-UK commented 9 months ago

Detail system spec:

Dell Precision 5810 Tower
E5-2690 V4
256GB DDR4-19200 ECC
NVidia Quadro M5000
Patriot Viper VPN100 512GB
/mnt/scratch type xfs (rw,relatime,attr2,inode64,logbufs=8,logbsize=32k,noquota)
Glotrends PCIe Gen4 M.2 adaptor
Ubuntu Server 23.04
Nvidia Driver 535.113.01

Also tried to run validate & ...

./bladebit_cuda validate -u --cuda /mnt/scratch/*.plot
Validating plot /mnt/scratch/plot-k32-c05-2023-09-29-10-46-63aa3d86d0e32dca3a1f33b9e043fac8de4ef4af6b3e6b8bcf05c18decefe2ee.plot
K : 32
Unpacked : true
Maximum C3 Parks: 429411

Unpacking f7 values...
Actual C3 Parks : 429411
Reding park 7...
Loading back pointer tables...
Loading table 6
Loading table 5
Loading table 4
Loading table 3
Loading table 2
Loading table 1
Floating point exception (core dumped)

./bladebit_cuda -f [Farmer Key] -c [Pool Contract] -n 1 -z 5 cudaplot [Output Folder] - results in very poor quality plot

./bladebit_cuda -f [Farmer Key] -c [Pool Contract] -n 2 -z 5 cudaplot --check 100 --check-threshold 0.8 [Output Folder] - results in 1 very poor quality plot, which gets automatically deleted, followed by the Panic

./bladebit_cuda check [Output Folder]/*.plot - confirms plots are poor quality

I then created 2 new plots, using bladebit_cuda 3.0.0-alpha4 on exactly the same hardware/os/driver platform:

There were no memory errors, and both plots created fine.

A check with that version produced 98% valid proofs, but when I used 3.1.0-rc2 to check the same plot it returned only 14% valid proofs.

Ron-ski commented 9 months ago

I have a system with a 3080 and a P4. When creating plots on the P4, most are deleted due to "Proof fetches failed", although I don't get a panic error and plotting continues. This doesn't happen when plotting on the 3080. I don't know if this is a recent problem, as the majority of my plots were created on the 3080, but I have created some on the P4.

System: See attached spec. Plotting log: See attached file.

The attached plotting log has plot runs from both the 3080 and the P4. Plotting Log.txt System Spec.txt

Ron-ski commented 9 months ago

I've just created 4 plots on my P4 with v3.1.0-beta1, then run Bladebit_cuda check -n 100 against each of them, and they all passed. I'll try another night with v3.1.0-rc1 and see if it's any different.

Coyote-UK commented 9 months ago

Reacting to Ron-ski's successful test I ran my tests again with v3.1.0-beta1.

I was able to create multiple plots with this version.

Validate produced the same floating point exception as rc2.

I ran a check on the resulting plots with n 1000. This returned 12 / 1000 (1.20%) valid proofs found.

Strangely, if I run bladebit_cuda check from v3.0.0-alpha4 on the same plot I get 990 / 1000 (99.00%) valid proofs found.

Coyote-UK commented 9 months ago

So it looks like the check & validate functions are the broken parts. (At least for me)

I created 5 new plots using v3.1.0-rc2 and tested them first with bladebit_cuda_3.0.0-alpha4 and then bladebit_cuda_3.1.0-rc2 both with n 1000.

For exactly the same plot:

3.0.0-alpha4 returned 959 / 1000 (95.90%) valid proofs found.
3.1.0-rc2 returned Proofs requested/fetched: 15 / 1000 ( 1.500% ) Proof fetches failed : 946 ( 94.600% )

Ron-ski commented 9 months ago

Given the above I tested three plots with Bladebit_Cuda check using my P4 and 3.1.0 rc2, all three passed.

I then created a plot with RC2 and the P4, and had it checked immediately after, and it failed with Proof fetches failed : 1 (1.000%).

Now whilst it was being checked I made a copy of the plot. I tested that plot and it passed.

So exactly the same plot fails the test carried out immediately after plotting, but passes when tested with Bladebit_Cuda Check.

harold-b commented 9 months ago

Did you guys build rc2 yourselves or download it from the release page?

llowrey commented 9 months ago

Did you guys build rc2 yourselves or downloaded it from the release page?

Release page, bladebit-cuda-v3.1.0-rc2-centos-x86-64.tar.gz

Ron-ski commented 9 months ago

Did you guys build rc2 yourselves or downloaded it from the release page?

No, downloaded from the release page bladebit-cuda-v3.1.0-rc2-ubuntu-x86-64.tar.gz

harold-b commented 9 months ago

It may be that the CI artifacts have issues, so we can try with a different executable. I will post here when I have one ready for testing, to see if that is the cause of the discrepancy.

harold-b commented 9 months ago

The current issue in this thread appears to be different from the one in the original post, so perhaps someone should open a new issue.

For those who seem to have a different check output by version, would you please test with the artifacts from this run?

https://github.com/Chia-Network/bladebit/actions/runs/6397686875

Coyote-UK commented 9 months ago

I downloaded from the release page also.

I believe the confusion between the original post and the check discussion comes from trying to diagnose the problem with the limited tools and knowledge we have. I am sure we would appreciate some validation, but we seem to have narrowed it down to the check element.

The 'Panic' referred to in the original message manifests when using the automatic check facility with multiple plots.

Coyote-UK commented 9 months ago

Issue persists in 3.1.0 release

Wed 4 Oct 13:52:14 BST 2023 Creating 2 C5 plots

Bladebit Chia Plotter
Version      : 3.1.0
Git Commit   : e9836f8bd963321457bc86eb5d61344bfb76dcf0
Compiled With: gcc 9.4.0

[Global Plotting Config]
 Will create 2 plots.
 Thread count          : 28
 Warm start enabled    : false
 NUMA disabled         : false
 CPU affinity disabled : false
 Farmer public key     : [Redacted]
 Pool contract address : [Redacted]
 Compression Level     : 5
 Benchmark mode        : disabled

[Bladebit CUDA Plotter]
 Host RAM            : 251 GiB
 Plot checks         : enabled ( 100 )
 Plot check threshold: 0.800

Selected cuda device 0 : Quadro M5000
 CUDA Compute Capability   : 5.2
 SM count                  : 16
 Max blocks per SM         : 32
 Max threads per SM        : 2048
 Async Engine Count        : 2
 L2 cache size             : 2.00 MB
 L2 persist cache max size : 0.00 MB
 Stack Size                : 1.00 KB
 Memory:
  Total                    : 7.92 GB
  Free                     : 7.85 GB

Allocating buffers (this may take a few seconds)...
Kernel RAM required       : 88026990288  bytes ( 83949.08  MiB or 81.98  GiB )
Intermediate RAM required : 73728        bytes ( 0.07      MiB or 0.00   GiB )
Host RAM required         : 142270791680 bytes ( 135680.00 MiB or 132.50 GiB )
Total Host RAM required   : 230297781968 bytes ( 219629.08 MiB or 214.48 GiB )
GPU RAM required          : 6158610432   bytes ( 5873.31   MiB or 5.74   GiB )
Allocating buffers...
Done.

Generating plot 1 / 2: fd2029fbe733ab1235f5cab463cc24121ebb42e7aabfc1397563f3ea9a0c0c51
Plot temporary file: /mnt/scratch/plot-k32-c05-2023-10-04-13-53-fd2029fbe733ab1235f5cab463cc24121ebb42e7aabfc1397563f3ea9a0c0c51.plot.tmp

Generating F1
Finished F1 in 3.81 seconds.
Table 2 completed in 28.69 seconds with 4294878659 entries.
Table 3 completed in 42.35 seconds with 4294641477 entries.
Table 4 completed in 45.27 seconds with 4294340859 entries.
Table 5 completed in 47.01 seconds with 4293642371 entries.
Table 6 completed in 43.68 seconds with 4292372113 entries.
Table 7 completed in 35.80 seconds with 4289659730 entries.
Finalizing Table 7
Finalized Table 7 in 13.15 seconds.
Completed Phase 1 in 259.76 seconds
Marked Table 6 in 49.95 seconds.
Marked Table 5 in 39.77 seconds.
Marked Table 4 in 37.43 seconds.
Marked Table 3 in 36.62 seconds.
Completed Phase 2 in 163.78 seconds
Compressing Table 2 and 3...
Step 1 completed step in 4.56 seconds.
Step 2 completed step in 18.68 seconds.
Completed table 2 in 23.24 seconds with 3439338416 / 4294641477 entries ( 80.08% ).
Compressing tables 3 and 4...
Step 1 completed step in 4.49 seconds.
Step 2 completed step in 11.66 seconds.
Step 3 completed step in 20.46 seconds.
Completed table 3 in 36.60 seconds with 3465123579 / 4294340859 entries ( 80.69% ).
Compressing tables 4 and 5...
Step 1 completed step in 4.50 seconds.
Step 2 completed step in 11.78 seconds.
Step 3 completed step in 20.85 seconds.
Completed table 4 in 37.12 seconds with 3531259588 / 4293642371 entries ( 82.24% ).
Compressing tables 5 and 6...
Step 1 completed step in 4.59 seconds.
Step 2 completed step in 12.27 seconds.
Step 3 completed step in 21.90 seconds.
Completed table 5 in 38.76 seconds with 3710716768 / 4292372113 entries ( 86.45% ).
Compressing tables 6 and 7...
Step 1 completed step in 5.31 seconds.
Step 2 completed step in 13.83 seconds.
Step 3 completed step in 25.93 seconds.
Completed table 6 in 45.07 seconds with 4289659730 / 4289659730 entries ( 100.00% ).
Serializing P7 entries
Completed serializing P7 entries in 7.72 seconds.
Completed Phase 3 in 188.53 seconds
Completed Plot 1 in 612.07 seconds ( 10.20 minutes )

Checking 100 random proofs with seed 0x4cbc56dabfe427ef91fc95210f059f0b1021620fc3c91ef1076ffcc2f6028b46...
Plot compression level: 5
10%... 20%... 30%... 40%... 50%... 60%... 70%... 80%... 90%...
Seed used: 0x4cbc56dabfe427ef91fc95210f059f0b1021620fc3c91ef1076ffcc2f6028b46
Proofs requested/fetched: 34 / 100 ( 34.000% )
Proof fetches failed    : 72 ( 72.000% )

WARNING: Deleting plot '/mnt/scratch/plot-k32-c05-2023-10-04-13-53-fd2029fbe733ab1235f5cab463cc24121ebb42e7aabfc1397563f3ea9a0c0c51.plot.tmp' as it failed to fetch some proofs. This might indicate corrupt plot file.

Completed writing plot in 189.89 seconds
Generating plot 2 / 2: ae6b2e164c41a8c230803d106a5cf2ac7072989c7c0acdd63e101ad2b749dc61
Plot temporary file: /mnt/scratch/plot-k32-c05-2023-10-04-14-06-ae6b2e164c41a8c230803d106a5cf2ac7072989c7c0acdd63e101ad2b749dc61.plot.tmp

CUDA error: 700 (0x2bc) cudaErrorIllegalAddress : an illegal memory access was encountered

Panic!!! Fatal Error:
CUDA error cudaErrorIllegalAddress : an illegal memory access was encountered.
./bladebit_cuda(_ZN7SysHost14DumpStackTraceEv+0x5b)[0x564df4c4e30b]
./bladebit_cuda(_Z9PanicExitv+0xf)[0x564df4dda6df]
./bladebit_cuda(+0x7e55f)[0x564df4bf955f]
./bladebit_cuda(_ZN14CudaK32Plotter3RunERK11PlotRequest+0x62b)[0x564df4bfe9eb]
./bladebit_cuda(main+0xaed)[0x564df4bf686d]
/lib/x86_64-linux-gnu/libc.so.6(+0x23a90)[0x7fccb6023a90]
/lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0x89)[0x7fccb6023b49]
./bladebit_cuda(_start+0x2e)[0x564df4bf802e]

Plot created using 3.1.0:
Check with 3.0.0-alpha4: 967 / 1000 (96.70%) valid proofs found.
Check with 3.1.0: Proofs requested/fetched: 7 / 1000 ( 0.700% ) Proof fetches failed : 945 ( 94.500% )

Ron-ski commented 9 months ago

I don't get the panic error. Using the version linked above I set it going making and checking five plots. Four out of the five were deleted with "Proof fetches failed : 1 ( 1.000% )"; the second plot passed with a score of 80%.

Plotting Log 3.1.0.txt

sobertram commented 9 months ago

I have a Tesla P4 and am not seeing any of these issues. I notice that the folks posting here are using non-hybrid cuda mode. I am using 128 hybrid mode:

[Bladebit CUDA Plotter]
 Host RAM            : 125 GiB
 Plot checks         : enabled ( 100 )
 Plot check threshold: 0.600

Selected cuda device 0 : Tesla P4
 CUDA Compute Capability   : 6.1
 SM count                  : 20
 Max blocks per SM         : 32
 Max threads per SM        : 2048
 Async Engine Count        : 2
 L2 cache size             : 2.00 MB
 L2 persist cache max size : 0.00 MB
 Stack Size                : 1.00 KB
 Memory:
  Total                    : 7.43 GB
  Free                     : 6.80 GB

Allocating buffers (this may take a few seconds)...
Kernel RAM required       : 89669060304  bytes ( 85515.08  MiB or 83.51  GiB )
Intermediate RAM required : 1642143744   bytes ( 1566.07   MiB or 1.53   GiB )
Host RAM required         : 28420603904  bytes ( 27104.00  MiB or 26.47  GiB )
Total Host RAM required   : 118089664208 bytes ( 112619.08 MiB or 109.98 GiB )
GPU RAM required          : 6163050496   bytes ( 5877.54   MiB or 5.74   GiB )
Allocating buffers...
Done.

Proofs:

grep "^Proof" bb_farm10.log
Proofs requested/fetched: 111 / 100 ( 111.000% )
Proofs requested/fetched: 79 / 100 ( 79.000% )
Proofs requested/fetched: 90 / 100 ( 90.000% )
Proofs requested/fetched: 81 / 100 ( 81.000% )
Proofs requested/fetched: 99 / 100 ( 99.000% )
Proofs requested/fetched: 91 / 100 ( 91.000% )
Proofs requested/fetched: 78 / 100 ( 78.000% )
Proofs requested/fetched: 110 / 100 ( 110.000% )

Those are all C7 plots.
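The same grep can be extended to compare each result against a threshold. This is a minimal sketch, assuming the first number on each line is the count of proofs fetched and the second the count requested, as the percentages suggest. The helper names are illustrative, and the comparison is not taken from bladebit's own check logic, which may decide differently.

```python
import re

# Matches lines such as "Proofs requested/fetched: 79 / 100 ( 79.000% )".
PROOF_RE = re.compile(r"^Proofs requested/fetched:\s*(\d+)\s*/\s*(\d+)")


def proof_ratios(lines):
    """Return fetched/requested for every matching log line."""
    ratios = []
    for line in lines:
        m = PROOF_RE.match(line)
        if m:
            fetched, requested = int(m.group(1)), int(m.group(2))
            ratios.append(fetched / requested)
    return ratios


def below_threshold(ratios, threshold):
    """Return the ratios that fall under the given threshold."""
    return [r for r in ratios if r < threshold]


# The eight lines from the grep output above.
log_lines = [
    "Proofs requested/fetched: 111 / 100 ( 111.000% )",
    "Proofs requested/fetched: 79 / 100 ( 79.000% )",
    "Proofs requested/fetched: 90 / 100 ( 90.000% )",
    "Proofs requested/fetched: 81 / 100 ( 81.000% )",
    "Proofs requested/fetched: 99 / 100 ( 99.000% )",
    "Proofs requested/fetched: 91 / 100 ( 91.000% )",
    "Proofs requested/fetched: 78 / 100 ( 78.000% )",
    "Proofs requested/fetched: 110 / 100 ( 110.000% )",
]
ratios = proof_ratios(log_lines)
for threshold in (0.6, 0.8):
    print(threshold, len(below_threshold(ratios, threshold)))
```

Note that a ratio can exceed 1.0, as in the 111 / 100 line, so a simple fraction-of-requested reading should be treated with some care.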

So I wonder if the issue lies in the differences between hybrid and non-hybrid plotting?

I also downloaded and installed from artifacts, bladebit-cuda-v3.1.0-rc2-ubuntu-x86-64.tar.gz.