Chia-Network / bladebit

A high-performance k32-only, Chia (XCH) plotter supporting in-RAM and disk-based plotting
Apache License 2.0

Missing tables (corrupt plots), plot serialization #315

Open jmhands opened 1 year ago

jmhands commented 1 year ago

Running C9, plots get corrupt after a few hundred -n. Command used:

bladebit/build-release/bladebit_cuda -f farmer -c contract -n 1 --compress 9 cudaplot /mnt/ssd

System config

OS="Ubuntu 22.04.2 LTS 6.2.6-060206-generic"
SYSVENDOR_MODEL="Supermicro X11"
CPU="2x Intel(R) Xeon(R) Silver 4114 CPU @ 2.20GHz (C:10|T:40)"
GPU="NVIDIA GeForce RTX 3060 Ti"
GPU_SKU="Ampere"
GPU_ARCH="900-1G142-2520-000"
GPU_DRIVER="530.30.02"
PCIE_LINK="PCIe 3x16"
DRAM="251.515GB, 4x64GB4xSize:64"

alpha3 ok
27fdd38ba9c9a5799962fdfee68d40c5a91d0547 - C9plot_output.txt

corrupt plots:

-rw-rw-r-- 1 jm jm 80799203328 Mar 24 15:18 plot-k32-c09-2023-03-24-15-09-c3dd5be80d633860e0b03609132894d02c6c4ece8e017f9b5e10b27a9f96e9d1.plot
-rw-rw-r-- 1 jm jm 80785190912 Mar 24 15:38 plot-k32-c09-2023-03-24-15-26-978af86ff77d67383aeb7e9fad71c4185baadc18338ee09f4d19ac2191cdac21.plot
-rw-rw-r-- 1 jm jm 80767299584 Mar 24 15:58 plot-k32-c09-2023-03-24-15-45-fb7bc8f370695ac06249162bc3ab93671c11f6d42452201dff9868968ab3523a.plot
-rw-rw-r-- 1 jm jm 80802201600 Mar 24 16:18 plot-k32-c09-2023-03-24-16-05-f58dd66689090c0f6c5b4beeade3c28143c16decb541118d583415434b18fe0c.plot
-rw-rw-r-- 1 jm jm 71266349056 Mar 24 16:36 plot-k32-c09-2023-03-24-16-25-3d6d5bc30298189164bfdbf66f8e0178d4e965e1856575c39effaf741e2b2e46.plot
-rw-rw-r-- 1 jm jm 73242918912 Mar 24 16:56 plot-k32-c09-2023-03-24-16-45-6a1e4615515b7c2736b7561d8803f04ab5fb8c5e015b1528ffd2669dab483a8f.plot
-rw-rw-r-- 1 jm jm 76033470464 Mar 24 17:17 plot-k32-c09-2023-03-24-17-06-67e7f6b88565c888d3275a7dd9f4a76f42165627fb06f1ad0e66127c8154753b.plot
-rw-rw-r-- 1 jm jm 82017386496 Mar 24 17:39 plot-k32-c09-2023-03-24-17-26-6023b20633349fcba78adb610f706c850d1f064d7dafe0a057e16bb719a78670.plot
-rw-rw-r-- 1 jm jm 70684033024 Mar 24 17:57 plot-k32-c09-2023-03-24-17-47-2fd0e9c7af3e8c35f53e084b8102c6a12ec9dced0abcccde419a257db92f4e4a.plot
-rw-rw-r-- 1 jm jm 70843801600 Mar 24 18:20 plot-k32-c09-2023-03-24-18-07-2c630725676fad47ff3af82e4146aaae511b9d3e587bc39d1ab7c614991842ae.plot
-rw-rw-r-- 1 jm jm 83202064384 Mar 24 18:40 plot-k32-c09-2023-03-24-18-25-f267fd21115218820a365a4f7012aecee5fa83385519f87d552cd5c648c94b15.plot
-rw-rw-r-- 1 jm jm 76281200640 Mar 24 18:59 plot-k32-c09-2023-03-24-18-46-245bb390f74e7ea8e889c9893a99a8786503c22410d636ad02cfd7b9713e0159.plot
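
The sizes in a listing like the one above vary noticeably between plots made with the same settings. A minimal Python sketch for quantifying that spread follows. It assumes the input is standard `ls -l` output and that file names contain no spaces; the function names `parse_plot_sizes` and `relative_spread` are hypothetical helpers, not part of bladebit. A large spread is only a hint that something is off, not proof that any particular plot is corrupt.

```python
import re

# Matches the size column and the file name of an `ls -l` line for a .plot file.
# Assumes the usual column order: mode, links, owner, group, size, date, name.
LS_LINE = re.compile(r"^\S+\s+\d+\s+\S+\s+\S+\s+(\d+)\s+.*\s(\S+\.plot)$")


def parse_plot_sizes(listing: str) -> dict:
    """Return {file name: size in bytes} for every .plot line in the listing."""
    sizes = {}
    for line in listing.splitlines():
        match = LS_LINE.match(line.strip())
        if match:
            sizes[match.group(2)] = int(match.group(1))
    return sizes


def relative_spread(sizes: dict) -> float:
    """Return (largest - smallest) / largest over all parsed plot sizes."""
    values = list(sizes.values())
    if not values:
        raise ValueError("no plot sizes parsed")
    return (max(values) - min(values)) / max(values)


if __name__ == "__main__":
    # Two lines copied from the listing above (the smallest and the largest file).
    sample = (
        "-rw-rw-r-- 1 jm jm 70684033024 Mar 24 17:57 "
        "plot-k32-c09-2023-03-24-17-47-"
        "2fd0e9c7af3e8c35f53e084b8102c6a12ec9dced0abcccde419a257db92f4e4a.plot\n"
        "-rw-rw-r-- 1 jm jm 83202064384 Mar 24 18:40 "
        "plot-k32-c09-2023-03-24-18-25-"
        "f267fd21115218820a365a4f7012aecee5fa83385519f87d552cd5c648c94b15.plot\n"
    )
    sizes = parse_plot_sizes(sample)
    print(f"{len(sizes)} plots, relative size spread: {relative_spread(sizes):.1%}")
```

Feeding the full listing into `parse_plot_sizes` gives one entry per plot, which can then be sorted by size to decide which files to validate first.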

DaOneLuna commented 1 year ago

I have seen this with non-compressed plots across multiple systems.


OS="Ubuntu 22.04.1 LTS 5.15.0-60-generic"
SYSVENDOR_MODEL="Supermicro SYS-1029U-TN10RT"
CPU="2x Intel(R) Xeon(R) Gold 6226R CPU @ 2.90GHz (C:16|T:64)"
GPU="NVIDIA RTX A4000"
GPU_SKU="Ampere"
GPU_ARCH="900-5G190-0100-001"
GPU_DRIVER="525.85.12"
PCIE_LINK="PCIe 1x16"
DRAM="376.568GB, 24x16GB24xSize:16 DDR4, 3200MT/s"

toto99303 commented 1 year ago

Not sure if it's the same issue, but roughly 3-4 out of every 10 plots are invalid with -n 30. I'm plotting on an NVIDIA RTX A4000 with compression level 0.

GPU Driver Version: 530.41.03
GPU="NVIDIA RTX A4000"
GPU_SKU="Ampere"
RAM: 512GB (16x32GB DDR3 ECC 1600MHz)
CPU: 2x Intel(R) Xeon(R) CPU E5-2670 v2 @ 2.50GHz

P.S. As a follow-up, I can confirm this is the same issue: 15 out of 51 plots are invalid when I check them with -n 500, whereas -n 30 flags only 1 of those 51.
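
One way to see why a larger -n catches more bad plots is a simple probability sketch in Python. It assumes, as a simplification, that each challenge independently lands on a damaged part of the plot with some fixed probability; the value of `p_bad` below is hypothetical and not measured from these plots, and the model does not describe how the check tool works internally.

```python
def detection_probability(p_bad: float, n_challenges: int) -> float:
    """Chance that at least one of n independent challenges hits a damaged region.

    p_bad is the (assumed) probability that a single challenge fails on this plot.
    """
    if not 0.0 <= p_bad <= 1.0:
        raise ValueError("p_bad must be between 0 and 1")
    if n_challenges < 0:
        raise ValueError("n_challenges must be non-negative")
    # Probability that all n challenges miss the damage is (1 - p_bad) ** n.
    return 1.0 - (1.0 - p_bad) ** n_challenges


if __name__ == "__main__":
    p_bad = 0.002  # hypothetical per-challenge failure rate, for illustration only
    for n in (30, 500):
        print(f"-n {n}: detection probability {detection_probability(p_bad, n):.3f}")
```

Under this model, a plot with only a small damaged region is likely to pass a short check and fail a long one, which matches the pattern reported above.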