Chia-Network / bladebit

A high-performance k32-only, Chia (XCH) plotter supporting in-RAM and disk-based plotting
Apache License 2.0

bladebit-cuda-v3.1.0-windows-x86-64 very slowly #448

Open valentosnik opened 6 months ago

valentosnik commented 6 months ago

With this command line:
.\bladebit_cuda -f xch -c xch -z 7 -n 90 -w cudaplot --disk-128 -t1 Z:\Tmp\ -t2 Z:\Tmp\ Q:\NFT\
it takes up to 50 min for one plot. System: Win 10 Pro, 128 GB RAM, Ryzen 7 5800X, RTX 3070, Z:\ is a Gen 4 NVMe.

What am I doing wrong?

Generating plot 14 / 90: a87ce0887756f8dcda26bb50dd57fc1928ce6dba1e6d6c7522b873a3ffe5912a
Plot temporary file: Q:\NFT\plot-k32-c07-2023-12-28-09-40-a87ce0887756f8dcda26bb50dd57fc1928ce6dba1e6d6c7522b873a3ffe5912a.plot.tmp

Generating F1
Finished F1 in 31.75 seconds.
Table 2 completed in 74.02 seconds with 4294890872 entries.
Table 3 completed in 426.05 seconds with 4294837918 entries.
Table 4 completed in 470.47 seconds with 4294781997 entries.
Table 5 completed in 353.66 seconds with 4294636582 entries.
Table 6 completed in 554.12 seconds with 4294187494 entries.
Table 7 completed in 205.78 seconds with 4293486129 entries.
Finalizing Table 7
Finalized Table 7 in 10.29 seconds.
Completed Phase 1 in 2132.26 seconds
Marked Table 6 in 11.01 seconds.
Marked Table 5 in 25.55 seconds.
Marked Table 4 in 18.59 seconds.
Marked Table 3 in 22.13 seconds.
Completed Phase 2 in 77.28 seconds
Compressing Table 2 and 3...
Step 1 completed step in 144.55 seconds.
Step 2 completed step in 14.79 seconds.
Completed table 2 in 159.34 seconds with 3439742752 / 4294837918 entries ( 80.09% ).
Compressing tables 3 and 4...
Step 1 completed step in 233.54 seconds.
Step 2 completed step in 30.60 seconds.
Step 3 completed step in 67.32 seconds.
Completed table 3 in 331.46 seconds with 3465825632 / 4294781997 entries ( 80.70% ).
Compressing tables 4 and 5...
Step 1 completed step in 37.38 seconds.
Step 2 completed step in 16.94 seconds.
Step 3 completed step in 101.94 seconds.
Completed table 4 in 156.28 seconds with 3532540674 / 4294636582 entries ( 82.25% ).
Compressing tables 5 and 6...
Step 1 completed step in 20.95 seconds.
Step 2 completed step in 16.97 seconds.
Step 3 completed step in 116.36 seconds.
Completed table 5 in 154.29 seconds with 3712840674 / 4294187494 entries ( 86.46% ).
Compressing tables 6 and 7...
Step 1 completed step in 61.23 seconds.
Step 2 completed step in 47.44 seconds.
Step 3 completed step in 203.17 seconds.
Completed table 6 in 311.86 seconds with 4293486129 / 4293486129 entries ( 100.00% ).
Serializing P7 entries
Completed serializing P7 entries in 42.87 seconds.
Completed Phase 3 in 1156.13 seconds
Completed Plot 1 in 3365.67 seconds ( 56.09 minutes )

rhcompany1337 commented 6 months ago

Assuming your drive Q: is a slow HDD, you have the issue right there. Bladebit will write parts directly to the final drive during the generation of the plot. Often the job will have to wait for your Q drive to finish writing.

To prevent this, use another fast SSD, or the same SSD as your temp drive, as the final directory. Then create another job/script to move the finished plots to the final drive. You can use plow (not working on Windows at the moment), robocopy, etc. to move the plots. You will then face the problem that you generate plots faster than one HDD can write. Kind regards
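A minimal sketch of such a mover job in Python, as an alternative to robocopy. The function name `move_finished_plots` and the `.moving` suffix are made up for this example; it assumes bladebit only renames a file to `*.plot` once the plot is complete (the logs in this thread show the `.plot.tmp` -> `.plot` rename), so `*.plot.tmp` files are left alone.

```python
import shutil
from pathlib import Path


def move_finished_plots(staging_dir, final_dir):
    """Move completed *.plot files from staging_dir to final_dir.

    Returns the list of file names that were moved.
    """
    staging = Path(staging_dir)
    final = Path(final_dir)
    final.mkdir(parents=True, exist_ok=True)
    moved = []
    # "*.plot" does not match "*.plot.tmp", so plots still being written are skipped.
    for plot in sorted(staging.glob("*.plot")):
        # Copy under a temporary name first (hypothetical ".moving" suffix),
        # so a half-copied file never appears as a finished plot.
        partial = final / (plot.name + ".moving")
        shutil.copy2(plot, partial)
        partial.rename(final / plot.name)
        # Only delete the source after the copy has been renamed into place.
        plot.unlink()
        moved.append(plot.name)
    return moved
```

With the paths from this thread you could call `move_finished_plots(r"Z:\NFT", r"Q:\NFT")` in a loop with a sleep in between, in a separate terminal while the plotter runs.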

valentosnik commented 6 months ago

I have already tried it and do not see a big difference. Maybe 5 minutes less. I can also see that the fast SSD is being used very slowly. Can it be that Windows or the code reduces the speed? With the Gigahorse plotter it takes only 7 min on the same hardware...

rhcompany1337 commented 6 months ago

I remember I had some issues with the short version of the parameters. I think some didn't work, so I mostly used the long version of the parameters. For example, for Windows (PowerShell):
./bladebit_cuda.exe -f xch -c xch --threads 14 -n 2 --compress 3 cudaplot --disk-128 -t1 G:\temp D:\plots

After -t1, the first drive is the temp drive and the second drive is the final drive (use the same or a different fast SSD here). But it's all in one parameter.

Give it a try.

valentosnik commented 6 months ago

Still the same. What I can observe: Gigahorse uses common RAM (up to 60 GB) and bladebit does not. Only RAM use goes high...

rhcompany1337 commented 6 months ago

Can you post the terminal output that comes before the output you posted? I mean the output at the start, before the first plot gets created. This might give some extra information.

jeffmiao2016 commented 6 months ago

Me too. I have tried 256 GB RAM and 128 GB + NVMe; the speed is the same. System: Win 11

Bladebit Chia Plotter
Version : 3.1.0
Git Commit : e9836f8bd963321457bc86eb5d61344bfb76dcf0
Compiled With: msvc 19.29.30152

[Global Plotting Config]
Will create 1 plots.
Thread count : 80
Warm start enabled : false
NUMA disabled : false
CPU affinity disabled : false
Farmer public key :
Pool contract address :
Compression Level : 7
Benchmark mode : disabled

[Bladebit CUDA Plotter]
Host RAM : 382 GiB
Plot checks : disabled

Selected cuda device 0 : Tesla P4
CUDA Compute Capability : 6.1
SM count : 20
Max blocks per SM : 32
Max threads per SM : 2048
Async Engine Count : 1
L2 cache size : 2.00 MB
L2 persist cache max size : 0.00 MB
Stack Size : 1.00 KB
Memory:
Total : 8.00 GB
Free : 7.01 GB

Allocating buffers (this may take a few seconds)...
Kernel RAM required : 91955994624 bytes ( 87696.07 MiB or 85.64 GiB )
Intermediate RAM required : 4378927104 bytes ( 4176.07 MiB or 4.08 GiB )
Host RAM required : 142270791680 bytes ( 135680.00 MiB or 132.50 GiB )
Total Host RAM required : 234226786304 bytes ( 223376.07 MiB or 218.14 GiB )
GPU RAM required : 6163050496 bytes ( 5877.54 MiB or 5.74 GiB )
Allocating buffers... Done.

Generating plot 1 / 1: 841186af4a31f234ea83d3546801c33ff0ed28ef62262754cb6ddb1acdce7d39
Plot temporary file: H:\plot-k32-c07-2024-01-04-22-59-841186af4a31f234ea83d3546801c33ff0ed28ef62262754cb6ddb1acdce7d39.plot.tmp

Generating F1
Progress update: 0.01
Finished F1 in 14.12 seconds.
Progress update: 0.1
Table 2 completed in 53.67 seconds with 4294944749 entries.
Progress update: 0.2
Table 3 completed in 88.48 seconds with 4294923861 entries.
Progress update: 0.3
Table 4 completed in 107.92 seconds with 4294720598 entries.
Progress update: 0.4
Table 5 completed in 107.76 seconds with 4294299996 entries.
Progress update: 0.5
Table 6 completed in 93.75 seconds with 4293420175 entries.
Progress update: 0.6
Table 7 completed in 70.86 seconds with 4291904047 entries.
Progress update: 0.7
Finalizing Table 7
Finalized Table 7 in 30.77 seconds.
Completed Phase 1 in 568.55 seconds
Progress update: 0.8
Marked Table 6 in 20.98 seconds.
Marked Table 5 in 18.33 seconds.
Marked Table 4 in 17.62 seconds.
Marked Table 3 in 18.09 seconds.
Completed Phase 2 in 75.02 seconds
Progress update: 0.9
Compressing Table 2 and 3...
Step 1 completed step in 21.54 seconds.
Step 2 completed step in 30.81 seconds.
Completed table 2 in 52.35 seconds with 3439716041 / 4294923861 entries ( 80.09% ).
Compressing tables 3 and 4...
Step 1 completed step in 19.71 seconds.
Step 2 completed step in 39.19 seconds.
Step 3 completed step in 38.05 seconds.
Completed table 3 in 96.95 seconds with 3465652916 / 4294720598 entries ( 80.70% ).
Compressing tables 4 and 5...
Step 1 completed step in 20.05 seconds.
Step 2 completed step in 39.52 seconds.
Step 3 completed step in 38.39 seconds.
Completed table 4 in 97.95 seconds with 3532022496 / 4294299996 entries ( 82.25% ).
Compressing tables 5 and 6...
Step 1 completed step in 20.42 seconds.
Step 2 completed step in 40.77 seconds.
Step 3 completed step in 40.19 seconds.
Completed table 5 in 101.38 seconds with 3711947644 / 4293420175 entries ( 86.46% ).
Compressing tables 6 and 7...
Step 1 completed step in 20.46 seconds.
Step 2 completed step in 44.67 seconds.
Step 3 completed step in 48.47 seconds.
Completed table 6 in 113.61 seconds with 4291904047 / 4291904047 entries ( 100.00% ).
Serializing P7 entries
Completed serializing P7 entries in 27.39 seconds.
Completed Phase 3 in 489.64 seconds
Progress update: 0.95
Completed Plot 1 in 1133.21 seconds ( 18.89 minutes )

H:\plot-k32-c07-2024-01-04-22-59-841186af4a31f234ea83d3546801c33ff0ed28ef62262754cb6ddb1acdce7d39.plot.tmp -> H:\plot-k32-c07-2024-01-04-22-59-841186af4a31f234ea83d3546801c33ff0ed28ef62262754cb6ddb1acdce7d39.plot
Completed writing plot in 0.07 seconds
Final plot table pointers:
Table 1: 0 ( 0x0000000000000000 )
Table 2: 1289294040 ( 0x000000004cd910d8 )
Table 3: 5068279290 ( 0x000000012e17cdfa )
Table 4: 19155960840 ( 0x0000000475c8c408 )
Table 5: 33513430665 ( 0x00000007cd8e5e89 )
Table 6: 48602285040 ( 0x0000000b50ec03f0 )
Table 7: 66048629565 ( 0x0000000f60ce1b3d )
C 1 : 4096 ( 0x0000000000001000 )
C 2 : 1720864 ( 0x00000000001a4220 )
C 3 : 1721040 ( 0x00000000001a42d0 )

Final plot table sizes:
Table 1: 0.00 MiB
Table 2: 3603.92 MiB
Table 3: 13435.06 MiB
Table 4: 13692.35 MiB
Table 5: 14389.85 MiB
Table 6: 16638.13 MiB
Table 7: 16883.96 MiB
C 1 : 1.64 MiB
C 2 : 0.00 MiB
C 3 : 1227.93 MiB

valentosnik commented 6 months ago

With this command:
.\bladebit_cuda.exe -f xch -c xch --threads 14 -n 1 --compress 3 cudaplot --disk-128 -t1 Z:\TMP\ Z:\NFT\

Bladebit Chia Plotter
Version : 3.1.0
Git Commit : e9836f8bd963321457bc86eb5d61344bfb76dcf0
Compiled With: msvc 19.29.30152

[Global Plotting Config]
Will create 1 plots.
Thread count : 14
Warm start enabled : false
NUMA disabled : false
CPU affinity disabled : false
Farmer public key : xch
Pool contract address : xch
Compression Level : 3
Benchmark mode : disabled

[Bladebit CUDA Plotter]
Host RAM : 127 GiB
Plot checks : disabled

Selected cuda device 0 : NVIDIA GeForce RTX 3070
CUDA Compute Capability : 8.6
SM count : 46
Max blocks per SM : 16
Max threads per SM : 1536
Async Engine Count : 5
L2 cache size : 4.00 MB
L2 persist cache max size : 3.00 MB
Stack Size : 1.00 KB
Memory:
Total : 8.00 GB
Free : 6.93 GB

Allocating buffers (this may take a few seconds)...
Kernel RAM required : 92405843664 bytes ( 88125.08 MiB or 86.06 GiB )
Intermediate RAM required : 4378927104 bytes ( 4176.07 MiB or 4.08 GiB )
Host RAM required : 28420603904 bytes ( 27104.00 MiB or 26.47 GiB )
Total Host RAM required : 120826447568 bytes ( 115229.08 MiB or 112.53 GiB )
GPU RAM required : 6163857408 bytes ( 5878.31 MiB or 5.74 GiB )
Allocating buffers... Done.

Generating plot 1 / 1: 86f5af3f8c8fd54db8626565b11fb072f47f9d5ec412b37208094a4612d7528e
Plot temporary file: Z:\NFT\plot-k32-c03-2024-01-07-17-52-86f5af3f8c8fd54db8626565b11fb072f47f9d5ec412b37208094a4612d7528e.plot.tmp

Generating F1
Finished F1 in 5.99 seconds.
Table 2 completed in 119.34 seconds with 4294959390 entries.
Table 3 completed in 338.96 seconds with 4294941788 entries.
Table 4 completed in 358.90 seconds with 4294967296 entries.
Table 5 completed in 334.62 seconds with 4294912943 entries.
Table 6 completed in 373.58 seconds with 4294905795 entries.
Table 7 completed in 235.77 seconds with 4294730296 entries.
Finalizing Table 7
Finalized Table 7 in 93.18 seconds.
Completed Phase 1 in 1862.40 seconds
Marked Table 6 in 26.66 seconds.
Marked Table 5 in 20.98 seconds.
Marked Table 4 in 10.01 seconds.
Marked Table 3 in 10.71 seconds.
Completed Phase 2 in 68.36 seconds
Compressing Table 2 and 3...
Step 1 completed step in 85.92 seconds.
Step 2 completed step in 38.15 seconds.
Completed table 2 in 124.06 seconds with 3439892460 / 4294941788 entries ( 80.09% ).
Compressing tables 3 and 4...
Step 1 completed step in 209.68 seconds.
Step 2 completed step in 72.98 seconds.
Step 3 completed step in 48.50 seconds.
Completed table 3 in 331.17 seconds with 3466118706 / 4294967296 entries ( 80.70% ).
Compressing tables 4 and 5...
Step 1 completed step in 32.51 seconds.
Step 2 completed step in 24.81 seconds.
Step 3 completed step in 25.57 seconds.
Completed table 4 in 82.89 seconds with 3532951205 / 4294912943 entries ( 82.26% ).
Compressing tables 5 and 6...
Step 1 completed step in 29.30 seconds.
Step 2 completed step in 24.53 seconds.
Step 3 completed step in 41.85 seconds.
Completed table 5 in 95.68 seconds with 3713619268 / 4294905795 entries ( 86.47% ).
Compressing tables 6 and 7...
Step 1 completed step in 35.02 seconds.
Step 2 completed step in 27.61 seconds.
Step 3 completed step in 63.85 seconds.
Completed table 6 in 126.48 seconds with 4294730296 / 4294730296 entries ( 100.00% ).
Serializing P7 entries
Completed serializing P7 entries in 9.17 seconds.
Completed Phase 3 in 769.49 seconds
Completed Plot 1 in 2700.26 seconds ( 45.00 minutes )

Z:\NFT\plot-k32-c03-2024-01-07-17-52-86f5af3f8c8fd54db8626565b11fb072f47f9d5ec412b37208094a4612d7528e.plot.tmp -> Z:\NFT\plot-k32-c03-2024-01-07-17-52-86f5af3f8c8fd54db8626565b11fb072f47f9d5ec412b37208094a4612d7528e.plot
Completed writing plot in 39.16 seconds
Final plot table pointers:
Table 1: 0 ( 0x0000000000000000 )
Table 2: 1290144172 ( 0x000000004ce609ac )
Table 3: 11959185692 ( 0x00000002c8d2b11c )
Table 4: 26048757017 ( 0x0000000610a07d19 )
Table 5: 40409998067 ( 0x00000009689fa2f3 )
Table 6: 55505645642 ( 0x0000000cec64f04a )
Table 7: 72963478667 ( 0x00000010fcf6548b )
C 1 : 4096 ( 0x0000000000001000 )
C 2 : 1721996 ( 0x00000000001a468c )
C 3 : 1722172 ( 0x00000000001a473c )

Final plot table sizes:
Table 1: 0.00 MiB
Table 2: 10174.79 MiB
Table 3: 13436.86 MiB
Table 4: 13695.95 MiB
Table 5: 14396.33 MiB
Table 6: 16649.09 MiB
Table 7: 16895.07 MiB
C 1 : 1.64 MiB
C 2 : 0.00 MiB
C 3 : 1228.73 MiB

rhcompany1337 commented 6 months ago

I don't see anything obvious other than your times being much too high. So here are some obvious things to check:

Is the Nvidia driver up to date? I know they had issues with older drivers and bladebit. Worth a check!

Check your NVMe SSD speed, e.g. with CrystalDiskMark. It's very odd to me that the step "Completed writing plot" took 39.16 seconds for you. That write/copy is on the same SSD, so it is only the change of a pointer in your file system. It should take about a second.
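If you want a quick number without installing a benchmark tool, a rough sequential-write check can be sketched in Python. This is not a replacement for CrystalDiskMark: the function name `measure_write_mb_per_s` and the default sizes are made up for this example, it only measures one large sequential write, and OS caching can still inflate the result even with the `fsync` call.

```python
import os
import time


def measure_write_mb_per_s(path, total_mb=1024, chunk_mb=16):
    """Write about total_mb MiB to path in chunk_mb chunks and return MiB/s."""
    # Random data, so a compressing drive or filesystem cannot shortcut the write.
    chunk = os.urandom(chunk_mb * 1024 * 1024)
    n_chunks = max(1, total_mb // chunk_mb)
    start = time.perf_counter()
    with open(path, "wb") as f:
        for _ in range(n_chunks):
            f.write(chunk)
        # Ask the OS to push the data to the device before stopping the clock.
        f.flush()
        os.fsync(f.fileno())
    elapsed = time.perf_counter() - start
    os.remove(path)  # clean up the test file
    return (n_chunks * chunk_mb) / elapsed
```

For example, `measure_write_mb_per_s(r"Z:\Tmp\speedtest.bin")` would test the temp drive from this thread (the file name is arbitrary). Use a larger `total_mb` if the drive has a big write cache.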

If you run bladebit in powershell (my suggestion) you should run powershell with admin rights.

Also odd, but probably nothing: I don't use the trailing backslash in the command path. Yours: Z:\NFT\ Mine: Z:\NFT

It might also help to open the Windows resource monitor while plotting to locate where the bottleneck is. Have a look at CPU, GPU and SSD usage.

The second entry with a time, "Table 2 completed in", should be at something like 20 seconds or less.
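To compare runs without reading the whole log by eye, the timing lines can be pulled out with a small Python helper. This is only a sketch: `summarize_log` is a made-up name, and the regular expressions are based on the log lines pasted in this thread ("Table N completed in X seconds", "Completed Phase N in X seconds"), not on any documented bladebit log format.

```python
import re

# Patterns taken from the log lines posted in this thread.
TABLE_RE = re.compile(r"Table (\d) completed in ([\d.]+) seconds")
PHASE_RE = re.compile(r"Completed Phase (\d) in ([\d.]+) seconds")


def summarize_log(text):
    """Return (table_seconds, phase_seconds) dicts keyed by table/phase number."""
    tables = {int(n): float(s) for n, s in TABLE_RE.findall(text)}
    phases = {int(n): float(s) for n, s in PHASE_RE.findall(text)}
    return tables, phases


if __name__ == "__main__":
    sample = (
        "Finished F1 in 5.99 seconds. "
        "Table 2 completed in 119.34 seconds with 4294959390 entries. "
        "Table 3 completed in 338.96 seconds with 4294941788 entries. "
        "Completed Phase 1 in 1862.40 seconds"
    )
    tables, phases = summarize_log(sample)
    for n in sorted(tables):
        print(f"Table {n}: {tables[n]:.2f} s")
    for n in sorted(phases):
        print(f"Phase {n}: {phases[n]:.2f} s")
```

Run on the logs above, this makes the gap easy to see: Table 2 took 119.34 seconds on the RTX 3070 system and 53.67 seconds on the Tesla P4 system.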

rhcompany1337 commented 6 months ago

Oh, and a warning: check your plots when done. Deep check them. Don't settle for the default 30 checks; go for 100 or 200 checks. I had so many bad plots in an earlier version. They would all pass the 30 checks, but at 200 checks they turned out to be faulty.

valentosnik commented 6 months ago

Thank you for the ideas. Hmmm... I use PowerShell with admin rights. Tried without the backslash --> same. The graphics driver is one of the latest. NVMe writes are very slow, around 300 MB/s. The same hardware with the Gigahorse plotter reaches up to 4 GB/s. No idea. Bladebit is very slow on my system. I will try it on another PC with Linux...

LeroyINC commented 6 months ago

I think there is sometimes an issue with slow writes to NVMe disks. Have you tried to turn off direct I/O? That is a command line switch for bladebit itself that you add to your bladebit command.

valentosnik commented 5 months ago

.\bladebit_cuda.exe -f xch -c xch --no-direct-io --threads 14 -n 1000 --compress 3 cudaplot --disk-128 -t1 Z:\TMP Z:\NFT
Same shit with no direct I/O...