PlasmaPower / nano-vanity

A NANO vanity address generator (supports OpenCL)
BSD 2-Clause "Simplified" License

No results on NVIDIA Corporation GeForce GTX 1080 since 1feeefe #28

Closed webmaster128 closed 5 years ago

webmaster128 commented 6 years ago

Yesterday I bisected the project to find the commit that causes the GPU to return no results on an NVIDIA Corporation GeForce GTX 1080 (Ubuntu 18.04).

Last good commit 30eedd02251ffca:

$ git checkout 30eedd02251ffca && git clean -xdf && cargo build --release --features gpu && RUST_BACKTRACE=1 ./target/release/nano-vanity --gpu --threads 0 --limit 0 1sim
[...]
Estimated attempts needed: 65536
Initializing GPU NVIDIA Corporation GeForce GTX 1080
Tried 0 keys (~0.00%)
Found matching account!
Private Key: 86AD025F1956F881DF5DEC4CB5BEFE344DFD826AAD3122286A99DA3A40427F6C
Account:     xrb_1sim9rd6n1bdp6kh1gqn1iog3pege6pfk7a9swnuy97zu8cfpmpa93x13sy5
Tried 0 keys (~0.00%)
Found matching account!
Private Key: 4B61C9674AE33C783E296E784FACCD32C9210E377ABC4E041C36489E5585BCB4
Account:     xrb_1simacbhanhe1ug6qs5xapubyxreokmbmxbshx16d8tx9pjork8wz61yha5s
Tried 0 keys (~0.00%)
Found matching account!
Private Key: 1BDA385708B0B2DC3ECE00465F328A8FECE8CD28BFF93B830167127FC2BE4E50
Account:     xrb_1simkbzdnrei3gd454ckh1rzhdjorwe9s4ned387g5ha1y6c7dzir9cd51hw
Tried 0 keys (~0.00%)
Found matching account!
Private Key: 6C2DCCFE3F44F7DC18C746B31C2BE3B4908B8A42C04EBBFDDADC68C931635F7E
Account:     xrb_1simunhpat9w4qbaorkrc8a1hb8ihnky1bbrmddccn7yaty8xbsaq46pjtyr
Tried 0 keys (~0.00%)
Found matching account!
Private Key: 9E9B1EE430C324A0EFB8395A4379B0A233EEB2E6EFBE5BC38F8AFC9A4B7B495F
Account:     xrb_1sim3bcst48zbaq7intn9dnydk3xxtefscpred5tbo5mu8fp3ktbtw9a5sru

First bad commit 1feeefe2973e4e9e6f9:

$ git checkout 1feeefe2973e4e9e6f9 && git clean -xdf && cargo build --release --features gpu && RUST_BACKTRACE=1 ./target/release/nano-vanity --gpu --threads 0 --limit 0 1sim
[...]
Estimated attempts needed: 65536
Initializing GPU NVIDIA Corporation GeForce GTX 1080
Tried 184549200 keys (~281599.73%)^C

This result is reproducible.
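To put the stalled run in perspective, the numbers from the log can be checked with a small Rust sketch. It assumes that every attempt matches independently with probability 1/estimated, which is a modelling assumption and not something the tool reports. The function names are made up for illustration.

```rust
/// Progress relative to the estimated number of attempts, in percent.
/// This mirrors the "(~x%)" figure in the log output.
fn progress_percent(tried: u64, estimated: u64) -> f64 {
    tried as f64 / estimated as f64 * 100.0
}

/// Probability of having found no match after `tried` attempts,
/// ASSUMING each attempt matches independently with probability
/// 1/estimated. Computed as exp(tried * ln(1 - 1/estimated)).
fn miss_probability(tried: u64, estimated: u64) -> f64 {
    let p = 1.0 / estimated as f64;
    (tried as f64 * (-p).ln_1p()).exp()
}
```

With the values from the log (184549200 keys tried, 65536 estimated attempts), `progress_percent` reproduces the ~281599.73% figure. Under the stated model, `miss_probability` for the same values is far too small for the missing results to be bad luck.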

When I apply the following change to the OpenCL code on top of 1feeefe2973e4e9e6f9, it works again:

diff --git a/src/opencl/entry.cl b/src/opencl/entry.cl
index badd4ed..cc96210 100644
--- a/src/opencl/entry.cl
+++ b/src/opencl/entry.cl
@@ -50,7 +50,7 @@ __kernel void generate_pubkey (__global uchar *result, __global uchar *key_root,
                        public_offset_copy[i] = public_offset[i];
                }
                ge25519 ALIGN(16) public_offset_curvepoint;
-               ge25519_unpack_vartime(&public_offset_curvepoint, public_offset_copy);
+               // ge25519_unpack_vartime(&public_offset_curvepoint, public_offset_copy);
                ge25519_add(&A, &A, &public_offset_curvepoint);
        }
        uchar pubkey[32];

$ git checkout 1feeefe2973e4e9e6f9 && git clean -xdf && git diff && cargo build --release --features gpu && RUST_BACKTRACE=1 ./target/release/nano-vanity --gpu --threads 0 --limit 0 1sim
M   src/opencl/entry.cl
HEAD is now at 1feeefe Add support for public key offset
Removing target/
diff --git a/src/opencl/entry.cl b/src/opencl/entry.cl
index badd4ed..cc96210 100644
--- a/src/opencl/entry.cl
+++ b/src/opencl/entry.cl
@@ -50,7 +50,7 @@ __kernel void generate_pubkey (__global uchar *result, __global uchar *key_root,
                        public_offset_copy[i] = public_offset[i];
                }
                ge25519 ALIGN(16) public_offset_curvepoint;
-               ge25519_unpack_vartime(&public_offset_curvepoint, public_offset_copy);
+               // ge25519_unpack_vartime(&public_offset_curvepoint, public_offset_copy);
                ge25519_add(&A, &A, &public_offset_curvepoint);
        }
        uchar pubkey[32];
   Compiling typenum v1.9.0
   Compiling num-traits v0.2.0
   Compiling unicode-xid v0.0.4
[...]
    Finished release [optimized] target(s) in 30.95s
Estimated attempts needed: 65536
Initializing GPU NVIDIA Corporation GeForce GTX 1080
Tried 0 keys (~0.00%)
Found matching account!
Private Key: 42FB633B2A38F03503DFC08C769B08677ECDD07F8678EC3F036335FCAA363136
Account:     xrb_1simwz4qxrqa35peudfh6n8zsn9hd9okdrrxa6jro9e8ds68ite7q97fkjex
Tried 0 keys (~0.00%)
Found matching account!
Private Key: B8FCF29E515BD3C8314BE42B0A759C9D1F7AB50BCE9EB39247FDB487004ED9DC
Account:     xrb_1simsmt8rw95ygkaos753prcm3jaoi89asw6o8qoaxnqpo8r9zsn691krapf
Tried 0 keys (~0.00%)
Found matching account!
Private Key: 12D864ED008676DD648C42479600C9A44A93DA87DD525421D9AD5E640087FD02
Account:     xrb_1simnhcgarykchdqcwboimjorri5y9pj468ozjxhsdw84t8q9nyd4qdwp434
Tried 0 keys (~0.00%)
Found matching account!
Private Key: 6FA6F470AA8125DFF837FCB16AE55898D787F2F811F9AD571BA2B4D6B40CC654
Account:     xrb_1simjd87p11p7f6ccsbcppzzebkufacxbgtne31cr7g58tdoga3rd8zffuqg
Tried 0 keys (~0.00%)
Found matching account!
Private Key: DD11E26FC962684AF6A34260592FBA36A6305D0390FFE98DE0B0EA8E90CA6E32
Account:     xrb_1simhf5wib7hxspzjgbdhko5811p168zrsg6y5xzgdhqka3qo4xdwshb85r7
Tried 0 keys (~0.00%)
Found matching account!
Private Key: 78E69E4EABFC215E37BDF3FFE6349846C55695ACB3487C9E766C41C5596D075F
Account:     xrb_1simzqnw8uhk39e9os3xan1coycd9ju83ykcnnhxuxqmh1mzrnj4jbgxssu9

The most surprising thing to me is that the line that flips the code between working and non-working (I repeated this dozens of times) is not executed at all, since it sits in a block that only runs for generate_key_type 2.

Given this and other OpenCL-related observations, my guess is that the OpenCL compiler is optimizing very aggressively. Commenting out the line may, for example, cause the kernel to use less memory.

In this case, it does not help to set .local_work_size(1), which runs only one work item per work group. That setting did help in a different optimization-related case.

I am writing this as a note to myself after fighting this issue for hours, hoping for some clever OpenCL hero to come along, or at least as a warning for other frustrated users.

PlasmaPower commented 6 years ago

I was hoping to switch to gfx-rs, which would compile the compute code at compile time and hopefully eliminate problems like these, but unfortunately writing compute GLSL is a huge pain. If I did that I'd probably make a language with macros that compiles to GLSL, but even that would be a pain, since ed25519 is quite large and I'd have to rewrite a lot of it.

I've had problems like this before. Normally I'd look at the two functions introduced in 1feeefe2973e4e9e6f99bec2c355d88a83dfd9a6, but they seem correct to me. Maybe there's another incorrect function that was previously not called, but is now called by ge25519_unpack_vartime. It could also be that calling ge25519_unpack_vartime introduces another optimization elsewhere in the entry function. You could also try throwing in some unnecessary copies in the entry function in an attempt to throw off the compiler, but that's a suboptimal solution.

webmaster128 commented 5 years ago

I can reproduce this result using a Quadro P400 from GPUEater (default CUDA installation from the NVIDIA-410.48+CUDA9.0 Ubuntu16.04 x64 image):

export RUSTFLAGS='-L /usr/local/cuda-9.0/lib64/'

git checkout 30eedd02251ffca && git clean -xdf && git diff && cargo build --features gpu && ./target/debug/nano-vanity --gpu --gpu-threads 65536 --threads 0 --limit 3 1sim
// ok

git checkout 1feeefe2973e4e9 && git clean -xdf && git diff && cargo build --features gpu && ./target/debug/nano-vanity --gpu --gpu-threads 65536 --threads 0 --limit 3 1sim
// broken

PlasmaPower commented 5 years ago

I'm assuming https://github.com/PlasmaPower/nano-vanity/pull/31 doesn't fix it?

webmaster128 commented 5 years ago

Right. I gave this ticket another try to have a solid baseline for #31. But I did not find one.

PlasmaPower commented 5 years ago

Fixed by #31

webmaster128 commented 5 years ago

Let me think out loud about why I think this happened: before #31 we overwrote 30 bytes of arbitrary memory with zeros. This may have done nothing bad up to 30eedd02251ffca, so it was pure luck that 30eedd02251ffca worked. With the added code complexity starting with 1feeefe2973e4e9e6f9, this may have caused corruption of important memory.
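One way to picture this hypothesis is the following Rust sketch. It is a simulation only, not the actual nano-vanity or OpenCL code: a flat byte array stands in for device memory, and the buffer sizes and all names are hypothetical. The only figure taken from the discussion is the 30 bytes cleared past the end of the buffer.

```rust
// Hypothetical sizes: a 2-byte buffer that is cleared as if it were
// 32 bytes long, so 30 bytes of neighbouring memory are zeroed as well.
const BUFFER_LEN: usize = 2;
const CLEARED_LEN: usize = 32;

/// Zeroes CLEARED_LEN bytes from the start of the arena, ignoring
/// where the intended buffer actually ends.
fn clear_too_much(arena: &mut [u8]) {
    for byte in &mut arena[..CLEARED_LEN] {
        *byte = 0;
    }
}

/// Returns true if the memory behind the buffer changed after the
/// oversized clear. `neighbour_fill` models what the neighbouring
/// memory held before: 0 for unused memory, non-zero for live data.
fn neighbour_corrupted(neighbour_fill: u8) -> bool {
    let mut arena = [0u8; 64];
    for byte in &mut arena[BUFFER_LEN..] {
        *byte = neighbour_fill;
    }
    let before = arena; // arrays of u8 are Copy, so this is a snapshot
    clear_too_much(&mut arena);
    arena[BUFFER_LEN..] != before[BUFFER_LEN..]
}
```

In this model, `neighbour_corrupted(0)` returns false because the stray zeros land on memory that was already zero, while any non-zero fill returns true. That matches the idea that the same out-of-bounds write can be harmless in one build and destructive in another, depending on what the compiler places next to the buffer.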

PlasmaPower commented 5 years ago

Time to look into https://github.com/rust-accel/nvptx ?

I wonder how important AMD support is...

webmaster128 commented 5 years ago

I really like the idea of a cross vendor solution, no matter if you need it today or not.

What I found interesting and also alarming was this:

NVIDIA, which dominates the machine learning market, provides drivers under a proprietary license so that they can modify terms and conditions freely. In fact, they changed their EULA relating GeForce/Titan to restrict the data center deployment and commercial hosting etc.

from https://www.gpueater.com/ (sorry, I want to avoid any ads, but this is the source and I have no better one)

I don't want to give up on OpenCL yet, even though it causes some headaches due to its not-so-great tooling.