Closed by webmaster128 5 years ago
I was hoping to switch to gfx-rs, which would compile the compute code at compile time and hopefully eliminate problems like these, but unfortunately writing compute GLSL is a huge pain. If I did that I'd probably make a language with macros that compiles to GLSL, but even that would be a pain, since ed25519 is quite large and I'd have to rewrite a lot of it.
I've had problems like this before. Normally I'd look at the two functions introduced in 1feeefe2973e4e9e6f99bec2c355d88a83dfd9a6, but they seem correct to me. Maybe there's another incorrect function that was previously not called, but is now called by ge25519_unpack_vartime. It could also be that calling ge25519_unpack_vartime introduces another optimization elsewhere in the entry function. You could also try throwing in some unnecessary copies in the entry function in an attempt to throw off the compiler, but that's a suboptimal solution.
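To make that last suggestion concrete, here is a rough sketch of the "unnecessary copies" trick in OpenCL C. The kernel name, arguments, and buffer sizes below are invented and do not match the real entry function; the point is only that routing an intermediate buffer through a volatile private array forces the compiler to emit the stores and loads, so it cannot optimize across that point as freely.

    /* Sketch only: names and structure are invented and do not match the
     * real nano-vanity entry kernel. */
    __kernel void generate_pubkey(__global uchar *result,
                                  __global const uchar *key_root) {
        size_t gid = get_global_id(0);
        uchar key_material[32];
        for (int i = 0; i < 32; i++) {
            key_material[i] = key_root[i] ^ (uchar) gid;
        }

        /* ... hashing / scalar multiplication would normally happen here ... */

        /* The "unnecessary copy": round-trip the buffer through a volatile
         * private array. The volatile qualifier forces the compiler to emit
         * the stores and loads, which can break up whatever optimization is
         * misbehaving. */
        volatile uchar speed_bump[32];
        for (int i = 0; i < 32; i++) {
            speed_bump[i] = key_material[i];
        }
        for (int i = 0; i < 32; i++) {
            key_material[i] = speed_bump[i];
        }

        for (int i = 0; i < 32; i++) {
            result[gid * 32 + i] = key_material[i];
        }
    }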
I can reproduce this result using a Quadro P400 from GPUEater (default CUDA installation from the NVIDIA-410.48+CUDA9.0 Ubuntu16.04 x64 image).
export RUSTFLAGS='-L /usr/local/cuda-9.0/lib64/'
git checkout 30eedd02251ffca && git clean -xdf && git diff && cargo build --features gpu && ./target/debug/nano-vanity --gpu --gpu-threads 65536 --threads 0 --limit 3 1sim
// ok
git checkout 1feeefe2973e4e9 && git clean -xdf && git diff && cargo build --features gpu && ./target/debug/nano-vanity --gpu --gpu-threads 65536 --threads 0 --limit 3 1sim
// broken
I'm assuming https://github.com/PlasmaPower/nano-vanity/pull/31 doesn't fix it?
Right. I gave this ticket another try to get a solid baseline for #31, but I did not find one.
Fixed by #31
Let me think out loud about why I believe this happened: before #31 we overwrote 30 bytes of arbitrary memory with 0s. Up to 30eedd02251ffca this may have done no harm, so it was pure luck that 30eedd02251ffca worked. With the added code complexity starting with 1feeefe2973e4e9e6f9, this may have corrupted important memory.
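To illustrate the failure mode I have in mind (a made-up OpenCL C example, not the actual nano-vanity code): zeroing past the end of a small private array is undefined behaviour, and whether it does any harm depends entirely on what the compiler happened to place next to it.

    /* Hypothetical illustration of the bug class, not the real code. */
    __kernel void example(__global uchar *out) {
        size_t gid = get_global_id(0);
        uchar small[2];   /* only 2 bytes of private memory          */
        uchar key[32];    /* a neighbouring buffer the kernel needs  */

        for (int i = 0; i < 32; i++) {
            key[i] = (uchar) (i + gid);
        }

        /* Bug: zeroes 32 bytes into a 2-byte array, i.e. 30 bytes too many.
         * This is undefined behaviour; whether the stray zeros land on top
         * of key, on unused padding, or somewhere harmless depends on how
         * the compiler laid out private memory. */
        for (int i = 0; i < 32; i++) {
            small[i] = 0;
        }

        out[gid] = key[gid % 32];
    }

That would fit the symptoms: the same out-of-bounds write can be harmless at one commit and destructive at the next, purely because the surrounding code changed the memory layout.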
Time to look into https://github.com/rust-accel/nvptx ?
I wonder how important AMD support is...
I really like the idea of a cross-vendor solution, whether you need it today or not.
What I found interesting and also alarming was this:
NVIDIA, which dominates the machine learning market, provides drivers under a proprietary license so that they can modify the terms and conditions freely. In fact, they changed their EULA relating to GeForce/Titan to restrict data center deployment, commercial hosting, etc.
from https://www.gpueater.com/ (sorry, I want to avoid any advertising, but this is the source and I have no better one)
I don't want to give up on OpenCL yet, even though it causes some headaches due to the not-so-great tooling.
Yesterday I bisected the project to find the bug that causes no GPU results on an NVIDIA Corporation GeForce GTX 1080 (Ubuntu 18.04).
Last good commit: 30eedd02251ffca
First bad commit: 1feeefe2973e4e9e6f9
This result is reproducible.
Now, when I apply this code change to the OpenCL code on top of 1feeefe2973e4e9e6f9, it works again.
The most surprising thing to me is that the change that flips the code between working and non-working (I repeated this dozens of times) is not executed at all, since it sits in a block that is only reached for generate_key_type 2.
Given this and other OpenCL-related observations, I guess there is some very aggressive optimization going on. Commenting out the line may cause the code to use less memory or something like that.
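As a purely hypothetical sketch of what I suspect (invented names, not the real kernel): even a branch that is never taken at runtime still gets compiled, and its register and private-memory demand can change how the optimizer treats the rest of the kernel.

    /* Hypothetical: the argument value 2 is never passed at runtime, yet the
     * branch body is still compiled and still uses up resources. */
    __kernel void derive_key(__global uchar *out,
                             __global const uchar *seed,
                             uchar generate_key_type) {
        size_t gid = get_global_id(0);
        uchar key[32];
        for (int i = 0; i < 32; i++) {
            key[i] = seed[i] ^ (uchar) gid;
        }

        if (generate_key_type == 2) {
            /* Never reached in this run, but the compiler cannot know that:
             * the scratch buffer and the loops below still count towards the
             * kernel's register / private-memory budget, which can change
             * spilling and inlining decisions for the code outside the
             * branch. */
            uchar scratch[128];
            for (int i = 0; i < 128; i++) {
                scratch[i] = key[i % 32] ^ (uchar) i;
            }
            for (int i = 0; i < 32; i++) {
                key[i] ^= scratch[i * 4];
            }
        }

        for (int i = 0; i < 32; i++) {
            out[gid * 32 + i] = key[i];
        }
    }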
In this case, it does not help to set
.local_work_size(1)
which makes only one work item per work group and which helped in a different optimization-related case. I am writing this as a note to myself after fighting this issue for hours, hoping for some clever OpenCL hero to come along, or at least as a warning for other frustrated users.
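For completeness, here is roughly what forcing a local work size of 1 looks like against the raw OpenCL host API. This is a minimal, self-contained C sketch with a trivial placeholder kernel and no error handling; the project itself does the same thing through its .local_work_size(1) call.

    /* Minimal hypothetical host-side example: enqueue a trivial kernel with a
     * local work size of 1, i.e. one work item per work group. */
    #define CL_TARGET_OPENCL_VERSION 120
    #include <CL/cl.h>
    #include <stdio.h>

    static const char *kernel_src =
        "__kernel void fill_ids(__global uint *out) {"
        "    out[get_global_id(0)] = (uint) get_global_id(0);"
        "}";

    int main(void) {
        cl_platform_id platform;
        cl_device_id device;
        clGetPlatformIDs(1, &platform, NULL);
        clGetDeviceIDs(platform, CL_DEVICE_TYPE_GPU, 1, &device, NULL);

        cl_context ctx = clCreateContext(NULL, 1, &device, NULL, NULL, NULL);
        cl_command_queue queue = clCreateCommandQueue(ctx, device, 0, NULL);

        cl_program program = clCreateProgramWithSource(ctx, 1, &kernel_src, NULL, NULL);
        clBuildProgram(program, 1, &device, "", NULL, NULL);
        cl_kernel kernel = clCreateKernel(program, "fill_ids", NULL);

        size_t global_work_size = 65536;
        size_t local_work_size = 1;  /* one work item per work group */

        cl_mem out = clCreateBuffer(ctx, CL_MEM_WRITE_ONLY,
                                    global_work_size * sizeof(cl_uint), NULL, NULL);
        clSetKernelArg(kernel, 0, sizeof(cl_mem), &out);

        clEnqueueNDRangeKernel(queue, kernel, 1, NULL,
                               &global_work_size, &local_work_size, 0, NULL, NULL);
        clFinish(queue);

        cl_uint first;
        clEnqueueReadBuffer(queue, out, CL_TRUE, 0, sizeof(cl_uint), &first, 0, NULL, NULL);
        printf("out[0] = %u\n", first);

        clReleaseMemObject(out);
        clReleaseKernel(kernel);
        clReleaseProgram(program);
        clReleaseCommandQueue(queue);
        clReleaseContext(ctx);
        return 0;
    }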