dr-bonez / tor-v3-vanity

A Tor v3 vanity URL generator designed to run on an NVIDIA GPU.
https://crates.io/crates/tor-v3-vanity
MIT License

Why is the performance so terrible? #11

Open FreeApophis opened 3 years ago

FreeApophis commented 3 years ago

The Tor v3 vanity generator cathugger/mkp224o has no GPU support, yet on my 10-year-old CPU it is faster than the numbers shown in the README here.

I get about one vanity hash every 10 seconds for a 5-character prefix on my CPU.

I did not verify whether the README numbers are accurate, because I do not have an NVIDIA GPU, but a GPU implementation should be A LOT faster than this.

Has anyone tested both implementations side by side? Is tor-v3-vanity really that slow? That does not sound right.

With Scallion it was easily possible to generate 8- and 9-character prefixes.
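
For a rough sanity check on those numbers: v3 onion addresses use a 32-character base32 alphabet, so a 5-character prefix needs about 32^5 keys on average, which puts my CPU's search rate in the low millions of keys per second:

    expected keys for a 5-character prefix ≈ 32^5 = 33,554,432
    1 match per ~10 s  =>  ~33,554,432 / 10 ≈ 3.4 M keys/s on the CPU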

marcialvieira commented 3 years ago

Testing mkp224o on an i7-8565U, even with the best optimizations for my machine (--enable-binsearch --enable-amd64-64-24k --enable-intfilter=64), I'm only getting ~15 MK/s, while running tor-v3-vanity with a GTX 1660 I'm getting ~5 GK/s.

However, I noticed that only one CPU core is busy; maybe if the validation work were forwarded to and run on multiple threads I would get more performance, taking better advantage of the keys generated by the GPU.
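
For anyone who wants to experiment with that idea, here is a minimal, repo-agnostic sketch of fanning one batch of GPU-produced candidates out over all CPU cores (host-side C++/CUDA; Candidate and check_candidate are hypothetical stand-ins, not this repo's Rust code):

    #include <algorithm>
    #include <cstddef>
    #include <cstdint>
    #include <thread>
    #include <vector>

    // Hypothetical stand-ins -- real validation would expand the ed25519 key
    // and test the base32-encoded address against the requested prefix.
    struct Candidate { uint8_t seed[32]; };
    static bool check_candidate(const Candidate&) { return false; }

    // Split one batch of candidates across all available CPU cores.
    static void validate_batch(const std::vector<Candidate>& batch) {
        unsigned n = std::max(1u, std::thread::hardware_concurrency());
        std::vector<std::thread> pool;
        for (unsigned t = 0; t < n; ++t) {
            pool.emplace_back([&batch, t, n] {
                // Each worker strides through the batch, so the work splits evenly.
                for (std::size_t i = t; i < batch.size(); i += n)
                    if (check_candidate(batch[i])) { /* report the match */ }
            });
        }
        for (auto& th : pool) th.join();
    }

    int main() {
        std::vector<Candidate> batch(1 << 20); // pretend this came back from the GPU
        validate_batch(batch);
        return 0;
    }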

marcialvieira commented 3 years ago

@FreeApophis is correct, and the output is confusing: it wasn't 5 GK/s. The output was showing me a cumulative count, so the correct rate is 297 KK/s.

BTW: I'm getting 368 KK/s with mkp224o on a Raspberry Pi 2. :O

FreeApophis commented 3 years ago

Thanks for the numbers; there is definitely something wrong with the implementation when a Raspberry Pi is faster than a GTX 1660.

23cku0r commented 3 years ago

4x 2080 Ti (benchmark screenshot)

4x 3090 (benchmark screenshot)

marcialvieira commented 3 years ago

As you can see from what @23cku0r posted, his benchmark is about 8x my Raspberry Pi 2's performance, so just two CPU-based Raspberry Pis have the equivalent performance of one GPU-based 2080 Ti. lol

megapro17 commented 2 years ago

Languages Rust 100.0%

dr-bonez commented 2 years ago

I took a look at the code again, and I don't see an obvious reason why it should be so much slower. This was a weekend pet project I threw together a while back just to try out the nvptx target for rust. I have too much going on right now to look into this, but if anyone takes the time to instrument the code and determine where the bottleneck is, I'm happy to address the problem.
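
A cheap first step for whoever picks this up: wrap the kernel launch in CUDA events to separate GPU kernel time from host-side key handling. A minimal standalone sketch (plain CUDA C++ with a placeholder kernel, not the repo's Rust/nvptx code):

    #include <cstdio>
    #include <cuda_runtime.h>

    // Placeholder for the real search kernel.
    __global__ void render() {}

    int main() {
        cudaEvent_t start, stop;
        cudaEventCreate(&start);
        cudaEventCreate(&stop);

        cudaEventRecord(start);
        render<<<272, 256>>>();   // example geometry; use whatever the tool actually launches
        cudaEventRecord(stop);
        cudaEventSynchronize(stop);

        float ms = 0.0f;
        cudaEventElapsedTime(&ms, start, stop);
        printf("kernel time: %.3f ms per launch\n", ms);

        cudaEventDestroy(start);
        cudaEventDestroy(stop);
        return 0;
    }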

dr-bonez commented 2 years ago

My best guess is that there's an issue with automatic block size detection; 256 threads across 272 blocks seems low for a 2080 Ti.
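
For reference, a 2080 Ti has 68 SMs, so 272 blocks works out to only 4 resident blocks per SM. A generic CUDA-runtime sketch of deriving the launch geometry from the device's SM count and the kernel's occupancy (again a placeholder kernel, not the actual Rust/nvptx code) would look roughly like:

    #include <cstdio>
    #include <cuda_runtime.h>

    // Placeholder for the real search kernel.
    __global__ void render() {}

    int main() {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, 0);

        int threads = 256;
        int blocks_per_sm = 0;
        // How many 256-thread blocks of this kernel can be resident per SM at once?
        cudaOccupancyMaxActiveBlocksPerMultiprocessor(&blocks_per_sm, render, threads, 0);

        int blocks = blocks_per_sm * prop.multiProcessorCount;
        printf("%d SMs x %d blocks/SM -> %d blocks of %d threads\n",
               prop.multiProcessorCount, blocks_per_sm, blocks, threads);
        return 0;
    }

That said, 272 x 256 is already ~70k threads in flight, so the geometry alone may not explain the gap; the per-key cost inside the kernel probably matters as much.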

ghost commented 2 years ago

Something is definitely wrong here. This is my experience running it on my GTX 1080:

==27116== NVPROF is profiling process 27116, command: ./t3v -d keys hello
Launching kernel on device #0 with 256 threads and 60 blocks
Tried 2012160 / 33554432 (expected) keys.
Running for 30 seconds / 8 minutes, 21 seconds (expected).
Tried 4024320 / 33554432 (expected) keys.
Running for 1 minutes, 0 seconds / 8 minutes, 21 seconds (expected).
Tried 6036480 / 33554432 (expected) keys.
Running for 1 minutes, 30 seconds / 8 minutes, 21 seconds (expected).
^C==27116== Profiling application: ./t3v -d keys hello
==27116== Warning: 1 records have invalid timestamps due to insufficient device buffer space. You can configure the buffer space using the option --device-buffer-size.
==27116== Profiling result:
            Type  Time(%)      Time     Calls       Avg       Min       Max  Name
 GPU activities:  100.00%  90.9431s       397  229.08ms  213.42ms  253.10ms  render
                    0.00%  591.57us       794     745ns     351ns  3.0080us  [CUDA memcpy DtoH]
                    0.00%  226.64us       404     561ns     480ns  1.2480us  [CUDA memcpy HtoD]
      API calls:   99.87%  90.9558s       397  229.11ms  213.43ms  253.10ms  cuStreamSynchronize
                    0.11%  100.82ms         1  100.82ms  100.82ms  100.82ms  cuCtxCreate
                    0.01%  10.058ms       794  12.667us  5.8100us  93.444us  cuMemcpyDtoH
                    0.01%  5.0190ms       398  12.610us  9.2540us  72.085us  cuLaunchKernel
                    0.00%  1.7040ms       404  4.2170us  2.4680us  155.27us  cuMemcpyHtoD
                    0.00%  1.6310ms         1  1.6310ms  1.6310ms  1.6310ms  cuModuleLoadData
                    0.00%  255.92us       399     641ns     280ns  1.9780us  cuModuleGetFunction
                    0.00%  109.00us         6  18.166us  1.7130us  99.015us  cuMemAlloc
                    0.00%  9.9490us         1  9.9490us  9.9490us  9.9490us  cuStreamCreateWithPriority
                    0.00%  4.9050us         1  4.9050us  4.9050us  4.9050us  cuDeviceGetPCIBusId
                    0.00%  1.7310us         6     288ns     139ns     553ns  cuDeviceGetAttribute
                    0.00%     832ns         3     277ns     107ns     554ns  cuDeviceGetCount
                    0.00%     555ns         2     277ns     101ns     454ns  cuFuncGetAttribute
                    0.00%     500ns         2     250ns      98ns     402ns  cuDeviceGet
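
Taking the progress counter in that trace at face value, the numbers work out to roughly:

    threads in flight  = 60 blocks x 256 threads = 15,360
    keys per launch    ≈ 6,036,480 keys / 397 launches ≈ 15,200 (about one key per thread per launch)
    time per launch    ≈ 229 ms
    overall rate       ≈ 2,012,160 keys / 30 s ≈ 67 K keys/s

That is orders of magnitude below the ~15 MK/s reported above for mkp224o on a laptop CPU, and it suggests each thread is only deriving about one key per ~229 ms kernel launch.
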
scramblr commented 2 years ago

Same issues here. I figured I was just having bad luck, but no: running on an 8-GPU server produces fewer results than mkp224o on a multi-core CPU. I was really looking forward to this too, as it's the ONLY GPU-based solution currently in existence for v3 onions.