fireice-uk / xmr-stak

Free Monero RandomX Miner and unified CryptoNight miner
GNU General Public License v3.0

can you match CryptoDredge (on Nvidia at least)? #2375

Open aleqx opened 5 years ago

aleqx commented 5 years ago

I love your miner, and kudos for keeping it open source. I would much rather help you than proprietary miners.

That being said, have you had a look at the performance that CryptoDredge is getting on CN algos on Nvidia cards (particularly all heavy variants and aeon/cnlite-l1)? It's quite a lot faster and draws less power at the same time. I can't figure out why. I already found the optimum threads and blocks values for xmr-stak (by an exhaustive search), so it's not that ...

psychocrypt commented 5 years ago

Since this miner is closed source, we cannot check which changes result in better performance.

aleqx commented 5 years ago

Some clue in the allocated memory, perhaps?

Spudz76 commented 5 years ago

Not much reason someone couldn't run it and freeze the GPU and dump their running kernel binaries, for science.

Someone else...

Spudz76 commented 5 years ago

It probably also has something to do with abandoning half the GPUs people have. We support compute capability 2.0 and higher, not just 5.0 and higher. The number of issue reports whenever 2.x and 3.x stuff stops working proves that lots of people still chew coins with old GPUs, in which case any hash is better than the zero you get with Dredge, because it requires a GTX 9xx or newer.

I bet they use some of the features only available on newer GPUs, like simultaneous compute and transfer. We pause work for some ms to transfer because that's how earlier GPUs must do it: they won't chat while running a kernel, and they require a lot more sync points where all threads must wait for each other to finish. This leads to less-than-spectacular utilization, due to waiting around for stuff to report results and receive a new job.

Also, 5.0+ GPUs have much better unified memory (transfer only what changed, on access, and without "extra" sync points), which we also don't use for the same reasons. So yes, their memory access and transfer is probably much better, thanks to features that preclude older GPU families.
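
To make the utilization argument concrete, here is a minimal toy model, not code from xmr-stak or CryptoDredge. It compares a batch where the kernel pauses for the transfer with one where the transfer overlaps the kernel. The function name `utilization` and the millisecond figures are hypothetical and chosen only for illustration.

```python
def utilization(compute_ms, transfer_ms, overlap):
    """Fraction of wall-clock time the GPU spends hashing for one work batch."""
    if overlap:
        # Transfer runs alongside the kernel, so the longer of the two sets the pace.
        wall_ms = max(compute_ms, transfer_ms)
    else:
        # The kernel pauses while the transfer runs, so the two costs add up.
        wall_ms = compute_ms + transfer_ms
    return compute_ms / wall_ms


# Hypothetical timings, not measurements from either miner.
serialized = utilization(compute_ms=90.0, transfer_ms=10.0, overlap=False)
overlapped = utilization(compute_ms=90.0, transfer_ms=10.0, overlap=True)
print(f"serialized: {serialized:.0%}, overlapped: {overlapped:.0%}")
```

Under these assumed numbers the serialized batch keeps the GPU hashing 90% of the time, while the overlapped batch hides the transfer entirely. Real gains depend on how long transfers and sync points actually take on a given card.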

aleqx commented 5 years ago

Nice summary of possible improvements. I'm curious though: if you have more pauses and sequential operations (as you said, you sit and wait while they can do simultaneous compute & transfer), then theirs should theoretically draw more power ... whereas I'm observing the opposite (power draw in most algos is visibly lower in their miner, alongside the improved hashrate).

Spudz76 commented 5 years ago

Bad utilization can also draw more power while doing less work. If it copies the entire UVM range across PCIe every time work swaps out (~10s) and all work must sync up at that same point, it is doing more busywork (overhead) for the same amount of mining. If it uses the newer UVM stuff, then it only copies changed memory, and without sync, which is maybe 8% of the garbage generation and collection that the full sync/lockstep (legacy CUDA) approach does. So the hardware does more mining, more smoothly, with less shoveling. I think the various sync_mode settings may make some difference to power usage as well.
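
The copy-everything versus copy-only-what-changed difference can be sketched with simple arithmetic. This is a toy model only: `PAGE_BYTES`, `bytes_copied`, and the page counts are hypothetical, and the 8% dirty fraction just echoes the rough guess above rather than any measurement.

```python
PAGE_BYTES = 4096  # assumed page granularity for this sketch


def bytes_copied(total_pages, dirty_pages, copy_everything):
    """Bytes moved across the bus at one job switch."""
    if not 0 <= dirty_pages <= total_pages:
        raise ValueError("dirty_pages must be between 0 and total_pages")
    # Legacy-style full copy moves every page; the delta copy moves only changed ones.
    pages = total_pages if copy_everything else dirty_pages
    return pages * PAGE_BYTES


# Hypothetical sizes: 100,000 pages in the range, 8,000 of them changed.
full = bytes_copied(100_000, 8_000, copy_everything=True)
delta = bytes_copied(100_000, 8_000, copy_everything=False)
print(f"full copy: {full} bytes, changed-only: {delta} bytes ({delta / full:.0%})")
```

Less data moved per job switch means less time and energy spent on overhead, which is one way a miner could hash faster and still draw less power.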

Using Nvidia through the OpenCL backend works even worse, and I'd bet it draws more power too: same sort of utilization problem, less smoothness.

The CUDA docs are pretty clear about how important occupancy and utilization are, and how super-cool the newer stuff is, but it more or less takes a rewrite to change methods.