Open aleqx opened 5 years ago
Since this miner is closed source, we cannot check which changes result in better performance.
Some clue in the allocated memory perhaps
Not much reason someone couldn't run it and freeze the GPU and dump their running kernel binaries, for science.
Someone else...
It also probably has something to do with abandoning half the GPUs people have. We support compute capability 2.0 and higher, not just 5.0 and higher. The number of issue reports we get whenever 2.x and 3.x support stops working shows that lots of people still chew coins with old GPUs, in which case any hashrate is better than the zero you get with CryptoDredge, which requires a GTX 9xx or newer. I bet they use some features only available on newer GPUs, like simultaneous compute and transfer. We pause work for some ms to transfer because that's how earlier GPUs must do it: they won't chat while running a kernel, and they require a lot more sync points where all threads must wait for each other to finish. This leads to less than spectacular utilization, since the GPU waits around to report results and receive a new job. Also, 5.0+ GPUs have much better unified memory (transfer only what changed, on access, and without "extra" sync points), which we also don't use for the same reasons. So yes, their memory access and transfer are probably much better, thanks to features that preclude older GPU families.
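To put rough numbers on the "pause to transfer" point above, here is a back-of-envelope model in Python. It is only a sketch: the 95 ms / 5 ms timings and both function names are made up for illustration and are not measurements from xmr-stak or CryptoDredge.

```python
def serialized_utilization(compute_ms: float, transfer_ms: float) -> float:
    """Fraction of wall time spent hashing when the GPU pauses for each transfer."""
    return compute_ms / (compute_ms + transfer_ms)


def overlapped_utilization(compute_ms: float, transfer_ms: float) -> float:
    """Same fraction when the transfer runs concurrently with the kernel.

    Wall time is then bounded by whichever of the two takes longer.
    """
    return compute_ms / max(compute_ms, transfer_ms)


# Hypothetical per-round timings, chosen only to illustrate the effect.
compute_ms = 95.0
transfer_ms = 5.0

serial = serialized_utilization(compute_ms, transfer_ms)
overlap = overlapped_utilization(compute_ms, transfer_ms)
print(f"serialized: {serial:.2%}, overlapped: {overlap:.2%}")
```

The model ignores sync-point stalls between threads, so the real gap on older GPUs could be larger or smaller than this suggests.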
Nice summary of possible improvements. I'm curious though: if you have more pauses and sequential operations (as you said, you sit and wait while they can do simultaneous compute & transfer) then theirs should theoretically draw more power ... whereas I'm observing the opposite (power draw in most algos is visibly smaller in their miner at the same time as improved hashrate).
Bad utilization can also draw more power while doing less work. If it copies the entire UVM range across PCIe every time work swaps out (~10s) and all work must sync up at that same point, it is doing more busywork (overhead) for the same useful work. If it uses the newer UVM stuff, then it only copies changed memory, and without a sync, which is maybe 8% of the garbage generation and collection that the full sync/lockstep (legacy CUDA) approach does. So the hardware does more mining, more smoothly, and less shoveling. I think the various sync_mode settings may make some difference to power usage too.
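For a feel of what "copy everything" versus "copy only what changed" means per job swap, here is a small Python estimate. Everything in it is an assumption for illustration: the 2 GiB managed range, the 12 GB/s effective PCIe bandwidth, and the helper name `copy_time_ms`. The 8% changed fraction is just the guess from the comment above, not a measured value.

```python
def copy_time_ms(bytes_moved: float, bandwidth_bytes_per_s: float) -> float:
    """Time in milliseconds to move bytes_moved over a link of the given bandwidth."""
    return bytes_moved / bandwidth_bytes_per_s * 1000.0


# Hypothetical figures, not taken from any miner.
range_bytes = 2 * 1024**3      # size of the managed memory range (assumed 2 GiB)
changed_fraction = 0.08        # share of the range that changed (guess from the thread)
pcie_bandwidth = 12e9          # effective PCIe bandwidth in bytes/s (assumed)

full_copy_ms = copy_time_ms(range_bytes, pcie_bandwidth)
changed_only_ms = copy_time_ms(range_bytes * changed_fraction, pcie_bandwidth)
print(f"full copy: {full_copy_ms:.1f} ms, changed only: {changed_only_ms:.1f} ms")
```

Under these assumptions the copy time scales directly with the changed fraction, so the saving is only as real as that 8% guess.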
Using NVIDIA cards through the OpenCL backend works even worse, and I'd bet it draws more power too: same sort of utilization problem, less smoothness.
The CUDA docs are pretty clear about how important occupancy and utilization are, and how super-cool the newer stuff is, but it more or less takes a rewrite to change methods.
I love your miner, and kudos for keeping it open source. I would much rather help you than proprietary miners.
That being said, have you had a look at the performance that CryptoDredge is getting on CN algos for Nvidia cards (particularly all heavy variants and aeon/cnlite-l1)? It's quite a lot faster and draws less power at the same time. I can't figure out why. I already found the optimum threads and blocks values for xmr-stak (by an exhaustive search), so it's not that ...