Closed zackoch closed 7 years ago
Congrats!
Drool. Hate to think of the ROI though ;-)
Played around with these yet?
--cuda-block-size Set the CUDA block work size. Default is 128
--cuda-grid-size Set the CUDA grid size. Default is 8192
--cuda-streams Set the number of CUDA streams. Default is 2
--cuda-schedule <mode> Set the schedule mode for CUDA threads waiting for CUDA devices to finish work. Default is 'sync'. Possible values are:
auto - Uses a heuristic based on the number of active CUDA contexts in the process C and the number of logical processors in the system P. If C > P, then yield else spin.
spin - Instruct CUDA to actively spin when waiting for results from the device.
yield - Instruct CUDA to yield its thread when waiting for results from the device.
sync - Instruct CUDA to block the CPU thread on a synchronization primitive when waiting for the results from the device.
--cuda-devices <0 1 ..n> Select which CUDA GPUs to mine on. Default is to use all
--cuda-parallel-hash <1 2 ..8> Define how many hashes to calculate in a kernel, can be scaled to achieve better performance. Default=4
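The `auto` rule in the help text is simple enough to write out. A minimal sketch, assuming only what the help text says (C = active CUDA contexts, P = logical processors). The function name `pick_schedule` is made up for illustration, and this is not ethminer's or the CUDA runtime's actual code:

```python
import os


def pick_schedule(active_contexts, logical_processors=None):
    """Mimic the documented 'auto' rule: yield if C > P, else spin."""
    if logical_processors is None:
        # Fall back to the number of logical CPUs the OS reports.
        logical_processors = os.cpu_count() or 1
    return "yield" if active_contexts > logical_processors else "spin"


# 4 GPU contexts on a 24-core host, then 8 contexts on a 4-core host.
print(pick_schedule(4, 24))
print(pick_schedule(8, 4))
```

With one context per GPU and more CPU cores than GPUs, `auto` would end up spinning under this rule.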
I was thinking about upping the parallel hash to 8 to see what happens. I couldn't find much about grid or block size.
I'm rebuilding the new version right now since I saw it's supposed to yield around 3% better performance.
Yeah, I have lots of code to go through. I'm getting ~8Mh/s on my TX2 with these settings:
./ethminer/ethminer -U --cuda-streams 16 --cuda-block-size 128 --cuda-grid-size 32768
Taking cuda-grid-size down to 4096 dropped my hashrate to ~1Mh/s, and going past 16384 doesn't seem to get me over 8Mh/s.
So I tried parallel hash values of 1, 2, 4, 6, and 8. 6 was the worst, taking it down to around 100Mh/s; 8 is very solid at 145Mh/s. 1, 2, and 4 seemed to have little effect.
I messed with the grid size, going up to 65536, and the console output alternated between about 166Mh/s and 125Mh/s. I tried 131072 and I'd get four readings of 165Mh/s for every 82Mh/s. 262144 yielded something like 328Mh/s, then 0, then 164. That's when I decided I really have to find some documentation to read about how this works before I fry something :( I did notice the temperatures went up a bit at 131072 and 262144. Each GPU went up about 3 degrees C.
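For what it's worth, in CUDA a kernel launched with a grid of G blocks and B threads per block runs G * B threads. Here is a rough sketch of what that means for the grid sizes tried above. It assumes one hash per thread per launch (an assumption, not checked against the ethminer kernels) and uses the 145 MH/s figure only as a placeholder rate:

```python
def threads_per_launch(grid_size, block_size):
    """Total threads in a 1-D launch: blocks times threads per block."""
    return grid_size * block_size


def seconds_per_launch(grid_size, block_size, hashrate_mhs, hashes_per_thread=1):
    """Time one launch would take at a given sustained hashrate (MH/s)."""
    hashes = threads_per_launch(grid_size, block_size) * hashes_per_thread
    return hashes / (hashrate_mhs * 1e6)


for grid in (8192, 32768, 65536, 131072, 262144):
    threads = threads_per_launch(grid, 128)
    ms = seconds_per_launch(grid, 128, 145) * 1000
    print(f"grid {grid:>6}: {threads:>9} threads, ~{ms:.0f} ms per launch")
```

If launches get long compared to how often the console prints, uneven readings like the ones above would not be surprising.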
I see there's no ability to go over parallel hash of 8 - I'm quite interested what it would do if it was possible. Again, unfortunately I don't technically understand what it does and if there's a limitation other than the code - I was looking at a slide deck for the Tesla GPU's earlier - maybe I'll give that a read.
I did find an occupancy calculator here: https://devtalk.nvidia.com/default/topic/368105/cuda-occupancy-calculator-helps-pick-optimal-thread-block-size/ but it appears to be a bit old.
From everything I've read, you build your application to work kick ass with your GPU. The devs can shun me if I have this backwards, but I tried running nvprof, which is a tool that profiles various things on the GPU, one of them being occupancy. I'm probably going to butcher this and be completely wrong, but I wasn't able to profile ethminer, because I don't think it's kicked off with CUDA. I think CUDA starts stuff after you run ethminer, so nvprof can't profile it? That's out of my realm of knowledge.
And each COMPUTE version is slightly different. It would be nice if someone made a pic of how the cores/multiprocessors/warps/threads relate to each other. Well, I'm gonna spend some time reading this:
http://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#compute-capability-6-x
think I could collaborate with you on your project?
Whose project?
Yours. It's okay, I'll figure it out by myself. I just started and am eager to get to the point of having a running build.
Hello guys, I'm currently running the ethminer master branch on K80 Kepler Azure instances.
I tried on an NC24 with 2 Tesla K80s (4 GPUs) with these options on ethermine.org:
--cuda-parallel-hash 8 --cuda-streams 16 --cuda-block-size 128 --cuda-grid-size 32768
I see a lot of 0MH/s right next to 41MH/s, 82MH/s ... sometimes 126MH/s, or 176MH/s one time ... That doesn't seem like a very strong hash rate. By the way, running those options with the -M option gives me 58 MH/s on every try, so maybe that is the real average with these options.
Ok so let's bench :
BENCH 1 : --cuda-block-size 64 --cuda-grid-size 8192 --cuda-parallel-hash 8 --cuda-streams 16
BENCH 2 : --cuda-block-size 128 --cuda-grid-size 8192 --cuda-parallel-hash 8 --cuda-streams 16
BENCH 3 : --cuda-block-size 256 --cuda-grid-size 8192 --cuda-parallel-hash 8 --cuda-streams 16
BENCH 4 : --cuda-block-size 64 --cuda-grid-size 16384 --cuda-parallel-hash 8 --cuda-streams 16
BENCH 5 : --cuda-block-size 128 --cuda-grid-size 16384 --cuda-parallel-hash 8 --cuda-streams 16
BENCH 6 : --cuda-block-size 256 --cuda-grid-size 16384 --cuda-parallel-hash 8 --cuda-streams 16
BENCH 7 : --cuda-block-size 64 --cuda-grid-size 32768 --cuda-parallel-hash 8 --cuda-streams 16
BENCH 8 : --cuda-block-size 128 --cuda-grid-size 32768 --cuda-parallel-hash 8 --cuda-streams 16
BENCH 9 : --cuda-block-size 256 --cuda-grid-size 32768 --cuda-parallel-hash 8 --cuda-streams 16
BENCH 10 : --cuda-block-size 64 --cuda-grid-size 65536 --cuda-parallel-hash 8 --cuda-streams 16
BENCH 11 : --cuda-block-size 128 --cuda-grid-size 65536 --cuda-parallel-hash 8 --cuda-streams 16
BENCH 12 : --cuda-block-size 256 --cuda-grid-size 65536 --cuda-parallel-hash 8 --cuda-streams 16
BENCH 13 : --cuda-block-size 64 --cuda-grid-size 131072 --cuda-parallel-hash 8 --cuda-streams 16
BENCH 14 : --cuda-block-size 128 --cuda-grid-size 131072 --cuda-parallel-hash 8 --cuda-streams 16
BENCH 15 : --cuda-block-size 256 --cuda-grid-size 131072 --cuda-parallel-hash 8 --cuda-streams 16
OK, that's it. I didn't try more than a block size of 256 and a grid size of 131072, @YanBellavance @zackoch
@theobolo it's your grid size and block size values. Try running without them, then try different combinations.
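The bench list above is just every combination of three block sizes and five grid sizes, so it can be generated rather than typed. A small sketch (the function name `bench_commands` is made up for illustration, and only flags that already appear in this thread are used):

```python
from itertools import product

BLOCK_SIZES = (64, 128, 256)
GRID_SIZES = (8192, 16384, 32768, 65536, 131072)


def bench_commands(parallel_hash=8, streams=16):
    """Build the option strings in the same order as the BENCH list."""
    cmds = []
    # Grid size is the outer loop and block size the inner one.
    for n, (grid, block) in enumerate(product(GRID_SIZES, BLOCK_SIZES), start=1):
        cmds.append(
            f"BENCH {n} : --cuda-block-size {block} --cuda-grid-size {grid} "
            f"--cuda-parallel-hash {parallel_hash} --cuda-streams {streams}"
        )
    return cmds


for line in bench_commands():
    print(line)
```

Adding a larger block or grid size to the tuples extends the sweep without renumbering by hand.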
--cuda-block-size 128 : try 64 then try 256..
--cuda-grid-size 32768 : try 65536 ,131072
@YanBellavance I'll edit my post with all the benchs
You don't want --cuda-block-size too high. It sets the number of threads per block, and those threads share the multiprocessor's registers, so you want it just big enough to do its work so you can run more streams in parallel.
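To get a feel for how block size and register use interact, here is a deliberately simplified occupancy estimate. The per-multiprocessor limits are the compute capability 6.0 figures from the CUDA C Programming Guide. The registers-per-thread value is made up for illustration, not measured from ethminer, and the model ignores shared memory and register allocation granularity:

```python
# Per-multiprocessor limits for compute capability 6.0.
MAX_THREADS_PER_SM = 2048
MAX_BLOCKS_PER_SM = 32
REGISTERS_PER_SM = 65536


def occupancy(block_size, regs_per_thread):
    """Fraction of the SM's thread slots that can be resident at once."""
    by_threads = MAX_THREADS_PER_SM // block_size
    by_registers = REGISTERS_PER_SM // (regs_per_thread * block_size)
    blocks = min(MAX_BLOCKS_PER_SM, by_threads, by_registers)
    return blocks * block_size / MAX_THREADS_PER_SM


for block in (64, 128, 256):
    # 64 registers per thread is a hypothetical figure.
    print(f"block {block}: occupancy {occupancy(block, 64):.2f}")
```

In this simplified model the three block sizes come out the same for a fixed register count, which suggests registers per thread matter more than the block size itself. The occupancy calculator linked earlier in the thread handles the details this sketch leaves out.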
@YanBellavance oh definitely - I wasn't sure if you were talking to Patedeloso or myself.
Awesome! I just spent the whole week fiddling with the Genoil miner and got it to build on Windows 10 with MSVC 2015 and CUDA 8.0. I'm pooped and gotta start over with this one lol, but I learned a lot.
Are you running stock software or a custom version of ethminer?
@YanBellavance trying my best local result on ethermine.org:
BENCH 9 : --cuda-block-size 256 --cuda-grid-size 32768 --cuda-parallel-hash 8 --cuda-streams 16
Gives me this:
@YanBellavance
> Are you running stock software or a custom version of ethminer?

I checked out the dev/rc0.11 branch - built with OpenCL off and CUDA on and DCompute=60.
@theobolo you have two GPUs? What's your output of nvidia-smi? Looks like one of the GPUs is not happy with that.
@zackoch I have 4 K80 GPUs on a single NC24 Azure instance: 2 Kepler Teslas, where 1 Tesla = 2 GPUs and 24GB of memory (12GB per GPU).
My nvidia-smi output while running
--cuda-block-size 256 --cuda-grid-size 32768 --cuda-parallel-hash 8 --cuda-streams 16
nvidia-smi :
It seems the local results are totally different from what I see on ethermine.org.
The effective hashrate is really low:
It seems I can't get more than 16.7MH/s on my cloud K80. The readings just spread around and the options don't do much.
I just got rc0.11 but have been using the prebuilt for now. I will build it tomorrow. I probably need to upgrade my CUDA and driver as well:
ubuntu@ip-172-31-17-81:~/mining/miner2/bin$ ./ethminer2 -U --list-devices
[CUDA]: Listing CUDA devices.
FORMAT: [deviceID] deviceName
[0] Tesla K80
Compute version: 3.7
cudaDeviceProp::totalGlobalMem: 11995578368
I am poooped...catch you later :)
@YanBellavance I tried a lot of ETH miners / releases / etc ... on a K80 cloud GPU, and it seems that 16-17Mh/s is the best we can reach for the moment.
I was surprised to find this guy talking about 200Mh/s on a dual Kepler K80 cloud instance. He also provided a screenshot >>> https://steemit.com/ethereum/@justo/cloud-gpu-nvidia-tesla-dual-kepler-k80-eth-mining-hash-rates
Do you think it's fake, guys?
By the way, I'm currently running my worker with these options: --cuda-parallel-hash 8 --cuda-streams 16 --cuda-block-size 128 --cuda-grid-size 16384
and I got something stable looking like this on ethermine.org:
I'll try some other options in a real situation to find the best for the K80. I can say that the latest ethminer build really does improve my hash rate by 2-3% with default settings.
@theobolo did you have the same setup?
Is the benchmark you are showing for each K80 or the total?
16.7MH/s is the average. I need to update my driver and install the one for the K80. I've been using an old one and I don't even know which version it is.
Fiddling with the parameters, I got a bunch of 0H/s and then 1 or 2 results, but it always averages 16.7.
I am eager to go through the code to see how the hash rate is measured, because I don't think it is instantaneous. It must be the host getting the results of computations all at once.
I am pretty sure of it because I was able to get a single hashrate of 330MH/s lol (followed by 19 hash rates of 0).
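That guess can be checked with a toy model. Assume (and this is only the hypothesis from the comment above, not something verified in the ethminer code) that hashes are credited all at once when a batch finishes, while the console prints once per tick. The names and numbers below are illustrative:

```python
def sampled_rates(batch_mh, batch_ticks, n_ticks):
    """Per-tick readings when a whole batch is credited on the tick it completes.

    batch_mh: megahashes in one batch; batch_ticks: ticks one batch takes.
    With one tick per second the readings are in MH/s.
    """
    rates = []
    for tick in range(1, n_ticks + 1):
        rates.append(batch_mh if tick % batch_ticks == 0 else 0.0)
    return rates


# One batch of 334 MH that takes 20 ticks: 19 zeros, then one big reading.
rates = sampled_rates(batch_mh=334.0, batch_ticks=20, n_ticks=40)
print(rates)
print(sum(rates) / len(rates))
```

Nineteen zeros followed by one reading in the 330 range averages out to about 16.7, which matches the pattern described above without the card ever running faster.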
@YanBellavance so I launched a fresh new cluster yesterday, scripted the deployment on 5 virtual machines (NC24 Azure) and used the latest RC 0.11 version, compiled from the release branch with CUDA = ON / OPENCL = OFF
My cluster has been running for 9 hours with this parameter: --cuda-parallel-hash 8
and that's it.
Here is the average speed:
https://ethermine.org/miners/5748DbE414c445050715AA2346d13194e748A313
Remember that ONE worker = 2 Tesla Keplers = 4 GPUs per worker.
Now I will try 8 more hours, starting ethminer with all default options: ethminer -U
So you are able to get good hash rates on the cloud. Does it work like this only on Azure? So each worker is a CPU with 2 video cards and each card has 2 GPUs, right?
@YanBellavance Yes, absolutely, but I'm not doing this because it's a good investment. 5 NC24s cost 13 000$ / month on Azure, for only 250MH/s ... The return is 1200$ per month @currentETHprice...
I have an Azure account with 280 000$ on it, that's why I'm doing this.
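Putting the quoted monthly figures side by side makes the gap obvious. This only uses the numbers from the comment above and ignores price and difficulty changes:

```python
monthly_cost_usd = 13_000     # 5 NC24 instances, as quoted above
monthly_return_usd = 1_200    # mining return at the then-current ETH price
hashrate_mhs = 250

net_usd = monthly_return_usd - monthly_cost_usd
cost_per_mhs = monthly_cost_usd / hashrate_mhs
recovered = monthly_return_usd / monthly_cost_usd

print(f"net per month: {net_usd} USD")
print(f"cost per MH/s per month: {cost_per_mhs:.0f} USD")
print(f"share of cost recovered: {recovered:.1%}")
```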
@theobolo I'm seeing the same thing with the parallel streams - best performance with 8, and if I mess with the other settings it seems to make it worse.
I wish there was someone who knew how we can use nvprof to check occupancy. I have a suspicion that my GPUs aren't being used to their full potential since they're not drawing max power.
@YanBellavance what is the expected hash rate on those k80's?
Holy crap. Are you spending 13G's a month? I like your perseverance. What is your objective? ...and I just saw your 280 000$ lol wow!!! I guess it's impossible to get some anymore, right? There's a 20G$ credit for startups but I don't know if it can apply to cloud.
maybe you should save some of that gas in case I can crack something :D
@zackoch I don't know yet but I am guessing 40% of what the P100 can do. For starters I have to make sure I am getting a full K80, and not half a card, on AWS. I read 12GB RAM and 1 GPU. A single K80 has 24GB RAM and 2 dies. I'm starting a new Ubuntu 16 image from scratch today with the latest drivers.
@theobolo btw I was able to get 13 MH/s on a single g2.2x that I am using at half, so a g2.8x used at half would be 52 MH/s:
https://aws.amazon.com/blogs/aws/new-g2-instance-type-with-4x-more-gpu-power/
The guy in the link says he had access to 15GB of RAM on a g2.2x. I had 8GB of RAM. Does that mean I could get 100 MH/s on a g2.8x just by getting a new image with the latest drivers and rebuilding?
A dedicated ECU would be advantageous on a p2x16 since there is a 2$/hour/instance extra charge to get dedicated hosting.
Then I'll start modifying the code to optimize. I really hope I don't meet a dead end because of the cloud lol :D
@YanBellavance Yep, it's something like that: 13MH/s per K80 GPU. For the moment I can't do better than 55MH/s with 4x K80.
https://ethermine.org/miners/5748DbE414c445050715AA2346d13194e748A313
In my greatest dreams I would imagine something like 100MH/s per worker ...
It must definitely be a software issue... that is what I would get on g2.2x. Did you try dedicated hosting?
What? With 1 g2.2xlarge you reach 100MH/s?
No, 13 MH/s, but I only have access to half a card (1 GPU) on g2.2x and p2.x. So a g2.8x would be 4 * 13 = 52 MH/s and a p2.16x would be 16.7 * 16 = 267 MH/s? (if it really gives me 16 GPUs lol)
Accounting for the fact that I only have half a K80, I should be able to tweak it to 30 MH/s per die.
Looks like this half-card thing is standard; it's shared as a VM. I've got to contact support because their articles say otherwise. He must be using dedicated hosting.
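The extrapolation here is a straight multiplication of the per-GPU rate by the GPU count. A tiny sketch, assuming perfect linear scaling (which real multi-GPU instances may not deliver); the function name is made up for illustration:

```python
def scaled_hashrate(per_gpu_mhs, gpu_count):
    """Best-case estimate: every GPU contributes the same rate."""
    return per_gpu_mhs * gpu_count


print(f"g2.8x:  {scaled_hashrate(13, 4)} MH/s")
print(f"p2.16x: {scaled_hashrate(16.7, 16):.1f} MH/s")
```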
https://aws.amazon.com/blogs/aws/new-g2-instance-type-with-4x-more-gpu-power/
Adding "--cuda-parallel-hash 8 --cuda-streams 16" took the 8x P100 benchmark from 530Mh/s to 560Mh/s, and ethermine.org shows 585MH/s.
But setting --cuda-block-size and --cuda-grid-size doesn't seem to give any real improvement. I need to read more of the Pascal architecture and CUDA 8 whitepapers.
I am making a "driver" for the k80 and P100 to get those/more/all the bells and whistles :)
@YanBellavance that's great !!! ;)
There is a cuda8 patch that came a few days ago, can you guys test that ?
@manpowre link?
@YanBellavance - wait, like you're rewriting the Nvidia driver?! Whaa?
@manpowre yep link and i'll test that right now !
@zackoch , can you run your ethminer with this flag: --farm-recheck 2000 .. on your P100's, and report back the different values you get related to mh/s ? thanks..
Without Patch on 4 x K80 GPU :
With CUDA Patch :
A little bit more effective ...
@manpowre yes I will try later.
For the moment the CUDA patch is really improving performance on the K80:
Now : 48-52Mh/s per worker on average
Before : 42-46 Mh/s per worker on average
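As a rough check on how much the patch helps, the two ranges can be compared at their midpoints. This is just arithmetic on the numbers quoted above, not a new measurement:

```python
def midpoint(low, high):
    return (low + high) / 2


def percent_gain(before, after):
    return (after - before) / before * 100


before = midpoint(42, 46)  # Mh/s per worker without the patch
after = midpoint(48, 52)   # Mh/s per worker with the patch
print(f"{before} -> {after} Mh/s, about {percent_gain(before, after):.1f}% faster")
```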
@YanBellavance --> __dp4a() ? hehe.. I tested this last night, and I got to the NVIDIA spec of performance with a separate CUDA program. Now I need to identify where in the ethminer cudacores this will fit. There are several complex cores in our project. It should be fairly easy to implement.
@YanBellavance Excited to try that 💃
__dp4a():
m 10:18:54|ethminer Mining on PoWhash #8c8c8291 : 191.14MH/s [A0+0:R0+0:F0]
m 10:18:55|ethminer Mining on PoWhash #8c8c8291 : 245.75MH/s [A0+0:R0+0:F0]
m 10:18:56|ethminer Mining on PoWhash #8c8c8291 : 245.75MH/s [A0+0:R0+0:F0]
m 10:18:57|ethminer Mining on PoWhash #8c8c8291 : 232.10MH/s [A0+0:R0+0:F0]
m 10:18:58|ethminer Mining on PoWhash #8c8c8291 : 232.10MH/s [A0+0:R0+0:F0]
m 10:18:59|ethminer Mining on PoWhash #8c8c8291 : 245.75MH/s [A0+0:R0+0:F0]
m 10:19:00|ethminer Mining on PoWhash #8c8c8291 : 232.10MH/s [A0+0:R0+0:F0]
m 10:19:01|ethminer Mining on PoWhash #8c8c8291 : 245.75MH/s [A0+0:R0+0:F0]
m 10:19:02|ethminer Mining on PoWhash #8c8c8291 : 245.75MH/s [A0+0:R0+0:F0]
Basically 3-4 times the performance on 2x 1080 Ti, but it's not finding any blocks, so I broke something. The code is running, though, including in benchmark mode. It can only be done on sm_60 and sm_61 according to the NVIDIA docs.
And I'm using 241-254W on each board.
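Since the console readings jump around, averaging them gives a steadier number. A small sketch that pulls the MH/s values out of the log lines pasted above (the regular expression is written against those lines only and may need adjusting for other ethminer versions):

```python
import re

LOG = """\
m 10:18:54|ethminer Mining on PoWhash #8c8c8291 : 191.14MH/s [A0+0:R0+0:F0]
m 10:18:55|ethminer Mining on PoWhash #8c8c8291 : 245.75MH/s [A0+0:R0+0:F0]
m 10:18:56|ethminer Mining on PoWhash #8c8c8291 : 245.75MH/s [A0+0:R0+0:F0]
m 10:18:57|ethminer Mining on PoWhash #8c8c8291 : 232.10MH/s [A0+0:R0+0:F0]
m 10:18:58|ethminer Mining on PoWhash #8c8c8291 : 232.10MH/s [A0+0:R0+0:F0]
m 10:18:59|ethminer Mining on PoWhash #8c8c8291 : 245.75MH/s [A0+0:R0+0:F0]
m 10:19:00|ethminer Mining on PoWhash #8c8c8291 : 232.10MH/s [A0+0:R0+0:F0]
m 10:19:01|ethminer Mining on PoWhash #8c8c8291 : 245.75MH/s [A0+0:R0+0:F0]
m 10:19:02|ethminer Mining on PoWhash #8c8c8291 : 245.75MH/s [A0+0:R0+0:F0]
"""

# Matches the number immediately before "MH/s", e.g. "245.75MH/s".
RATE = re.compile(r"(\d+(?:\.\d+)?)MH/s")


def rates(log_text):
    """Return every MH/s reading found in the log text, in order."""
    return [float(m.group(1)) for m in RATE.finditer(log_text)]


readings = rates(LOG)
print(len(readings), "readings, mean", round(sum(readings) / len(readings), 2), "MH/s")
```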
Hello!
EDIT: I made a stupid. I've solved my own problem - I'm seeing 140MH/s F*&K Yeah!