C++ error? Building on Power 8 Processor/Ubuntu 16.04/2x NV Tesla P100/

zackoch commented 7 years ago

Hello!

EDIT: I made a stupid. I've solved my own problem - I'm seeing 140MH/s F*&K Yeah!

patadeloso commented 7 years ago

Congrats!

jimmykl commented 7 years ago

Drool. Hate to think of the ROI though ;-)

patadeloso commented 7 years ago

Played around with these yet?

--cuda-block-size Set the CUDA block work size. Default is 128
--cuda-grid-size Set the CUDA grid size. Default is 8192
--cuda-streams Set the number of CUDA streams. Default is 2
--cuda-schedule <mode> Set the schedule mode for CUDA threads waiting for CUDA devices to finish work. Default is 'sync'. Possible values are:
    auto  - Uses a heuristic based on the number of active CUDA contexts in the process C and the number of logical processors in the system P. If C > P, then yield else spin.
    spin  - Instruct CUDA to actively spin when waiting for results from the device.
    yield - Instruct CUDA to yield its thread when waiting for results from the device.
    sync  - Instruct CUDA to block the CPU thread on a synchronization primitive when waiting for the results from the device.
--cuda-devices <0 1 ..n> Select which CUDA GPUs to mine on. Default is to use all
--cuda-parallel-hash <1 2 ..8> Define how many hashes to calculate in a kernel, can be scaled to achieve better performance. Default=4

zackoch commented 7 years ago

I was thinking about upping the parallel hash to 8 to see what happens. I couldn't find much about grid or block size.

I'm rebuilding the new version right now since I saw it's supposed to yield around 3% better performance.

patadeloso commented 7 years ago

Yeah, I have lots of code to go through. I'm getting ~8Mh/s on my TX2 with these settings:

./ethminer/ethminer -U --cuda-streams 16 --cuda-block-size 128 --cuda-grid-size 32768

cuda-grid-size going down to 4096 dropped my hashrate to ~1Mh/s, passing 16384 doesn't seem to get me over 8Mh/s.
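For anyone trying to reason about these two flags: in CUDA, the grid size is the number of blocks per kernel launch and the block size is the number of threads per block, so their product is the number of threads one launch covers. A minimal sketch of that arithmetic follows. The helper name is made up, and the idea that each thread handles one candidate per launch is an assumption that nothing in this thread confirms.

```python
# Sketch: how many GPU threads one kernel launch covers for a given
# --cuda-grid-size (blocks per launch) and --cuda-block-size (threads per block).
# Assumption: one thread handles one candidate per launch; that mapping is
# not confirmed anywhere in this thread.

def threads_per_launch(grid_size: int, block_size: int) -> int:
    """Total CUDA threads in one launch = blocks in the grid * threads per block."""
    return grid_size * block_size


BLOCK_SIZE = 128  # block size used in the command above

for grid_size in (4096, 8192, 16384, 32768):
    total = threads_per_launch(grid_size, BLOCK_SIZE)
    print(f"grid {grid_size:>6} x block {BLOCK_SIZE} = {total:>9,} threads per launch")
```

Going from a grid of 4096 to 32768 at a block size of 128 makes each launch eight times larger, which may be part of why small grids look so much worse here.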

zackoch commented 7 years ago

so I tried a parallel hash of 1,2,4,6, and 8. 6 was the worst, took it down to around 100Mh/s - 8 is very solid at 145Mh/s. 1,2, and 4 seemed to have little effect.

I messed with the grid size, going up to 65536 - it would be like 166Mh/s and 125Mh/s every other on the console output. I tried 131072 and I'd get four 165Mh/s on the console for every 82Mh/s. 262144 yielded like 328Mh/s then 0, then 164. That's when I decided I really have to find some documentation to read about how this works before I fry something:( I did notice the environmental went up a bit at 131072, and 262144. Each GPU went up about 3 degrees C.

I see there's no ability to go over parallel hash of 8 - I'm quite interested what it would do if it was possible. Again, unfortunately I don't technically understand what it does and if there's a limitation other than the code - I was looking at a slide deck for the Tesla GPU's earlier - maybe I'll give that a read.

I did find an occupancy calculator here: https://devtalk.nvidia.com/default/topic/368105/cuda-occupancy-calculator-helps-pick-optimal-thread-block-size/ but it appears to be a bit old.

From everything I've read, you're supposed to build your application to fit your GPU. The devs can shun me if I have this backwards, but I tried running nvprof, a tool that profiles various things on the GPU, occupancy being one of them. I'm probably going to butcher this and be completely wrong, but I wasn't able to profile ethminer, because I don't think it's kicked off with CUDA. I think CUDA starts things up after you run ethminer, so nvprof can't profile it? That's out of my realm of knowledge.

patadeloso commented 7 years ago

And each COMPUTE version is slightly different. It would be nice if someone made a pic of how the cores/multiprocessors/warps/threads relate to each other. Well, I'm gonna spend some time reading this:

http://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#compute-capability-6-x

YanBellavance commented 7 years ago

think I could collaborate with you on your project?

zackoch commented 7 years ago

Whose project?

YanBellavance commented 7 years ago

Yours. It's okay, I'll figure it out by myself. I just started and am eager to get to the point of having a running build.

theobolo commented 7 years ago

Hello guys, I'm currently running the ethminer master branch on K80 Kepler Azure instances.

I tried an NC24 with 2 Tesla K80s (4 GPUs) with these options on ethermine.org:

--cuda-parallel-hash 8 --cuda-streams 16 --cuda-block-size 128 --cuda-grid-size 32768

I see a lot of 0MH/s right next to 41MH/s, 82MH/s ... sometimes 126MH/s, or 176MH/s one time ... It doesn't seem to be a very solid hash rate. By the way, running those options with the -M option gives me 58 MH/s on every try, so maybe that is the real average with these options.

Ok so let's bench :

BENCH 1 : --cuda-block-size 64 --cuda-grid-size 8192 --cuda-parallel-hash 8 --cuda-streams 16

BENCH 2 : --cuda-block-size 128 --cuda-grid-size 8192 --cuda-parallel-hash 8 --cuda-streams 16

BENCH 3 : --cuda-block-size 256 --cuda-grid-size 8192 --cuda-parallel-hash 8 --cuda-streams 16

BENCH 4 : --cuda-block-size 64 --cuda-grid-size 16384 --cuda-parallel-hash 8 --cuda-streams 16

BENCH 5 : --cuda-block-size 128 --cuda-grid-size 16384 --cuda-parallel-hash 8 --cuda-streams 16

BENCH 6 : --cuda-block-size 256 --cuda-grid-size 16384 --cuda-parallel-hash 8 --cuda-streams 16

BENCH 7 : --cuda-block-size 64 --cuda-grid-size 32768 --cuda-parallel-hash 8 --cuda-streams 16

BENCH 8 : --cuda-block-size 128 --cuda-grid-size 32768 --cuda-parallel-hash 8 --cuda-streams 16

BENCH 9 : --cuda-block-size 256 --cuda-grid-size 32768 --cuda-parallel-hash 8 --cuda-streams 16

BENCH 10 : --cuda-block-size 64 --cuda-grid-size 65536 --cuda-parallel-hash 8 --cuda-streams 16

BENCH 11 : --cuda-block-size 128 --cuda-grid-size 65536 --cuda-parallel-hash 8 --cuda-streams 16

BENCH 12 : --cuda-block-size 256 --cuda-grid-size 65536 --cuda-parallel-hash 8 --cuda-streams 16

BENCH 13 : --cuda-block-size 64 --cuda-grid-size 131072 --cuda-parallel-hash 8 --cuda-streams 16

BENCH 14 : --cuda-block-size 128 --cuda-grid-size 131072 --cuda-parallel-hash 8 --cuda-streams 16

BENCH 15 : --cuda-block-size 256 --cuda-grid-size 131072 --cuda-parallel-hash 8 --cuda-streams 16

OK, that's it. I didn't try more than a block size of 256 or a grid size of 131072. @YanBellavance @zackoch
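The bench list above is a full sweep of five grid sizes against three block sizes with the other two flags held fixed. A small sketch that regenerates the same 15 flag strings, in case anyone wants to script the runs. The function and constant names are made up for illustration, and nothing here actually launches ethminer.

```python
from itertools import product

# Values taken from the BENCH list above; the fixed flags are the ones it repeats.
GRID_SIZES = (8192, 16384, 32768, 65536, 131072)
BLOCK_SIZES = (64, 128, 256)
FIXED_FLAGS = "--cuda-parallel-hash 8 --cuda-streams 16"


def bench_matrix():
    """Return (label, flags) pairs in the same order as BENCH 1..15."""
    runs = []
    # product() varies its last argument fastest, so the block size cycles
    # through 64/128/256 inside each grid size, matching the list above.
    for number, (grid, block) in enumerate(product(GRID_SIZES, BLOCK_SIZES), start=1):
        flags = f"--cuda-block-size {block} --cuda-grid-size {grid} {FIXED_FLAGS}"
        runs.append((f"BENCH {number}", flags))
    return runs


for label, flags in bench_matrix():
    print(f"{label} : {flags}")
```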

YanBellavance commented 7 years ago

@theobolo it's your grid size and block size values. Try running without them, then try different combinations.

YanBellavance commented 7 years ago

--cuda-block-size 128 : try 64 then try 256..

--cuda-grid-size 32768 : try 65536 ,131072

theobolo commented 7 years ago

@YanBellavance I'll edit my post with all the benchs

YanBellavance commented 7 years ago

You don't want --cuda-block-size too high. It sets the number of threads per block, and those threads all draw on the same pool of registers on the multiprocessor, so you want it just big enough to do its work so you can put more streams in parallel.
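A rough way to see how block size and per-thread register use interact, in the spirit of the occupancy calculator linked earlier. The limits below are the per-multiprocessor figures for compute capability 6.x (65,536 registers, 2,048 resident threads, 32 resident blocks). The 64 registers per thread is a made-up figure, not a measurement of ethminer's kernel, and the model ignores shared memory and register allocation granularity, so treat it as a sketch only.

```python
# Simplified occupancy estimate for one streaming multiprocessor (SM).
# Default limits are for compute capability 6.x; regs_per_thread is hypothetical.

def resident_blocks_per_sm(block_size, regs_per_thread,
                           regs_per_sm=65536,
                           max_threads_per_sm=2048,
                           max_blocks_per_sm=32):
    """Blocks that fit on one SM under the thread, register, and block limits."""
    by_threads = max_threads_per_sm // block_size
    by_registers = regs_per_sm // (regs_per_thread * block_size)
    return min(by_threads, by_registers, max_blocks_per_sm)


def occupancy(block_size, regs_per_thread, max_threads_per_sm=2048):
    """Fraction of the SM's thread slots that are filled."""
    blocks = resident_blocks_per_sm(block_size, regs_per_thread,
                                    max_threads_per_sm=max_threads_per_sm)
    return blocks * block_size / max_threads_per_sm


for block_size in (64, 128, 256):
    print(block_size, occupancy(block_size, regs_per_thread=64))
```

With the hypothetical 64 registers per thread, all three block sizes come out at the same 0.5 in this model, which hints that register use per thread can matter more than the block size itself.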

zackoch commented 7 years ago

@YanBellavance oh definitely - I wasn't sure if you were talking to patadeloso or myself.

YanBellavance commented 7 years ago

awesome! I just spent the whole week fiddling with the genoil miner and got it to build on windows10 msvc 2015 cuda 8.0. I'm pooped and gotta start over with this one lol but I learned a lot.

Are you running stock software or a custom version of ethminer?

theobolo commented 7 years ago

@YanBellavance Trying my best result locally, on ethermine.org:

BENCH 9 : --cuda-block-size 256 --cuda-grid-size 32768 --cuda-parallel-hash 8 --cuda-streams 16

Gives me this:

zackoch commented 7 years ago

@YanBellavance

> Are you running stock software or a custom version of ethminer?

I checked out the dev/rc0.11 branch - built with OpenCL off and CUDA on and DCompute =60.

@theobolo you have two GPU's? what's your output of nvidia-smi? Looks like one of the GPU's is not happy with that.

theobolo commented 7 years ago

@zackoch I have 4 K80 GPUs on a single NC24 Azure instance: 2 Kepler Teslas, where 1 Tesla = 2 GPUs and 24GB of memory (12GB per GPU).

My Nvidia-smi while running

--cuda-block-size 256 --cuda-grid-size 32768 --cuda-parallel-hash 8 --cuda-streams 16

nvidia-smi :

It seems the local results are totally different from what I see on ethermine.org.

Effective Hashrate is really low :

YanBellavance commented 7 years ago

it seems I can't get more than 16.7MHs on my cloud k80. they just spread around and the options don't do much.

I just got rc0.11 but have been using the prebuilt for now. Will build it tomorrow. I probably need to upgrade my cuda and driver as well:

ubuntu@ip-172-31-17-81:~/mining/miner2/bin$ ./ethminer2 -U --list-devices
[CUDA]: Listing CUDA devices.
FORMAT: [deviceID] deviceName
[0] Tesla K80
    Compute version: 3.7
    cudaDeviceProp::totalGlobalMem: 11995578368

I am poooped...catch you later :)

theobolo commented 7 years ago

@YanBellavance i tried a lot of ETH miners / releases / etc ... on a K80 cloud GPU, seems that 16-17Mh/s is the best we can reach for the moment.

I was surprised to find that guy talking about 200mh/s on a dual Kepler K80 cloud instance, he also provided a screenshot >>> https://steemit.com/ethereum/@justo/cloud-gpu-nvidia-tesla-dual-kepler-k80-eth-mining-hash-rates

Do you guys think it's fake?

By the way, I'm currently running my worker with these options: --cuda-parallel-hash 8 --cuda-streams 16 --cuda-block-size 128 --cuda-grid-size 16384, and I get something stable that looks like this on ethermine.org:

I'll try some other options under real conditions to find the best for the K80. I can say that the latest ethminer build really does improve my hash rate by 2-3% with default settings.

YanBellavance commented 7 years ago

@theobolo did you have the same setup?

the benchmark you are showing, is it for each k80 or total?

YanBellavance commented 7 years ago

16.7MHs is the average. I need to update my driver and install the one for the k80; I've been using an old one and I don't even know which version it is.

Fiddling with the parameters I got a bunch of 0Hs then 1 or 2 results, but it always averages 16.7.

I am eager to go through the code to see how the hash rate is measured, because I don't think it is instantaneous. It must be the host getting the results of computations all at once.

I am pretty sure of it because I was able to get a single hash rate of 330MHs lol (followed by 19 hash rates of 0).
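A toy model of the reporting effect described above. It assumes (not verified against ethminer's source) that hashes are only credited when a whole batch of work comes back from the GPU, and that the console prints one reading per second. The function name, the 16.5 MH/s true rate, and the 20-second batch time are all made up for illustration.

```python
def sampled_rates(true_rate_mhs, batch_seconds, total_seconds):
    """Per-second readings when work is only counted as whole batches complete."""
    batch_megahashes = true_rate_mhs * batch_seconds
    readings = []
    for second in range(total_seconds):
        # Number of batches that finish in the interval (second, second + 1].
        finished = (second + 1) // batch_seconds - second // batch_seconds
        # Megahashes credited during a 1-second interval is the MH/s reading.
        readings.append(finished * batch_megahashes)
    return readings


readings = sampled_rates(true_rate_mhs=16.5, batch_seconds=20, total_seconds=40)
average = sum(readings) / len(readings)

print(readings)
print(f"average: {average} MH/s")
```

With these made-up numbers each 20-second window shows one reading of 330 and 19 readings of 0, yet the average stays at 16.5 MH/s, which is the same shape as the console output reported here.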

theobolo commented 7 years ago

@YanBellavance So I launched a fresh new cluster yesterday, scripted the deployment on 5 virtual machines (NC24 Azure) and used the latest RC 0.11 version, compiled from the release branch with CUDA = ON / OPENCL = OFF.

My cluster has been running for 9 hours with this parameter: --cuda-parallel-hash 8 and that's it. Here is the average speed:

https://ethermine.org/miners/5748DbE414c445050715AA2346d13194e748A313

Remember that ONE worker = 2 TESLA KEPLER = 4 GPUs / worker.

Now i will try 8 more hours by starting ethminer with all default options : ethminer -U

YanBellavance commented 7 years ago

so you are able to get good hash rates on the cloud. does it work like this only on azure? so each worker is a cpu with 2 video cards and each card has 2 GPUs right?

theobolo commented 7 years ago

@YanBellavance Yes absolutely, but I'm not doing this because it's a good investment: 5 NC24s cost 13 000$ / month on Azure, for only 250MH/s ... The return is 1200$ per month at the current ETH price...

I have an Azure account with 280 000$ on it, that's why I'm doing this.
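To put the figures quoted above side by side, here is the arithmetic as a few lines of Python. The three inputs are the numbers from this comment; the variable names are just for illustration.

```python
MONTHLY_COST_USD = 13_000    # 5 NC24 instances, as quoted above
MONTHLY_RETURN_USD = 1_200   # mining return quoted above
TOTAL_HASHRATE_MHS = 250     # combined hash rate quoted above

net_usd = MONTHLY_RETURN_USD - MONTHLY_COST_USD
cost_per_mhs = MONTHLY_COST_USD / TOTAL_HASHRATE_MHS
recovered_fraction = MONTHLY_RETURN_USD / MONTHLY_COST_USD

print(f"net per month: {net_usd} USD")
print(f"cost per MH/s per month: {cost_per_mhs:.2f} USD")
print(f"share of cost recovered: {recovered_fraction:.1%}")
```

That works out to a loss of 11,800 USD a month, or roughly 9% of the instance cost recovered, so it only makes sense when the credit is already paid for.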

zackoch commented 7 years ago

@theobolo I'm seeing the same thing with the parallel streams - best performance with 8, and if I mess with the other settings it seems to make it worse.

I wish there was someone who knew about how we can use nvprof to check occupancy. I have a suspicion that my GPU's aren't being used to their full potential since they're not drawing max power.

zackoch commented 7 years ago

@YanBellavance what is the expected hash rate on those k80's?

YanBellavance commented 7 years ago

holy crap. are you spending 13G's a month? I like your perseverance. What is your objective?...and I just saw your 280 000$ lol wow!!! I guess it's impossible to get some anymore right? There's a 20G$ credit for startups but I don't know if it can apply to cloud.

YanBellavance commented 7 years ago

maybe you should save some of that gas in case I can crack something :D

@zackoch I don't know yet but I am guessing 40% of what the P100 can do. For starters I have to make sure I am getting a full k80, and not half a card on AWS. I read 12GB RAM and 1 GPU. A single k80 has 24GB RAM and 2 dies. Starting a new ubuntu 16 image from scratch today with latest drivers.

@theobolo btw I was able to get 13 MHs on a single g2.2x that I am using at half, so a g2.8x used at half would be 52 MHs.

https://aws.amazon.com/blogs/aws/new-g2-instance-type-with-4x-more-gpu-power/

The guy in the link says he had access to 15GB of ram on a g2.2x. I had: 8GB RAM...does that mean I could get 100 MH/s on a g2.8x by just getting a new image with latest drivers and rebuild?

A dedicated ECU would be advantageous on a p2x16 since there is a 2$/hour/instance extra charge to get dedicated hosting.

then I start modifying the code to optimize. I really hope I dont meet a dead end lol because of the cloud lol :D

theobolo commented 7 years ago

@YanBellavance Yep, it's something like 13MH/s per K80 GPU. For the moment I can't do better than 55MH/s with 4x K80.

https://ethermine.org/miners/5748DbE414c445050715AA2346d13194e748A313

I would imagine something like 100MH/s per worker in my greatest dreams ...

YanBellavance commented 7 years ago

It must definitely be a software issue... that is what I would get on g2.2x. Did you try dedicated hosting?

theobolo commented 7 years ago

What ? with 1 g2.2xlarge you reach 100MH/s ?

YanBellavance commented 7 years ago

No, 13MHs, but I only have access to half a card (1 GPU) on g2.2x and p2.x. So a g2.8x would be 4*13 = 52MH/s and a p2.16x would be 16.7 x 16 ≈ 267MHs? (if it really gives me 16 GPUs lol)

Accounting for the fact that I have only a half k80, then I should be able to tweak it to 30 MHs per die.
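A quick check of the scaling estimates above, assuming the hash rate grows linearly with the number of GPUs. That is an optimistic assumption, and the helper name is made up.

```python
def linear_estimate(per_gpu_mhs: float, gpu_count: int) -> float:
    """Estimated total hash rate if every GPU contributes the same rate."""
    return per_gpu_mhs * gpu_count


g2_8x = linear_estimate(13.0, 4)     # 4 GPUs at the 13 MH/s measured on g2.2x
p2_16x = linear_estimate(16.7, 16)   # 16 GPUs at the 16.7 MH/s K80 average

print(f"g2.8x  estimate: {g2_8x:.1f} MH/s")
print(f"p2.16x estimate: {p2_16x:.1f} MH/s")
```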

YanBellavance commented 7 years ago

Looks like this half-card thing is standard; it's shared as a VM. I gotta contact support because their articles are saying otherwise. He must be using dedicated hosting.

https://aws.amazon.com/blogs/aws/new-g2-instance-type-with-4x-more-gpu-power/

ghost commented 7 years ago

Adding "--cuda-parallel-hash 8 --cuda-streams 16" took the 8x P100 benchmark from 530Mh/s to 560Mh/s, and ethermine.org shows 585MH/s.

But setting --cuda-block-size and --cuda-grid-size doesn't seem to give any real improvement. I need to read more of the Pascal architecture and CUDA 8 whitepapers.

YanBellavance commented 7 years ago

I am making a "driver" for the k80 and P100 to get those/more/all the bells and whistles :)

theobolo commented 7 years ago

@YanBellavance that's great !!! ;)

manpowre commented 7 years ago

There is a cuda8 patch that came out a few days ago, can you guys test that?

zackoch commented 7 years ago

@manpowre link?

@YanBellavance - wait, like you're rewriting the Nvidia driver?! Whaa?

theobolo commented 7 years ago

@manpowre yep link and i'll test that right now !

theobolo commented 7 years ago

@zackoch @manpowre Found it : https://developer.nvidia.com/compute/cuda/8.0/Prod2/patches/2/cuda-repo-ubuntu1604-8-0-local-cublas-performance-update_8.0.61-1_amd64-deb

manpowre commented 7 years ago

@zackoch , can you run your ethminer with this flag: --farm-recheck 2000 .. on your P100's, and report back the different values you get related to mh/s ? thanks..

theobolo commented 7 years ago

Without Patch on 4 x K80 GPU :

With CUDA Patch :

A little bit more effective ...

zackoch commented 7 years ago

@manpowre yes I will try later.

theobolo commented 7 years ago

For the moment the CUDA patch is really improving performance on K80:

Now : 48-52Mh/s per worker on average

Before : 42-46 Mh/s per worker on average

manpowre commented 7 years ago

@YanBellavance --> __dp4a() ? hehe.. I tested this last night and got to the nvidia spec of performance with a separate cuda program. Now I need to identify where in the ethminer CUDA kernels this will fit. There are several complex kernels in our project. Should be fairly easy to implement.

theobolo commented 7 years ago

@YanBellavance Excited trying that 💃

manpowre commented 7 years ago

__dp4a():

  m  10:18:54|ethminer  Mining on PoWhash #8c8c8291 : 191.14MH/s [A0+0:R0+0:F0]
  m  10:18:55|ethminer  Mining on PoWhash #8c8c8291 : 245.75MH/s [A0+0:R0+0:F0]
  m  10:18:56|ethminer  Mining on PoWhash #8c8c8291 : 245.75MH/s [A0+0:R0+0:F0]
  m  10:18:57|ethminer  Mining on PoWhash #8c8c8291 : 232.10MH/s [A0+0:R0+0:F0]
  m  10:18:58|ethminer  Mining on PoWhash #8c8c8291 : 232.10MH/s [A0+0:R0+0:F0]
  m  10:18:59|ethminer  Mining on PoWhash #8c8c8291 : 245.75MH/s [A0+0:R0+0:F0]
  m  10:19:00|ethminer  Mining on PoWhash #8c8c8291 : 232.10MH/s [A0+0:R0+0:F0]
  m  10:19:01|ethminer  Mining on PoWhash #8c8c8291 : 245.75MH/s [A0+0:R0+0:F0]
  m  10:19:02|ethminer  Mining on PoWhash #8c8c8291 : 245.75MH/s [A0+0:R0+0:F0]

Basically 3-4 times the performance on 2x 1080ti, but it's not finding any blocks.. so I broke something, but the code is running, and it also runs in benchmark mode. Can only be done on sm_60 and sm_61 according to nvidia docs.

And I'm using 241-254W on each board.
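For readers who haven't met it, __dp4a is a CUDA intrinsic that treats two 32-bit words as four packed 8-bit integers each, multiplies them lane by lane, and adds the sum of the products to a 32-bit accumulator. The sketch below emulates the signed variant in plain Python to show the arithmetic only. It is not the code from this experiment, the function name is made up, and wrapping the result to 32 bits is an assumption about overflow behavior.

```python
import struct


def dp4a_signed(a: int, b: int, c: int) -> int:
    """Emulate signed dp4a: c + dot product of the four int8 lanes of a and b."""
    # Reinterpret each 32-bit word as four signed bytes (little-endian lane order).
    a_lanes = struct.unpack("<4b", struct.pack("<I", a & 0xFFFFFFFF))
    b_lanes = struct.unpack("<4b", struct.pack("<I", b & 0xFFFFFFFF))
    total = c + sum(x * y for x, y in zip(a_lanes, b_lanes))
    # Wrap to a signed 32-bit result (assumed overflow behavior).
    total &= 0xFFFFFFFF
    return total - (1 << 32) if total >= (1 << 31) else total


# Lanes of 0x01020304 are (4, 3, 2, 1); lanes of 0x01010101 are (1, 1, 1, 1).
print(dp4a_signed(0x01020304, 0x01010101, 10))
```

The example adds the dot product 4 + 3 + 2 + 1 = 10 to the accumulator 10. Since a result like this is an 8-bit dot product rather than the 32-bit mixing a hash normally does, swapping it into a kernel would change what gets computed, which fits with the miner running but not finding blocks.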

ethereum-mining / ethminer #68