After commits @Jan 19, 2018

AndreaLanfranchi commented 6 years ago

I have just compiled my "nightly" build from MASTER branch.

Unfortunately I have to record some new issues and confirm older ones:

If something goes wrong in one of the GPU threads, the program simply stops responding and stops outputting any hash rate, all GPUs go idle, and the process does not exit. (There are no visible logs from the halted GPUs and no program errors.)
The stale share ratio stays about 110% above Claymore's baseline.
I still cannot overclock the GPUs to the same level as with Claymore's, which keeps my hashrate a good 10% lower (and the absence of a fee does not compensate for that).

Here is a sample of a 24-hour run on Claymore's 10: (screenshot: 2018-01-19 ethermine)

And here is the situation after activating ethminer on the miner "Nostromo". Note the increase in stale shares and the instability: (screenshot: 2018-01-19 ethermine 01)

Please let me know if I can help.

jean-m-cyr commented 6 years ago

@AndreaLanfranchi Yep. I totally screwed up and merged a bad commit. Something I pushed from the wrong machine!!! I've tried to correct the mistake with pull request #594, but it came too late for the nightly. :-( If you can build from source, you can create the corrected version with:

git clone https://github.com/ethereum-mining/ethminer.git
cd ethminer
git pull origin refs/pull/594/head
mkdir build
cd build
etc...

Once again. My apologies for messing up.

jean-m-cyr commented 6 years ago

If you are running Windows, you can get the fixed image at: https://ci.appveyor.com/api/buildjobs/bb99jn7wug4ri7sm/artifacts/build%2Fethminer-0.13.0.dev0-Windows.zip Unfortunately, the CI build servers seem to be broken for Linux right now.

AndreaLanfranchi commented 6 years ago

Thank you for your work @jean-m-cyr. I am already back home right now. I will apply your suggestion tomorrow or (more likely) on Sunday. I appreciate your efforts.

Best

jean-m-cyr commented 6 years ago

@AndreaLanfranchi

> Will apply your suggestion tomorrow or (more likely) on Sunday.

I don't even know what day of the week it is anymore! :-) Hopefully this will be back into the mainline by Sunday.

jean-m-cyr commented 6 years ago

@AndreaLanfranchi If it's any consolation, all of this restructuring and turmoil was necessary so that I can get to work on shortening the job switch time for CUDA. It is presently excessively long and variable, sometimes reaching 150 ms. That is time during which all GPUs sit idle instead of searching.

We have already made the path from the moment we discover a solution to the moment we send it to the pool very light. Now I'd like to work on the other end, minimizing the time it takes for stratum to get the GPUs started on new work. Not only is this path slow, but it can interfere with the sending of solutions. Cutting the GPUs off earlier from working on old jobs might reduce the stale share rate... who knows?

aleqx commented 6 years ago

@jean-m-cyr Many thanks for your work! It really is very appreciated, and I would like to donate. You should think about putting Paypal and ETH addresses for donation; yes, please do include Paypal, as too many folks want to hold on to their coins and are more willing to donate fiat instead.

I was also hit by this but am glad to see it's being worked on! I'm also very happy to hear you're reducing timings inside the code. 100s of ms is actually quite a lot, and cutting it could mean fewer stale/rejected shares, especially given the rate at which some pools send you new jobs (it's sometimes every 1-2 seconds, so 100 ms is a good chunk).
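To put those numbers in perspective, here is a back-of-the-envelope sketch. The function name is made up for illustration, and it assumes the GPUs do no useful work at all while a job switch is in progress, which is a simplification.

```python
def idle_fraction(switch_ms: float, job_interval_s: float) -> float:
    """Fraction of each job interval spent switching jobs instead of hashing.

    Assumption: the GPU is completely idle for the whole switch time.
    """
    return (switch_ms / 1000.0) / job_interval_s


# Compare a few job intervals for a 100 ms and a 150 ms switch time.
for interval_s in (1.0, 2.0, 15.0):
    for switch_ms in (100, 150):
        share = idle_fraction(switch_ms, interval_s)
        print(f"{switch_ms} ms switch, new job every {interval_s:.0f} s: {share:.1%} idle")
```

Under that assumption, a 100 ms switch with a new job every second costs a tenth of the search time, while the same switch with a job every 15 seconds costs well under one percent.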

Also, could you please, PLEASE, PLEASE, code in some better failure and recovery routines (and watchdog) for when a GPU crashes. I'll move it later to a separate thread for this but here's what I mean (I use Nvidia by the way) - I'd love to get my hands dirty, do a pull and implement these myself in ethminer if I wasn't so busy with other things; I prefer to make a donation to you so you have an incentive to do it:

ethminer is by far my most problematic miner when it comes to GPU errors/crashes. It's not able to recover by itself, and what's worse is that sometimes it doesn't even exit on errors. Other miners (e.g. for equihash) are far better at this and can restart the bad GPU via the driver or at least exit.
The Nvidia driver reports faults to the kernel as so-called "Xid" errors. I'm monitoring the kernel output for these and then force-kill the ethminer process and restart it. It doesn't work for all Xids though, as some need a driver restart too, but Nvidia, being Nvidia, is utterly pesky and doesn't provide command line tools that let you restart a single card while any other cards are in use; the tool they provide requires you to first stop all GPU processes running on any other card, then restart said single GPU, then restart your GPU app on each GPU. Ethminer could do this restart via CUDA and resume mining. The DSTM and Bminer miners do this on equihash and it's a much smoother mining experience.
I'd love it if ethminer had an option to NOT exit on a GPU error, but to keep mining with the other GPUs (kill only that particular thread but keep the rest). Right now I start one ethminer process per GPU in order to achieve this, but it gets a little messy to monitor, especially on machines with lots of GPUs per motherboard.
On Linux, ethminer sometimes hangs completely in a zombie state and can't be killed, but still consumes a lot of CPU. This persists even after unloading the nvidia driver. It always happens after some GPU has crashed.
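The kernel-log monitoring described above can be sketched roughly as follows. This is only an illustration, not the script actually in use: it assumes the driver logs Xid events in the form `NVRM: Xid (PCI:0000:01:00): 31, ...`, that `dmesg --follow` is available and readable by the current user, and the function names are hypothetical.

```python
import re
import subprocess

# Assumed kernel log format: "NVRM: Xid (PCI:0000:01:00): 31, Ch 00000010, ..."
XID_RE = re.compile(r"NVRM: Xid \((?:PCI:)?([0-9A-Fa-f:.]+)\): (\d+)")


def parse_xid(line):
    """Return (pci_address, xid_code) if the line reports an Xid, else None."""
    match = XID_RE.search(line)
    if match is None:
        return None
    return match.group(1), int(match.group(2))


def watch(on_xid):
    """Follow the kernel log and call on_xid(pci_address, code) for each Xid.

    Blocks for as long as `dmesg --follow` keeps running.
    """
    proc = subprocess.Popen(
        ["dmesg", "--follow"], stdout=subprocess.PIPE, text=True
    )
    for line in proc.stdout:
        hit = parse_xid(line)
        if hit is not None:
            on_xid(*hit)
```

A callback passed to `watch` could, for example, run `pkill -KILL ethminer` and leave the relaunch to whatever supervises the miner. As noted above, that is not enough for the Xids that also need a driver restart.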

jean-m-cyr commented 6 years ago

@AndreaLanfranchi Thanks for the feedback. I only found this repo about a month ago so this is all new. I work on this software because I use it personally. I tend to focus on areas of personal interest like performance and massively parallel synchronous processing. I don't even know if I have the required skills to address some of the things you talk about, but I'll keep them in mind as I travel this code.

My main objective presently is optimizing Nvidia GPU usage. Per-card recovery would be a great project, but it might be easier said than done. I'm discovering that ethminer is an extremely fragile structure; fragile in the sense that it has a lot of interdependent moving parts, and small changes in one part will often have unintended consequences elsewhere.

ghost commented 6 years ago

@jean-m-cyr thanks for the fix!

ghost commented 6 years ago

Ran overnight and got minimal stales on ethermine.

chfast commented 6 years ago

Can we close it?

Awesome work @jean-m-cyr!

jean-m-cyr commented 6 years ago

I would say so

aleqx commented 6 years ago

I should probably create a new issue for this, but knowing @jean-m-cyr worked on the code for CUDA I'm posting here - please let me know and I'll move it:

After I pulled his commit above (with the fix) I see two new behaviors that weren't there before with 0.12, one which is serious:

the good: hashrate seems more stable, and ever so slightly higher
the bad: I'm getting quite a lot more crashes -- 17 in the past 24h with the new ethminer, compared to virtually 0 in the past 3 weeks with v0.12 -- of two kinds:
```
CUDA error in func 'search' at line 506 : unspecified launch failure.
✘  11:33:09|cuda-0    Error CUDA mining: unspecified launch failure
```
and
```
CUDA error in func 'search' at line 506 : an illegal memory access was encountered.
✘  11:36:22|cuda-0    Error CUDA mining: an illegal memory access was encountered
```
Even though the two errors are different, the Nvidia driver reports the same Xid code: 31 (see list of Xid codes). This is a recoverable error, a memory page fault, and all that's needed is to kill ethminer and relaunch it.
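Since a kill and relaunch is all that is needed here, a small wrapper could do it automatically. The sketch below is a hedged illustration rather than anything ethminer provides: it matches only the `CUDA error in func ...` line format shown in the logs above, the function names are hypothetical, and the command passed in is whatever you normally use to start the miner.

```python
import re
import subprocess

# Matches lines like:
# CUDA error in func 'search' at line 506 : unspecified launch failure.
CUDA_ERR_RE = re.compile(
    r"CUDA error in func '(\w+)' at line (\d+) : (.+?)\.?\s*$"
)


def parse_cuda_error(line):
    """Return (function, line_number, message) for a CUDA error line, else None."""
    match = CUDA_ERR_RE.search(line)
    if match is None:
        return None
    return match.group(1), int(match.group(2)), match.group(3)


def run_until_cuda_error(cmd):
    """Run cmd, echo its output, and kill it on the first CUDA error line.

    Returns the parsed error, or None if the process exited without one.
    """
    proc = subprocess.Popen(
        cmd, stdout=subprocess.PIPE, stderr=subprocess.STDOUT, text=True
    )
    for line in proc.stdout:
        print(line, end="")
        err = parse_cuda_error(line)
        if err is not None:
            proc.kill()
            proc.wait()
            return err
    proc.wait()
    return None


def supervise(cmd, max_restarts=10):
    """Relaunch cmd after each CUDA error, up to max_restarts launches."""
    for _ in range(max_restarts):
        if run_until_cuda_error(cmd) is None:
            break  # the process exited without reporting a CUDA error
```

Calling `supervise` with your usual miner command line as a list of arguments would relaunch it after each such error. The restart cap is there so that a card that fails on every launch does not loop forever.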

I compiled the code against CUDA 9.1 with the 390.12 driver. Previously I was running 0.12 binary (downloaded from github) with CUDA 8.0.

I did not use to get these with the previous 0.12 stable. My cards are in a controlled temperature environment (GPUs are 45-55C at full load). The GPUs are overclocked, but I have not changed the overclocking settings at all. They had been running for many weeks without a hitch until now with v0.12.

Does the new code add more stress than 0.12 used to? I recall @jean-m-cyr mentioning something about reducing timings in the new code. If it does add more stress than 0.12, then it's possible that it pushes the GPUs a little closer to the instability region given that they are overclocked, whereas 0.12 did not.

Also, is it possible that CUDA 9.1 and/or the 390.12 driver are at fault? I will do tests myself, but as you know, such tests take a loooong time in order to be conclusive ...

The cards in question are GTX 1070.

MariusVanDerWijden commented 6 years ago

@aleqx Yes, that's an error we often see with highly overclocked cards; maybe you should dial down the overclock a bit. I think the new fix does add more stress to the GPU (faster switching, less idle time), so it should offset any potential decrease in performance from dialing down your overclock. I think you'll see an improved hashrate with the new fix, even with a lower overclock.

chfast commented 6 years ago

@MariusVanDerWijden Can we catch this error and report a proper warning?

aleqx commented 6 years ago

That would be great, because Xid 31 (see http://docs.nvidia.com/deploy/xid-errors/index.html) is not a fatal error. Mining can actually continue; there shouldn't be any need to kill, reload the DAG, etc. Making it just a warning would be great, but I'm not sure whether CUDA lets you see the Xid error code or the actual source of the error. As I said above, the Xid 31 error from the driver causes different errors in ethminer, depending on what ethminer was doing at the time (but they should all be just warnings).

ethereum-mining / ethminer

After commits @Jan 19, 2018 #595