Closed AndreaLanfranchi closed 6 years ago
@AndreaLanfranchi Yep. I totally screwed up and merged a bad commit. Something I pushed from the wrong machine!!! I've tried to correct the mistake with pull request #594 but too late for the nightly. :-( If you can build you can create the corrected version with
git clone https://github.com/ethereum-mining/ethminer.git
cd ethminer
git pull origin refs/pull/594/head
mkdir build
cd build
etc...
Once again. My apologies for messing up.
If you are running Windows you can get the fixed image at: https://ci.appveyor.com/api/buildjobs/bb99jn7wug4ri7sm/artifacts/build%2Fethminer-0.13.0.dev0-Windows.zip Unfortunately the CI build servers seem to be broken for Linux right now.
Thank you for your work @jean-m-cyr I am already back to home right now. Will apply your suggestion tomorrow or (more likely) on Sunday. I appreciate your efforts.
Best
@AndreaLanfranchi
Will apply your suggestion tomorrow or (more likely) on Sunday.
I don't even know what day of the week it is anymore! :-) Hopefully this will be back into the mainline by Sunday.
@AndreaLanfranchi If it's any consolation, all of this restructuring and turmoil was necessary so that I can get to work on shortening the job switch time for CUDA. It is presently excessively long and variable, sometimes reaching 150 ms. That's time during which the GPUs are not searching.
We already made the path from the time we discover a solution to the time we send it to the pool very light. Now I'd like to work on the other end, minimizing the time it takes for stratum to get the GPUs started on new work. Not only is this path slow, but it can interfere with the sending of solutions. Cutting off the GPUs earlier from working on old jobs might reduce the share rate... who knows?
@jean-m-cyr Many thanks for your work! It really is very appreciated, and I would like to donate. You should think about putting Paypal and ETH addresses for donation; yes, please do include Paypal, as too many folks want to hold on to their coins and are more willing to donate fiat instead.
I was also hit by this but am glad to see it's being worked on! I'm also very happy to hear you're reducing timings inside the code. Hundreds of ms is actually quite a lot, and cutting it could mean fewer stale/rejected shares, especially given the rate at which some pools send you new jobs (it's sometimes every 1-2 seconds, so 100 ms is a good chunk).
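To put numbers on that "good chunk", here is a small sketch of the arithmetic. It assumes the simplest possible model, in which the GPUs do no useful work for the whole switch time; the function name is hypothetical and nothing here is measured from ethminer itself.

```python
def switch_overhead_fraction(switch_ms: float, job_interval_s: float) -> float:
    """Fraction of one job's lifetime spent switching instead of hashing.

    Assumes the GPUs are completely idle for the whole switch time.
    """
    if job_interval_s <= 0:
        raise ValueError("job_interval_s must be positive")
    return (switch_ms / 1000.0) / job_interval_s


if __name__ == "__main__":
    # 100 ms is the figure quoted in this post; 150 ms is the worst case
    # mentioned earlier in the thread. Job intervals of 1-2 s as quoted.
    for interval_s in (1.0, 2.0):
        for switch_ms in (100.0, 150.0):
            share = switch_overhead_fraction(switch_ms, interval_s)
            print(f"{switch_ms:.0f} ms switch, new job every {interval_s:.0f} s: {share:.1%}")
```

Under that model a 100 ms switch costs 5% to 10% of the time available for each job when new jobs arrive every 1-2 seconds.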
Also, could you please, PLEASE, PLEASE, code in some better failure and recovery routines (and watchdog) for when a GPU crashes. I'll move it later to a separate thread for this but here's what I mean (I use Nvidia by the way) - I'd love to get my hands dirty, do a pull and implement these myself in ethminer if I wasn't so busy with other things; I prefer to make a donation to you so you have an incentive to do it:
ethminer is by far my most problematic miner when it comes to GPU errors/crashes. It's not able to recover by itself, and what's worse is that sometimes it doesn't even exit on errors. Other miners (e.g. for equihash) are far better at this and can restart the bad GPU via the driver or at least exit.
The Nvidia driver reports faults to the kernel as so-called "Xid" errors. I'm monitoring the kernel output for these and then force a kill on the ethminer process and restart it. It doesn't work for all Xids though, as some need a driver restart too, but Nvidia, being Nvidia, is utterly pesky and doesn't provide command line tools to allow you to restart a single card if any other cards are being used; the tool they provide requires you to first stop all GPU processes running on any other card, then restart said single GPU, then restart your GPU app on each GPU. Ethminer could do this restart via CUDA and resume mining. The DSTM and Bminer miners do this on equihash and it's a much smoother mining experience.
I'd love it if ethminer had an option to NOT exit on a GPU error, but to keep mining with the other GPUs (kill only that particular thread and keep the rest). Right now I start one ethminer process per GPU in order to achieve this, but it gets a little messy to monitor, especially on machines with lots of GPUs per motherboard.
On Linux, ethminer sometimes hangs completely in a zombie state and can't be killed, yet still consumes a lot of CPU. This persists even after unloading the nvidia driver. It always happens after some GPU has crashed.
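For anyone wanting to try the Xid-monitoring workaround described above, here is a minimal sketch of the parsing side. It is not the poster's actual script. It assumes kernel log lines of the form `NVRM: Xid (PCI:0000:01:00): 31, ...` (some driver versions omit the `PCI:` prefix), and the set of "restart the miner" codes is a hypothetical policy that contains only Xid 31 as an illustration.

```python
import re
from typing import Optional, Tuple

# Assumed kernel log shape: "NVRM: Xid (PCI:0000:01:00): 31, ..."
# The "PCI:" prefix is treated as optional.
XID_RE = re.compile(r"NVRM: Xid \((?:PCI:)?([0-9A-Fa-f:.]+)\): (\d+)")

# Hypothetical policy: Xid codes for which killing and relaunching the
# miner is assumed to be enough. Anything else is escalated.
RESTART_MINER_XIDS = {31}


def parse_xid(line: str) -> Optional[Tuple[str, int]]:
    """Return (pci_address, xid_code) if the line is an Xid report, else None."""
    match = XID_RE.search(line)
    if match is None:
        return None
    return match.group(1), int(match.group(2))


def action_for(line: str) -> str:
    """Map one kernel log line to 'ignore', 'restart-miner' or 'escalate'."""
    parsed = parse_xid(line)
    if parsed is None:
        return "ignore"
    _, code = parsed
    return "restart-miner" if code in RESTART_MINER_XIDS else "escalate"
```

A wrapper could feed this with lines from the kernel log (for example the output of `dmesg --follow`) and kill and relaunch the miner when `action_for` returns `"restart-miner"`. It would not help with the unkillable hang described in the last point.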
@AndreaLanfranchi Thanks for the feedback. I only found this repo about a month ago so this is all new. I work on this software because I use it personally. I tend to focus on areas of personal interest like performance and massively parallel synchronous processing. I don't even know if I have the required skills to address some of the things you talk about, but I'll keep them in mind as I travel this code.
My main objective presently is optimizing Nvidia GPU usage. Per-card recovery would be a great project, but it might be easier said than done. I'm discovering that ethminer is an extremely fragile structure; fragile in the sense that it has a lot of interdependent moving parts, and small changes in one part will often have unintended consequences elsewhere.
@jean-m-cyr thanks for the fix!
Ran overnight and got minimal stales on ethermine.
Can we close it?
Awesome work @jean-m-cyr!
I would say so
I should probably create a new issue for this, but knowing @jean-m-cyr worked on the code for CUDA I'm posting here - please let me know and I'll move it:
After I pulled his commit above (with the fix) I see two new behaviors that weren't there before with 0.12, one which is serious:
CUDA error in func 'search' at line 506 : unspecified launch failure.
✘ 11:33:09|cuda-0 Error CUDA mining: unspecified launch failure
and
CUDA error in func 'search' at line 506 : an illegal memory access was encountered.
✘ 11:36:22|cuda-0 Error CUDA mining: an illegal memory access was encountered
Even though the two errors are different, the Nvidia driver reports the same Xid code: 31 (see list of Xid codes). This is a recoverable error, a memory page fault, and all that's needed is to kill ethminer and relaunch it.
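Since the kill-and-relaunch step has to be triggered by something, here is a minimal sketch of how a wrapper script might recognise the two messages above in the miner's output. The regular expression is derived only from the two lines quoted in this post; other ethminer versions may word the message differently, and the names `CudaError` and `parse_cuda_error` are hypothetical.

```python
import re
from typing import NamedTuple, Optional


class CudaError(NamedTuple):
    func: str      # function named in the message, e.g. "search"
    line: int      # source line number reported by the miner
    message: str   # CUDA error text without the trailing period


# Matches lines shaped like:
#   CUDA error in func 'search' at line 506 : unspecified launch failure.
CUDA_ERROR_RE = re.compile(
    r"CUDA error in func '([^']+)' at line (\d+) : (.+?)\.?\s*$"
)


def parse_cuda_error(text: str) -> Optional[CudaError]:
    """Parse one line of miner output; return None if it is not a CUDA error."""
    match = CUDA_ERROR_RE.search(text)
    if match is None:
        return None
    return CudaError(match.group(1), int(match.group(2)), match.group(3))
```

A supervising script could call `parse_cuda_error` on every output line and relaunch the miner whenever it returns a value, regardless of which of the two messages was printed.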
I compiled the code against CUDA 9.1 with the 390.12 driver. Previously I was running 0.12 binary (downloaded from github) with CUDA 8.0.
I did not use to get these with the previous 0.12 stable. My cards are in a controlled temperature environment (GPUs are 45-55C at full load). The GPUs are overclocked, but I have not changed the overclocking settings at all. They had been running for many weeks without a hitch until now with v0.12.
Does the new code add more stress than 0.12 used to? I recall @jean-m-cyr mentioning something about reducing timings in the new code. If it does add more stress than 0.12, then it's possible that it pushes the GPUs a little closer to the instability region given they are overclocked, whereas 0.12 did not.
Also, is it possible that cuda9.1 and/or the 390.12 driver are at fault? I will do tests myself, but as you know, such tests take a loooong time in order to be conclusive ...
The cards in question are GTX 1070.
@aleqx Yes, that's an error we often see with highly overclocked cards; maybe you should dial down the overclock a bit. I think the new fix does add more stress to the GPU (faster switching, less idle time), so it will offset any potential decrease in performance from dialing down your overclock. I think you'll see an improved hashrate with the new fix, even with a lower overclock.
@MariusVanDerWijden Can we catch this error and report a proper warning?
That would be great, because Xid 31 (see http://docs.nvidia.com/deploy/xid-errors/index.html) is not a fatal error. Mining can actually continue, there shouldn't be any need to kill, reload the DAG, etc. Making it just a warning would be great, but I'm not sure if cuda allows you to see the Xid error code, or actual source of the error - as I said above, the Xid 31 error from the driver causes different errors in ethminer, depending on what ethminer was doing at the time (but they all should be just warnings).
I have just compiled my "nightly" build from MASTER branch.
Unfortunately I have to record some new issues and confirm older ones:
If something goes wrong in one of the GPU threads, the program simply stops responding and stops reporting any hash rate; all GPUs go idle and the process does not exit. There are no visible logs from the halted GPUs or program errors.
The stale share ratio stays 110% above Claymore's baseline.
I still cannot overclock the GPUs to the same level as with Claymore's, which leaves my hashrate a good 10% lower (and the absence of a fee does not compensate for that).
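As a stopgap for the first issue (the process going silent without exiting), an external watchdog can treat a long gap in output as a hang. The sketch below shows only the timing logic. The class name, the timeout value and the idea of calling `saw_output()` for each line read from the miner are assumptions, not anything ethminer provides.

```python
import time
from typing import Callable


class StallDetector:
    """Flags a stall when no output has been seen for `timeout_s` seconds."""

    def __init__(self, timeout_s: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._timeout_s = timeout_s
        self._clock = clock          # injectable so the logic can be tested
        self._last_seen = clock()

    def saw_output(self) -> None:
        """Call this whenever the miner prints a line (e.g. a hash rate)."""
        self._last_seen = self._clock()

    def is_stalled(self) -> bool:
        """True if the miner has been silent for longer than the timeout."""
        return self._clock() - self._last_seen > self._timeout_s
```

A supervising loop could check `is_stalled()` periodically and kill and relaunch the miner when it returns true. As noted earlier in the thread, a hung process cannot always be killed, so this would not cover every case.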
Here is a sample from a 24-hour run on Claymore's 10.
And here is the situation after switching to ethminer on miner "Nostromo". Note the increase in stale shares and the instability.
Please let me know if I can help.