ethereum-mining / ethminer

Ethereum miner with OpenCL, CUDA and stratum support
GNU General Public License v3.0

Stale share issue with 0.16.1 and link to search segment randomization #1646

Closed: aleqx closed this issue 5 years ago

aleqx commented 5 years ago

EDIT 2018-10-21: The discussion deviated (a lot) from the initial question, but gave rise to a more useful discussion regarding stale ratio comparison between ethminer versions and the role played by randomization in the stale ratio. The reader can start reading here: https://github.com/ethereum-mining/ethminer/issues/1646#issuecomment-431668391

Feature request based on this discussion was posted here: https://github.com/ethereum-mining/ethminer/issues/1650


I've been using my own modded 0.12 (which removes software evals) since last year, and have been trying all the newer versions when you guys posted them to see if it's worth upgrading.

I just tested 0.16.1 on Ubuntu 16.04 and all of a sudden I'm getting this on some cards, not all:

Error CUDA mining: CUDA error in func init at line 128 out of memory

These are GTX 1070 Ti. Out of memory shouldn't be possible. Never got that error with any of the previous versions (tried the final 0.13, 0.14, 0.15).

This happened after a Xid 32 error, which is a recoverable error - simply killing and restarting the cuda app solves it, and that has been the case for all versions until 0.16.1. It seems 0.16.1 allocates memory differently, or something?

EDIT: may be relevant: for 0.16 I'm using cuda 10, while for 0.12 I'm using cuda 9.1.

aleqx commented 5 years ago

Just when I loaded it all in CLion ... thanks :+1:

aleqx commented 5 years ago

There are still too many CLI arguments in ethminer imho.

There could be a --help and --help-advanced and by default --help could show only the common options most folks are interested in.

AndreaLanfranchi commented 5 years ago

Already done on my refactored repo

aleqx commented 5 years ago

A CLI option to set m_nonce_scrambler for each instance would also be nice.

Edit: https://github.com/ethereum-mining/ethminer/blob/master/docs/API_DOCUMENTATION.md#miner_setscramblerinfo ... nice!

aleqx commented 5 years ago

Hmm, in 0.16 I'm seeing lots of:

00:19:56 stratum  **Accepted(stale) 112 ms. proxy [192.168.10.101:4444]

note the "stale"

The pool doesn't indicate any stales (all are accepts) so ethminer here must be doing some local interpretation of timestamps of the pool reply and whether it arrived before/after the current job broadcast, but I thought you said it doesn't do that ... so how is the above "stale" computed?

p.s. yes, i should pull the code and get my hands dirty

AndreaLanfranchi commented 5 years ago

> The pool doesn't indicate any stales (all are accepts) so ethminer here must be doing some local ...

A "stale" share is treated differently by different pools. The definition of "stale" is "a solution which has arrived late ...": late relative to what, you may ask? The correct definition is "a solution for a block which is no longer the current block being mined" or, in other words, "a solution found for a block which has already been mined and is permanently inserted in the blockchain".

Unfortunately ethminer has very limited means of determining which block number is being mined: 90% of pools, in fact, do not push the block number in the mining.notify message. So it relies on a very simple principle: "if the solution found relates to a job id which is not the latest job id received (and this happens frequently, as jobs often arrive very close to each other), then mark the solution as stale".

On the pool side, the accounting of stale shares is an async process, so some pools "reject" stale shares, while others accept them and only later account them as stale and show the stale percentage on their stats pages.
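For illustration, the job-id principle described above can be sketched in a few lines. This is Python for brevity, not ethminer's actual C++ code; the `Solution` type, the function name and the job ids are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Solution:
    job_id: str  # job the GPU was searching when it found the nonce
    nonce: int


def is_stale_by_job_id(solution: Solution, latest_job_id: str) -> bool:
    """Flag a solution as stale when its job is not the newest job received."""
    return solution.job_id != latest_job_id


# Hypothetical scenario: job "a1" arrives, then job "a2" shortly afterwards.
latest_job_id = "a2"
late = Solution(job_id="a1", nonce=0x1234)
fresh = Solution(job_id="a2", nonce=0x5678)

print(is_stale_by_job_id(late, latest_job_id))   # True
print(is_stale_by_job_id(fresh, latest_job_id))  # False
```

Note that the check knows nothing about block numbers, so it can only say that a newer job exists, not that the block has changed.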

aleqx commented 5 years ago

That's what I was doing in my proxy too, but according to our previous discussion in this thread, we concluded it's not accurate to label stales based solely on the job id, since we don't know whether two different job ids are from the same block or not. Ethermine does not indicate the block number in its mining.submit broadcast, so the stales declared by ethminer may not be actual stales at the pool.
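To make the difference concrete, here is a rough sketch comparing the job-id check with a block-number check. It assumes the miner somehow knows the block number behind each job, which most pools did not provide; the `job_block` mapping, the function names and the block numbers are made up for illustration.

```python
from typing import Dict, Optional


def stale_by_job_id(solution_job: str, latest_job: str) -> bool:
    """Job-id heuristic: anything that is not the newest job counts as stale."""
    return solution_job != latest_job


def stale_by_block(
    solution_job: str, latest_job: str, job_block: Dict[str, int]
) -> Optional[bool]:
    """Block-based check: stale only if the solution's block is older.

    Returns None when either block number is unknown.
    """
    solution_block = job_block.get(solution_job)
    latest_block = job_block.get(latest_job)
    if solution_block is None or latest_block is None:
        return None
    return solution_block < latest_block


# Hypothetical jobs: "a1" and "a2" belong to the same block, "b1" to the next.
job_block = {"a1": 6550000, "a2": 6550000, "b1": 6550001}

print(stale_by_job_id("a1", "a2"))            # True  (labelled stale)
print(stale_by_block("a1", "a2", job_block))  # False (same block, not stale)
print(stale_by_block("a1", "b1", job_block))  # True  (block has moved on)
print(stale_by_block("zz", "b1", job_block))  # None  (block unknown)
```

The first two lines show the mismatch: the job-id check calls the share stale even though both jobs belong to the same block.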

AndreaLanfranchi commented 5 years ago

> Ethermine does not indicate the block number in its mining.submit broadcast,

That's not completely accurate anymore (as of a couple of days ago) ... look at the fourth parameter in the mining.notify message. ;)

AndreaLanfranchi commented 5 years ago

Anyway I'm going to abandon counting stale shares on the miner's side: the only trustworthy place to account for them is at the pool (or, better, at the node level). Counting at the miner has no meaning and only gives the user a false perception of efficiency: for instance, a share submitted by the miner, which has not counted it as stale, still has plenty of time to become stale (the whole transit time to the pool, from 10 to hundreds of ms depending on the quality of the connection, plus the time the pool takes to validate it).

jean-m-cyr commented 5 years ago

With the improvements made to the GPU search methods, miner-detected stale shares are approaching 0 anyway.

aleqx commented 5 years ago

Well, labelling stale/accepted is done upon receiving the pool's reply to the miner's previous mining.submit ... it depends on the pool and the connection to the pool, and it's unreliable anyway. I'd remove the stale label in the miner completely if it was up to me, and just relay whatever the pool says.

jean-m-cyr commented 5 years ago

The pool doesn't say anything we can relay. It reports all valid shares as ok, and makes the stale determination silently.

aleqx commented 5 years ago

The pool says either ok or error (accept or reject). Relay whatever the pool says and don't try to make any local inferences since they are bound to be inaccurate. No more stales. Let the user check the stale rate at the pool after a sufficiently high number of blocks.

I'm agreeing with Andrea, is what I mean.

jean-m-cyr commented 5 years ago

The only time the pool says reject is when the miner submits an invalid solution, otherwise it responds with accept. We count rejects under the failed category.

AndreaLanfranchi commented 5 years ago

Not true. NiceHash for example rejects stales (with explicit "Stale" message).

AndreaLanfranchi commented 5 years ago

And an "invalid" solution can mean several things: for instance, it may be invalid in the sense that it's badly computed, or because the job it refers to is not found. How many jobs are kept in the stack for each session depends on the pool implementation.

aleqx commented 5 years ago

> or because the job it refers to is not found

They usually mean stale by that, but yes, it's still a rejected share.

@jean-m-cyr I meant to just relay what the pool says, whatever that is, ok or reject or whatever else (in some other algos pools also indicate bad shares, i.e. wrongly computed). It's pretty simple. Then count all "ok" in the A category, and all non-ok in the R category and get rid of the F category.
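Roughly something like the sketch below, in Python for brevity. It assumes the pool answers each submit with a JSON-RPC style reply carrying `result` and `error` fields; the exact shape of the error payload differs between pools, so the one shown here is hypothetical.

```python
import json
from collections import Counter


def classify_reply(raw: str) -> str:
    """Return 'A' for an accepted share and 'R' for anything else."""
    reply = json.loads(raw)
    accepted = reply.get("result") is True and reply.get("error") is None
    return "A" if accepted else "R"


# Hypothetical pool replies to three consecutive submits.
replies = [
    '{"id": 4, "result": true, "error": null}',
    '{"id": 5, "result": false, "error": [21, "Stale", null]}',
    '{"id": 6, "result": true, "error": null}',
]

counts = Counter(classify_reply(raw) for raw in replies)
print(counts["A"], counts["R"])  # 2 1
```

No local inference happens here: the miner only tallies what the pool replied.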

AndreaLanfranchi commented 5 years ago

> They usually mean stale by that, but yes, it's still a rejected share.

Well, it's an assumption I would not rely on. Some pools based on geth nodes may regenerate several works for the same block. If the stack of jobs is not deep enough to hold all jobs for a block, you may end up with a share marked as stale which is not.
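A toy model of that situation, as a sketch only: the `JobStack` class, its depth and the job ids are hypothetical and do not describe any real pool's implementation.

```python
from collections import deque


class JobStack:
    """Pool-side memory of the most recent jobs for one session."""

    def __init__(self, depth: int) -> None:
        # A bounded deque silently drops the oldest job once it is full.
        self._jobs = deque(maxlen=depth)

    def push(self, job_id: str, block: int) -> None:
        self._jobs.append((job_id, block))

    def check(self, job_id: str) -> str:
        current_block = self._jobs[-1][1]
        for known_id, block in self._jobs:
            if known_id == job_id:
                return "ok" if block == current_block else "stale"
        return "job not found"


# Three works regenerated for the same block, but the stack keeps only two.
stack = JobStack(depth=2)
for job_id in ("j1", "j2", "j3"):
    stack.push(job_id, block=100)

print(stack.check("j1"))  # job not found (yet block 100 is still current)
print(stack.check("j2"))  # ok
```

A share for `j1` is rejected here even though its block is still the one being mined, so reading "job not found" as "stale" would overcount stales.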

AndreaLanfranchi commented 5 years ago

@jean-m-cyr CLMINER does not even evaluate staleness https://github.com/ethereum-mining/ethminer/blob/master/libethash-cl/CLMiner.cpp#L416

jean-m-cyr commented 5 years ago

True, and we should turn it off for CUDA as well. But then... all of that tedious formatting stuff needs to be redone everywhere we display.