ethereum-mining / ethminer

Ethereum miner with OpenCL, CUDA and stratum support
GNU General Public License v3.0

Stale share issue with 0.16.1 and link to search segment randomization #1646

Closed aleqx closed 5 years ago

aleqx commented 5 years ago

EDIT 2018-10-21: The discussion deviated (a lot) from the initial question, but gave rise to a more useful discussion regarding stale ratio comparison between ethminer versions and the role played by randomization in the stale ratio. The reader can start reading here: https://github.com/ethereum-mining/ethminer/issues/1646#issuecomment-431668391

Feature request based on this discussion was posted here: https://github.com/ethereum-mining/ethminer/issues/1650


I've been using my own modded 0.12 (which removes software evals) since last year, and have been trying all the newer versions when you guys posted them to see if it's worth upgrading.

I just tested 0.16.1 on Ubuntu 16.04 and all of a sudden I'm getting this on some cards, not all:

Error CUDA mining: CUDA error in func init at line 128 out of memory

These are GTX 1070 Ti. Out of memory shouldn't be possible. Never got that error with any of the previous versions (tried the final 0.13, 0.14, 0.15).

This happened after a Xid 32 error, which is a recoverable error - simply killing and restarting the cuda app solves it, and that has been the case for all versions until 0.16.1. It seems 0.16.1 allocates memory differently, or something?

EDIT: may be relevant: for 0.16 I'm using cuda 10, while for 0.12 I'm using cuda 9.1.

XhmikosR commented 5 years ago

Usually this comes from OC. Newer drivers push things further sometimes.

aleqx commented 5 years ago

I didn't change the driver. Been using the same driver for months (390.48). The above is an error I haven't seen before ... is it from the newer CUDA?

XhmikosR commented 5 years ago

The question is, do you OC? If you do, try lowering your OC. If not, then you could stick to the old driver if it works.

aleqx commented 5 years ago

Indeed. I did lower my OC and the error went away, but I went back to 0.12 anyway because 0.16.1 gave almost 10x more stale shares on ethermine than 0.12 ... I tested it for 2h, stale shares rose to 6%, then I switched to 0.12 and it went back down to 1%. I sniffed the rpc comms and could see that I was receiving lots of "result":true messages slightly after the new job was already issued (i.e. stale, but not reported as such, typical ethermine). This doesn't happen on 0.12 ...

It's almost as if 0.16 causes ethermine to delay its replies ... doesn't make sense, but something's afoot.

aleqx commented 5 years ago

I'll put that in a new issue in case others met it

AndreaLanfranchi commented 5 years ago

Stale ratio comparison may not be significant unless you're completely sure you connected to the very same IP address.

In addition ... with 0.16 you have to add --noeval CLI argument to compare to your modded 0.12

aleqx commented 5 years ago

I did exactly that (--noeval) and used the IP address instead of fqdn for the pool. Pool response times were also the same (did average + stddev comparison too from logs).
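A comparison of this kind (average plus standard deviation of pool response times taken from logs) could be sketched as follows. This is only an illustration: `LatencyStats` and `summarize` are hypothetical names, and the samples are assumed to be response times in milliseconds already extracted from the logs.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Summary statistics for one batch of pool response times (milliseconds).
struct LatencyStats
{
    double mean = 0.0;
    double stddev = 0.0;  // population standard deviation
};

// Two-pass computation: first the mean, then the spread around it.
// Returns zeros for an empty batch.
LatencyStats summarize(const std::vector<double>& samplesMs)
{
    LatencyStats stats;
    if (samplesMs.empty())
        return stats;

    double sum = 0.0;
    for (double s : samplesMs)
        sum += s;
    stats.mean = sum / static_cast<double>(samplesMs.size());

    double squared = 0.0;
    for (double s : samplesMs)
        squared += (s - stats.mean) * (s - stats.mean);
    stats.stddev = std::sqrt(squared / static_cast<double>(samplesMs.size()));
    return stats;
}
```

Running `summarize` once per miner version over the same pool IP gives two mean/stddev pairs that can be compared directly.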

AndreaLanfranchi commented 5 years ago

I can only give you my personal findings on my rigs.

As far as I know there is not so much difference among 0.16 and 0.17 at socket level nor there are significant differences in solution submissions (only some tricks to compute a smoother HR average).

aleqx commented 5 years ago

With 0.12 my stale ratio is 1% for more than 95% of the time (used to be 3% before they CloudFlare'd it). Now it only goes up to 2% very rarely.

With 0.16.1 it spikes to 6-8% immediately (within 1h, which is ethermine's window) and doesn't drop. Going back to 0.12 it drops back to 1% immediately and doesn't increase. Did that test twice yesterday. This does not sound like coincidence.

I note that I'm using one miner per GPU (not one miner per system). So far I tried 0.16 with CUDA 10, and 0.12 with CUDA 9.1. EDIT: Tried 0.16 with CUDA 9.1 as well, same result.

AndreaLanfranchi commented 5 years ago

I note that I'm using one miner per GPU (not one miner per system).

I recall we discussed this matter earlier somewhere. For a lot of technical reasons ethminer behaves badly when running one instance per GPU: we've developed ethminer with the best possible distribution of nonce ranges to every GPU connected to a single instance. When running multiple instances, be advised that each ethminer instance is completely isolated (there is no knowledge of each other), thus creating these problems:

I understand someone believes that running multiple instances is somehow an assurance that when a GPU stalls or hangs or even falls off the bus the others keep running ... but this is true also for a single instance of ethminer as each GPU does its work on a separate thread.

This said ... I'm not meaning to force you to abandon the 1 instance for 1 GPU model ... but you can't compare different versions which have evolved to a very different path.

AndreaLanfranchi commented 5 years ago

Returning to the opening issue ... "Insufficient memory" ... this is due to the fact the CUDA engine does not reset properly the GPU memory. As you noted a kill of the process or (simply) a reboot of the machine solves the problem.

We're still struggling to manage Xid errors as they're returned by a process which is outside ethminer scope.

aleqx commented 5 years ago

You're wrong about the per-gpu miner (I haven't discussed it, btw) but I'll leave that for another time. To stay on topic, running per-gpu miners should not give a higher stale rate. Simply saying that newer ethminer versions have a better threading model (finally! yay!) is not an explanation nor is it relevant to the stale rate.

Higher rejected rate would have been more expected, if they seeded their nonce search with the same seed, but the rejected rate is zero.

Something else is causing it: See my 3rd message - I sniffed the json-rpc comms and I am seeing that the pool actually replies with "result":true, i.e. share "accepted", many times after it has already issued a new job, i.e. it's an actual stale (despite ethermine.org saying it's accepted -- they do that on purpose). I built statistics after a few hours and was able to see that the number of times this happens under 0.16.1 is much higher than under 0.12 ... at first glance you'd say this is a pool issue, but switching to/from 0.16 from/to 0.12 causes an immediate change in stale rate. I did that A/B test several times now.

I noticed that 0.16 now includes a new field "worker":"foo" in the mining.submit and at first I thought that's the culprit. But I stripped that away and it didn't change anything ...

Right now I can't explain it.

AndreaLanfranchi commented 5 years ago

You're wrong about the per-gpu miner (I haven't discussed it, btw) but I'll leave that for another time.

No I am not but will gladly examine some data instead of opinions.

Higher rejected rate would have been more expected, if they seeded their nonce search with the same seed, but the rejected rate is zero.

There is a misconception about this. Pools do detect duplicate shares (thus rejecting them) if they come in within the scope of the same socket/session. Running one socket per GPU, the pool will not reject duplicate shares as they come from different sessions (it's impossible for an instance of ethminer to produce duplicate shares, while it's highly possible for multiple instances of ethminer to produce duplicate shares). You can easily verify this on a widely used and open source pool (sammy007's) where duplicate rejects are detected at session level. In other words: when two (or more) different sockets send the very same nonce, none of them will receive a "rejected" message. The first socket winning the race will have its share computed as valid, the others will have their shares asynchronously computed as stale.
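To illustrate session-level duplicate detection as described here, a minimal sketch follows. It is not taken from any pool's code: the `Session` class is hypothetical, and it ignores per-job bookkeeping that a real pool would need.

```cpp
#include <cstdint>
#include <unordered_set>

// Hypothetical per-connection state kept by a pool (illustration only).
class Session
{
public:
    // Returns true if the nonce is new for this session, false if this
    // same session already submitted it (a duplicate to be rejected).
    bool submit(std::uint64_t nonce) { return m_seen.insert(nonce).second; }

private:
    std::unordered_set<std::uint64_t> m_seen;  // nonces seen on this socket only
};
```

Because each `Session` keeps its own set, the same nonce sent over two different sockets passes the duplicate check on both; only a repeat on the same socket is caught.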

Start nonce matter: every ethminer instance which is NOT connected to an EthereumStratum/1.0.0 pool (NiceHash) elects as a start searching nonce a randomly selected one plus a segment (which defaults to 2^40 width) for every gpu connected to the same instance. https://github.com/ethereum-mining/ethminer/blob/master/libethcore/Farm.cpp#L83_L91

While this ensures all GPUs connected to a single ethminer instance work on different search segments, if you run a separate instance per GPU you have no guarantee that your GPUs won't end up working on overlapping ranges, because each instance has no knowledge of the start nonce in use by other instances.
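The segment layout described above can be sketched roughly like this. It is an assumption-laden illustration, not the code in Farm.cpp: `assignSegments`, `randomStartNonce` and `segmentsOverlap` are hypothetical helpers, and the default width of 2^40 is the value quoted in the thread.

```cpp
#include <cstdint>
#include <random>
#include <vector>

// Returns the first nonce of each GPU's segment: adjacent, non-overlapping
// ranges of 2^segmentBits nonces laid out from one start nonce.
std::vector<std::uint64_t> assignSegments(
    std::uint64_t startNonce, unsigned gpuCount, unsigned segmentBits = 40)
{
    std::vector<std::uint64_t> starts;
    starts.reserve(gpuCount);
    for (unsigned i = 0; i < gpuCount; ++i)
    {
        // Unsigned arithmetic wraps modulo 2^64, which is fine for nonces.
        starts.push_back(startNonce + (static_cast<std::uint64_t>(i) << segmentBits));
    }
    return starts;
}

// Picks a random 64-bit start nonce, as an isolated instance would.
std::uint64_t randomStartNonce()
{
    std::random_device rd;
    std::mt19937_64 engine(rd());
    return engine();
}

// True if two segments of the same width share at least one nonce.
bool segmentsOverlap(std::uint64_t a, std::uint64_t b, unsigned segmentBits = 40)
{
    const std::uint64_t width = std::uint64_t{1} << segmentBits;
    // Unsigned subtraction handles wrap-around at 2^64.
    return (a - b) < width || (b - a) < width;
}
```

Segments produced by one `assignSegments` call never overlap each other, whereas two independent `randomStartNonce()` draws carry no such guarantee and would have to be checked with `segmentsOverlap`.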

it's an actual stale (despite ethermine.org saying it's accepted -- they do that on purpose)

This statement puzzles me. Stales are processed async. It's not that they do it on purpose: they have no other way to do that, as it would be impossible to reply in real time. A stale share is not a matter related to a single session: it's a global matter which involves all miners in all pools. Suppose I submit a valid share in pool A and, in the same pool, I am the first ... my share could end up being stale because the block has been certified a split millisecond earlier on another pool (remember we all mine on the same block-chain regardless of the pool we choose).

Another point: ethminer marks as stale each share submitted while another job has become current. This is only one of the various reasons for a share to get stale: the easiest to detect. But a share can also become stale when the job has not yet arrived at the miner but is on its way ... (network latency).

Simply saying that newer ethminer versions have a better threading model (finally! yay!) is not an explanation nor is it relevant to the stale rate.

Yes it is ... each instance will keep polling on the network interface to detect incoming data. https://github.com/ethereum-mining/ethminer/blob/master/libpoolprotocols/stratum/EthStratumClient.cpp#L1406_L1420 If you use one instance per gpu you will have all instances racing one against the other for NIC data thus increasing processing lags.

I did that A/B test several times now.

It seems the only test you've not done yet is to run ethminer as a single instance for all your gpus

AndreaLanfranchi commented 5 years ago

Oh ... now I recall ... you're the one with whom I promised to never get involved again. Sorry about that: lots of work on ethminer and many issues ... sometimes I don't recall them all. I'm out. Have a nice day.

aleqx commented 5 years ago

Bottom line is you're right - I should also do the test you suggest to rule out all new variables. But I'm afraid that what you worry about so much above is known as premature optimisation, and is not a bottleneck at all (if it was I'd see high stale rates with 0.12 too).

While this ensures all GPUs connected to a single ethminer instance work on different search segments, if you run a separate instance per GPU you have no guarantee that your GPUs won't end up working on overlapping ranges, because each instance has no knowledge of the start nonce in use by other instances.

Theoretically, yes. However, when grouped, each miner also has a reduced search space. The nonce collision probability is negligibly low in both cases and really nothing to worry about. For 600 GPUs, you are looking at 9.7e-15 collision probability when running one per gpu, versus 1.2e-15 when grouping them into 8 per rig ...

I could ensure each miner works on a different search space segment in my proxy, and I did consider it at first but then realized it's not a bottleneck at all, just premature optimization and a waste of time.

Similarly for the time-to-socket contention. It's a non-issue. First, the probability of at least 2 miners finding a share within the same microsecond is ~0. Then you're comparing nanoseconds-to-microsecond (time to socket access) to tens or hundreds of milliseconds (internet routing to pool). More relevant would be the mining.notify broadcasts, which affects the start time of each miner, but even that's negligible (and can be nicely addressed by a local proxy - I'm doing this, not because I'm worried about stale shares, but because many pools confuse hundreds of connections with DoS attacks and reject the connections).

If any of this was actually relevant then I would also see a high stale rate with 0.12 which I've been running as one-per-gpu, for hundreds of GPUs. But I'm not. I have tangible benefits in terms of recovering from Xid errors without having to restart all GPUs in a rig. I wrote a long post about this a good while back and highlighted to the devs at that time which Xid's are recoverable and how.

Running many hundreds of GPUs is different to running a tiny number ... how many GPUs did you actually test at once?

Re: stale shares: What I meant was that ethermine.org does easily detect stale shares (stale state is measured at the pool - share arrives after the pool generated the next job, regardless of whether the new job reached the miner or not) and they can, of course, signal this to the miner if they want to when sending the "result" json-rpc reply to the mining.submit (they can send "result":false instead of "result":true with some extra parameter to indicate stale instead of reject) but ethermine.org chose not to. They chose to return "result":true for either valid, stale or duplicate shares. They return "result":false only for bad shares (wrong hash). This is what I could infer from studying their behavior when I wrote my proxy code, and from their docs.

Locally you can actually detect if a submitted share was actually stale or not, despite ethermine.org telling you it was accepted, owing to tcp being stateful.
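One way to approximate this local check is to count replies that arrive after a newer job has already been received on the same connection, relying on TCP delivering messages in order. The sketch below assumes a proxy or sniffer that sees every `mining.notify`, every `mining.submit` and every reply; `LateReplyCounter` and its methods are hypothetical names, and the result is a local heuristic rather than the pool's own verdict.

```cpp
#include <cstdint>
#include <unordered_map>

// Counts shares whose pool reply arrived after a newer job had been received.
class LateReplyCounter
{
public:
    // Call when a new job notification (mining.notify) is received.
    void onNewJob() { ++m_jobSeq; }

    // Call when a share with the given JSON-RPC request id is submitted.
    void onSubmit(std::uint64_t requestId) { m_pending[requestId] = m_jobSeq; }

    // Call when the pool's reply for that request id arrives.
    // Returns true if a newer job arrived between submit and reply.
    bool onReply(std::uint64_t requestId)
    {
        auto it = m_pending.find(requestId);
        if (it == m_pending.end())
            return false;  // unknown id: nothing to classify
        const bool late = it->second != m_jobSeq;
        m_pending.erase(it);
        ++m_replies;
        if (late)
            ++m_lateReplies;
        return late;
    }

    std::uint64_t replies() const { return m_replies; }
    std::uint64_t lateReplies() const { return m_lateReplies; }

private:
    std::uint64_t m_jobSeq = 0;       // incremented on every new job
    std::uint64_t m_replies = 0;      // classified replies so far
    std::uint64_t m_lateReplies = 0;  // replies that followed a newer job
    std::unordered_map<std::uint64_t, std::uint64_t> m_pending;  // request id -> job seq at submit
};
```

The ratio `lateReplies() / replies()` over a few hours gives a figure that can be compared between miner versions, independently of what the pool dashboard reports.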

AndreaLanfranchi commented 5 years ago

Only for sake of science

However, when grouped, each miner also has a reduced search space

Not true ... the segment width is at GPU level. Grouped or not, 100 GPUs will always be in charge of scanning (by default values of ethminer) 2^40*100 nonces.

More relevant would be the mining.notify broadcasts

Another topic which may lead to confusion. As we're talking about tcp sockets there is no broadcasting going on (like in UDP). The pool's only option to notify all miners of new jobs is a loop like this (very summarized)

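    // connectedminers: the pool's container of per-socket miner objects
    for (size_t i = 0; i < connectedminers.size(); i++)
    {
        connectedminers[i]->sendjob();  // one write per TCP socket
    }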

This is time consuming: the more sockets connected the more delay in receiving new jobs. You can add load balancers and whatever you want to optimize ... nevertheless due to the nature of tcp you have to process each socket individually.

What I meant was that ethermine.org does easily detect stale shares (stale state is measured at the pool - share arrives after the pool generated the next job,

That's not true. A stale share occurs when you find a share and submit it to the mining pool after the pool's node has already moved on to the next block, i.e. when a new block has been chained in the BC and new block candidates are being processed (this is an async operation occurring among nodes building a consensus). This implies the pool can send you multiple jobs (one for each block candidate) to process and yet does not know if that block will be the final, an uncle or dropped due to invalidity. You can easily see this when the pool sends jobs with almost no interval among them. In other words ... a new job notification does not necessarily imply the previous job being submitted (and under process) will eventually produce a stale solution.

Due to the above at the very moment the pool receives a solution from you it may not know whether or not the solution will be stale as the block is not yet approved by the consensus with other nodes.

they can send "result":false instead of "result":true with some extra parameter to indicate stale instead of reject) but ethermine.org chose not to

AFAIK all pools do the same. It's not an ethermine.org peculiarity

Locally you can actually detect if a submitted share was actually stale or not

Not true either unless you're solo mining with your own full node and do query node for current block number (which ethminer can't detect as the seed hash received in the work package is the hash of the epoch only and carries no information about the block being mined). The miner has no knowledge about the block ... it only receives a hashing instruction with very minimal info: a seed, a header and a boundary.

The solution of marking as stale the solution produced for a job which is not the "latest" job (which is actually what ethminer does) is a dirty and incomplete trick.

Running many hundreds of GPUs is different to running a tiny number ... how many GPUs did you actually test at once?

Last but not least ... more than 10^3. ;)

aleqx commented 5 years ago

True regarding jobs vs blocks - thanks for the correction. Locally I can only flag job-stale shares (not block-stale) which does not necessarily imply stale share as seen by the pool. The pool could indicate to miners the number of stales, but not at share submission time. I guess they chose not to because they didn't want to bother hacking the (shitty) stratum protocol even more ...

2^40*100 nonces.

https://github.com/ethereum-mining/ethminer/blob/master/libethcore/Farm.cpp#L234 says the searchspace is 2^40 ... my apologies, it's 1.6e-7 versus 2.0e-8 collision probability for 600 miners ...

Another topic which may lead to confusion. As we're talking about tcp sockets there is no broadcasting going on (like in UDP).

There is no broadcast in UDP either, but we both know what we're talking about since we're both devs and have written socket-based code. You know what I meant. The pool sends the same mining.notify json-rpc (over tcp) to all your connected miners (loosely labeled 'broadcast' from the stratum protocol point of view, a very poorly designed protocol but alas). While this means 8 instead of 1 messages to the same IP, it's still not a bottleneck (I've done enough test to know). Contention regarding mining.submit is even less of an issue.

The bigger issue that ethminer fixed a while back (I think it was @jean-m-cyr (?)) was to use different threads for mining and stratum. 0.12 is old and single thread, but running one miner per gpu makes up for that bottleneck almost entirely ... I usually only try new versions of ethminer to see if ethminer finally implemented proper Xid recovery. That's by far a much bigger bottleneck and hashrate killer for me than all optimizations that existed since 0.12, without exception.

AndreaLanfranchi commented 5 years ago

hacking the (shitty) stratum protocol even more

Well ... actually there are three flavours of Stratum: pure stratum, eth-proxy compatible and EthereumStratum/1.0.0 (aka NiceHash's). Needless to say the latter is the best and less prolly. Nevertheless all three have an additional boolean flag in the work package which dictates whether or not all previously sent jobs should be abandoned immediately. This gives us a picture of the fact that the miner could work on multiple jobs at a time. No pool apparently values it or it's always true (in ethminer we do not even check it). Unfortunately ethereum's stratum adoption is not a standard and comes from very undocumented, fancy interpretations and free implementations from clients and pools.

searchspace is 2^40 ... my apologies, it's 1.6e-7 versus 2.0e-8 collision probability for 600 miners ...

Probability != certainty.

There is no broadcast in UDP either

Why do you say so ? Well apparently I've written code which implements broadcasting and didn't even know it could not be done. Strangely enough it works. ;)

The pool sends the same mining.notify json-rpc (over tcp) to all your connected miners

Apparently this does not happen. Monitor side by side two or more rigs (or instances) and see if they receive the same jobs with very similar timings. Personally I read different headers, different timings (slightly different). Under these circumstances the most efficient way to find shares is to cluster GPUs in the widest and most stable groups possible. Hardware has a limit ... but with software...

The bigger issue that ethminer finally fixed a while back (I think it was @jean-m-cyr (?)) was to use different threads for mining and stratum.

It's since 0.14 https://github.com/ethereum-mining/ethminer/releases/tag/v0.14.0

I only try new versions of ethminer to see if you guys finally implemented proper Xid recovery.

We're not even near nor, afaik, anyone of us is working on this matter.

aleqx commented 5 years ago

Well ... actually there are three flavours of Stratum

That's just for Eth. There's a myriad of hacks and flavours for other coins. I wouldn't think anything coming out of NiceHash is of any quality, but that's just me. There's also no "standard" whatsoever ... it's just a jungle.

Probability != certainty.

Huh?

Well apparently I've written code which implements broadcasting and didn't even know it could not be done. Strangely enough it works. ;)

I'm curious, why do you prefer antagonizing? Normally folks do it because they yearn to quarrel. Otherwise what exactly do you expect will be the outcome?

Even if mining used UDP, there wouldn't be any actual broadcast (in the IP sense, with a dedicated broadcast address etc).

Apparently this does not happen.

mining.notify has the same job id in param[0], which is sent to all miners. seed and header hashes may be different.

We're not even near nor, afaik, anyone of us is working on this matter.

I wish that was different. From my point of view (large miner) it's a more impactful aspect than any of the improvements/fixes implemented since 0.12. You guys work on whatever you fancy, of course, I'm still enjoying ethminer and am grateful for it.

AndreaLanfranchi commented 5 years ago

That's just for Eth

We're trying to propose an EIP for stratum standardization but apparently there is not much interest in it ... PoS is knocking at the door.

I'm curious, why do you prefer antagonizing? Normally folks do it because they yearn to quarrel. Otherwise what exactly do you expect will be the outcome?

I'm not antagonizing ... I'm just trying to make you understand that a strong statement such as "There is no broadcast in UDP either" is formally wrong. We may argue whether or not this is applicable to mining (well, it can be: I personally have a proxy which actually broadcasts - in local LAN - jobs to a modded version of ethminer which uses UDP for work dispatching).

mining.notify has the same job id in param[0], which is sent to all miners. seed and header hashes may be different.

Sorry my mistake.

From my point of view

Yes ... a POV of an NVIDIA/CUDA miner. Majority of ethminer's users are on AMD.

My personal (!!) feeling about Xid errors is they're 97.5% caused by excessive OC while the rest 2.5% by defective PCI connections. I quite prefer fine tuning at stable OC values rather than having the mining software to recover from Xid errors. Last time I tried to force reset a GPU from an Xid error the cudaDeviceReset() call took roughly 20 minutes (and for sure the GPU was not happy). A cold reboot in those extreme cases is better welcome. I reiterate this is my opinion.

aleqx commented 5 years ago

I quite prefer fine tuning at stable OC values rather than having the mining software to recover from Xid errors.

That to me suggests you haven't dealt with large operations. Keeping high o/c and recovering from errors gives (much) higher average hashrate than "stable" o/c (I explained this in one older post where I detailed the Xid errors) - large scale offers different incentives than home/small mining. Nvidia/AMD is irrelevant here (I own AMD equipment too).

edit: soon enough the majority of eth hashrate will be Bitmain anyway (another company I hate with a passion).

AndreaLanfranchi commented 5 years ago

Do not know how "large" I am ... but I'm quite satisfied.

I explained this in one older post where I detailed the Xid errors

If you want to get your hands dirty and amend the code ... you're welcome. Anyone can submit a PR

aleqx commented 5 years ago

If you have 50+ GH/s and are not interested in maximizing your hashrate then that's your call. 0.5% of that is the equivalent of 8 GTX1070 or about $3-4k investment. But whatever ...

AndreaLanfranchi commented 5 years ago

I'm not allowed to disclose the exact figures as, probably, you're not too. We both have to deal with this great question mark. ;)

aleqx commented 5 years ago

I don't have restrictions, but I'm not curious about your numbers so no worries there. ASICs will own the majority of hashrate soon anyway.

I personally have a proxy which actually broadcasts - in local LAN - jobs to a modded version of ethminer which uses UDP for work dispatching

So you invested time in modding and coding on UDP with no measurable performance benefits. You must have lots of free time on your hands - I envy you (no sarcasm here). I coded my own stratum proxy too (for more than just Eth) mainly because at the time I did it there was no good alternative ...

AndreaLanfranchi commented 5 years ago

So you invested time in modding and coding on UDP with no measurable performance benefits.

This is a free assumption without any basis.

aleqx commented 5 years ago

Ah, come on, you can't seriously claim that using UDP locally has any measurable performance benefits, i.e. measurable higher average valid share rate for the cluster, i.e. measurable higher profits. You may have had other reasons for coding in UDP, sure.

AndreaLanfranchi commented 5 years ago

measurable performance benefits

It has

higher average valid share

It has

measurable higher profits

... by consequence of the above ... it has.

You may have had other reasons for coding in UDP

Yes there are additional reasons.

aleqx commented 5 years ago

Having coded proxies for mining myself, and measured less than 1ms pushing mining.notify messages over tcp to 500+ different miners (that's rtt, ack included), I remain doubtful of those claims. I presume you can't offer proof :)

I'm talking about performance benefits of udp over tcp for a mining proxy (I'm not debating the advantages of a proxy).

AndreaLanfranchi commented 5 years ago

I presume you can't offer proof :)

In fact I can't. It's been a paid job owned by the payer. All stats and instrumentation are under NDA.

I remain doubtful of those claims.

Nothing better than try it yourself.

AndreaLanfranchi commented 5 years ago

Have a nice weekend. I'm leaving till monday (hopefully).

aleqx commented 5 years ago

It's been a paid job owned by the payer

Yeah, I did commissioned work too for various things.

Nothing better than try it yourself.

Shaving off <1ms made absolutely no measurable difference in my case (I achieved that using other means). A proxy has a ton of other advantages though.

Have a nice weekend. I'm leaving till monday (hopefully).

It was nice to chat! And thanks for the pointers - I'll eventually try ethminer with grouped GPUs. I predict I'll see the same stale shares behavior and then I'll be stumped :). I'll let you know.

Have a great weekend break!

ddobreff commented 5 years ago

you can’t talk about connection using udp, you simply push and hope for the best.

AndreaLanfranchi commented 5 years ago

you can’t talk about connection using udp

No one did. We were discussing broadcasting vs serialized processing of all tcp sockets.

aleqx commented 5 years ago

I'll eventually try ethminer with grouped GPUs. I predict I'll see the same stale shares behavior and then I'll be stumped :)

I geared up to record A/B measurements (so I can tell @AndreaLanfranchi that he's wrong) and I let 0.16.1 run overnight first with 1 miner/gpu for about 6h. I was just about to switch to grouped gpus only to discover that ethermine now reports 2% stales (no longer 7%) but also slightly higher valid shares than usual ...

... which totally invalidates the entire complaint about 0.16.1 and the previous A/B tests, and doesn't allow me to tell @AndreaLanfranchi that he's wrong :laughing:

Stale ratio tests are an absolute pest. I wish this chain was more stable.

I can't explain how 2 days ago the A/B tests seemed so conclusive (apart from coincidence in chain fluctuations).

Now that we discussed proxies, here's how I have been testing:

With this setup, 2 days ago when I opened this issue, when I was changing to 0.16.1 the stale ratio was climbing to 7% within 1 hour (ethermine's window); switching back to 0.12 the stale ratio was coming down to 1-2% also within 1h. I did this test 3 times, always with the same result!

Now it's all gone ... I can only blame chain fluctuations and an astonishing sequence of coincidences. @AndreaLanfranchi do you have any other explanation why the previous A/B test looked so conclusive but not now?

I'll let it run for longer with 1miner/gpu before testing grouped gpus.

AndreaLanfranchi commented 5 years ago

... which totally invalidates the entire complaint about 0.16.1 and the previous A/B tests, and doesn't allow me to tell @AndreaLanfranchi that he's wrong 😆

what a relief !

do you have any other explanation why the previous A/B test looked so conclusive but not now?

For instance : the nature of an A/B test is to compare the effectiveness of two different versions of the same "object" tracking the known changes of the smallest possible number of variables.

This is impossible (on ethminer) by definition even within the boundaries of the same version. Why? Simple: every time an ethminer instance starts, a non-deterministic randomizer assigns GPUs new search segments. So you may compare two - simultaneously running - instances only if you're sure they both work on the very same search segment.

In 0.12 the segment assignment was, IMHO, weaker than it is now (0.14+) as the randomizer was run per GPU on every new job (GPUs may overlap). See here. Assuming (on ethermine.org) an avg of 1 job every 12 seconds, on a rig made up of 6 GPUs this implies 43200 randomizations (thus segment jumps) per day.
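The 43200 figure follows from 86400 seconds per day divided by 12 seconds per job, times 6 GPUs. A small sketch of that arithmetic, using a hypothetical helper name, is:

```cpp
#include <cstdint>

constexpr std::uint64_t kSecondsPerDay = 24 * 60 * 60;  // 86400

// Number of per-GPU randomizations per day when every GPU
// re-randomizes its segment on every new job.
constexpr std::uint64_t randomizationsPerDay(std::uint64_t secondsPerJob, std::uint64_t gpuCount)
{
    return (kSecondsPerDay / secondsPerJob) * gpuCount;
}

// 7200 jobs per day times 6 GPUs, the case quoted in the thread.
static_assert(randomizationsPerDay(12, 6) == 43200, "6 GPUs, one job every 12 s");
```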

In 0.14+ we run the randomizer only once (at farm start) and assign all bound GPUs adjacent non-overlapping segments (whose width defaults to 2^40) starting from the randomized start nonce. See here

0.14+ has the advantage of guaranteeing that each GPU bound to a single instance does not overlap the other(s) ... but as a drawback once it picks a segment ... it stays there forever (unless - since 0.16 - the API call miner_setscramblerinfo is invoked). If that segment turns out to be very crowded or overlapping with a significant number of other miners (worldwide) ... well, you may end up experiencing an increased number of stale shares.

Restarting ethminer will re-initialize the randomizer and pick new segments. Thus restarting the very same version may give you 2% stales on a batch, then 5% stales on another, then again 1% on a third and so on. Unfortunately there is no "global pool" which can ensure all miners do not overlap each other. I meant to introduce the API call miner_shuffle to mitigate such problem without the need to restart the miner every time (thus saving all the time needed to rebuild the DAG).

The randomizer thing becomes irrelevant when connected to a pool which implements EthereumStratum/1.0.0 (NiceHash) as in that case start nonce is issued per miner by the pool itself: nevertheless even those pools do not have knowledge of other miner's segments in other pools or solo activities.

To sum up ... you can't know whether or not you're competing for the very same nonces with how many people.

Bottom line : new releases do not produce more stales per se (we've done a lot of efforts on async activities to ensure GPUs work at full throttle separated from IO activities). Stale ratio, on the other hand, is related to the search segment you land onto. Which is a function of the luck factor intrinsic in mining.

aleqx commented 5 years ago

Thanks for this. I didn't study the ethminer code well enough to realize that 0.12 randomizes on every job while 0.14+ only once. Why exactly did you guys opt for this change? Coming from a comms background, randomizing often leads to a better cost function (collision probability alone is a misleading factor).

It may explain why with 0.12 I virtually always end up converging to 1% stale ratio on average, while with 0.16.1 it's a matter of bad luck, especially when running so many individual sessions like I did (! miner/gpu). I guess I can mod the 0.16 code to randomize on every job if I wanted to.

what a relief !

Yes, the highlight of your day.

AndreaLanfranchi commented 5 years ago

Why exactly did you guys opt for this change?

Randomizing is a costly function. Based on the data shown above, a single rig loses roughly 210ms of hashing time every day. That's 1 minute 18 seconds a year per rig. Scale it to a large(!) facility and you may end up losing hours of mining.
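As a quick check of the scale involved, the arithmetic can be worked through as below. The 210 ms per day figure is the rough one quoted above; the facility size of 500 rigs is purely a hypothetical number chosen for illustration.

```python
MS_LOST_PER_DAY = 210  # rough per-rig figure quoted above
DAYS_PER_YEAR = 365


def yearly_loss_seconds(rigs=1, ms_per_day=MS_LOST_PER_DAY):
    """Hashing time lost to per-job randomization over one year, in seconds."""
    return rigs * ms_per_day * DAYS_PER_YEAR / 1000.0


per_rig = yearly_loss_seconds()             # one rig
facility = yearly_loss_seconds(rigs=500)    # hypothetical 500-rig facility
facility_hours = facility / 3600.0
```

With the rounded 210 ms input this gives about 76.65 seconds per rig per year, close to the figure quoted (the daily value is only approximate), and a little over 10 hours per year for the assumed 500 rigs.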

Moreover, the approach of a fixed segment, optionally controlled by API calls, helps greatly in clustering all miners in a single huge segment.

I guess I can mod the 0.16 code to randomize on every job if I wanted to.

Of course.

aleqx commented 5 years ago

Moreover, the approach of a fixed segment, optionally controlled by API calls, helps greatly in clustering all miners in a single huge segment.

That's key ... without that, the cost of bad luck can far outweigh the benefits of less randomization. A good compromise can be reached. How about a command line option to allow the user to specify either per-session randomization, every N seconds randomization, or per-job randomization?

I haven't looked at the code in a long while. Could I leave it as a suggestion for someone more familiar with the code to implement it? :)

AndreaLanfranchi commented 5 years ago

per-session randomization

Assuming every miner expects to work on stable pools with a stable connection, this does not make any sense. Unless you often restart your miner(s), and can afford the cost of the DAG being regenerated every time, a session has the lifecycle of an ethminer run. If your connection to the pool changes frequently ... well, you have other problems.

every N seconds randomization

You can easily set up scheduled cron jobs to invoke the proper API function at the rate you wish. No need to instantiate another timer, parse additional CLI arguments and validate them. There are still too many CLI arguments in ethminer imho.

echo '{"id":0,"jsonrpc":"2.0","method":"miner_shuffle"}' | netcat <your-miner-ip-address> 3333

that's it. Schedule it and you're done.
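For anyone who prefers a script over netcat, the same request could be sent along these lines. This is a sketch under assumptions: the function names are hypothetical, port 3333 is simply the one used in the command above, and the reply is assumed to be a single newline-terminated JSON line.

```python
import json
import socket


def build_request(method, request_id=0):
    """Serialize a JSON-RPC 2.0 request as one newline-terminated line."""
    payload = {"id": request_id, "jsonrpc": "2.0", "method": method}
    return (json.dumps(payload) + "\n").encode("ascii")


def call_miner_api(host, port=3333, method="miner_shuffle", timeout=5.0):
    """Send one request to the miner's API port and return the decoded reply.

    Assumes the API answers with a single newline-terminated JSON line.
    """
    with socket.create_connection((host, port), timeout=timeout) as sock:
        sock.sendall(build_request(method))
        with sock.makefile("r", encoding="utf-8") as reader:
            return json.loads(reader.readline())
```

A cron job can then run a small script that calls `call_miner_api("<your-miner-ip-address>")` at whatever rate suits you.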

per-job randomization

I wouldn't do it for the reasons expressed above: time lost in locking functions. You can easily work around it by monitoring workers on the pool's side and shuffling them when the stale ratio stays high for a long time. I personally think the 1 hour window reported by ethermine.org is too short and undergoes too many spikes to be accurate.
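The monitoring workaround could look something like the sketch below. Everything here is hypothetical (class name, window size, threshold, minimum sample count), and how share results are obtained from the pool is left out entirely; the point is only that a shuffle is triggered by a sustained high ratio rather than a short spike.

```python
from collections import deque


class StaleWatcher:
    """Track recent share results and flag a persistently high stale ratio."""

    def __init__(self, window=1000, threshold=0.03, min_shares=200):
        self.results = deque(maxlen=window)  # True = stale share
        self.threshold = threshold
        self.min_shares = min_shares

    def record(self, stale):
        self.results.append(bool(stale))

    def stale_ratio(self):
        return sum(self.results) / len(self.results) if self.results else 0.0

    def should_shuffle(self):
        # Require enough samples so a short spike does not trigger a shuffle.
        return (len(self.results) >= self.min_shares
                and self.stale_ratio() > self.threshold)

    def reset(self):
        self.results.clear()


watcher = StaleWatcher(window=100, threshold=0.03, min_shares=50)
for i in range(50):
    watcher.record(stale=(i % 10 == 0))  # 5 stale shares out of 50
```

When `should_shuffle()` returns true, the caller would issue the miner_shuffle API call and then `reset()` the watcher so the new segment is judged on its own shares.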

AndreaLanfranchi commented 5 years ago

Could I leave it as a suggestion for someone more familiar with the code to implement it? :)

Of course you can ... open an issue with "Feature Request" template. Due to the nature of the project, however, do not expect a prompt implementation.

aleqx commented 5 years ago

per-session rand is what I understood ethminer does starting with 0.14, isn't it? I may have given it the wrong label, but I meant the currently implemented behavior. One option for that, one for per-job rand, one for time interval-based rand.

per-job rand ensures ergodicity (good!) while being hassle free (good!) and in my case it'll also cost me less than the bad luck of a whopping 6% stale rate I experienced 2 days ago. It would be great to have it as an option for whoever wants to use it.

Many thanks once again for the discussion!

AndreaLanfranchi commented 5 years ago

per-session rand is what I understood ethminer does starting with 0.14, isn't it?

Better to call it "per-run". For me a session means a socket session (the lifecycle of a connection to a pool). A run can have multiple sessions.

aleqx commented 5 years ago

That's a good point. I added it to the suggestion list - can safely randomize whenever a connection or reconnection is established (it's not mining in between sessions anyway, so nothing is lost).

Thanks for the pointer to the API call. Something I can use in the meantime. Is ethminer using a different thread for API calls?

AndreaLanfranchi commented 5 years ago

About the opening topic of this thread (which has gone waaaayyy out of the original context) ... can we close?

aleqx commented 5 years ago

Is there a way to split the discussion we had afterwards into another thread? That was more valuable and with a more useful outcome. If not, I'm inclined to change the topic's title to reflect the stale and randomization issues.

AndreaLanfranchi commented 5 years ago

Afaik it's not possible. Issue threads can't be split.

aleqx commented 5 years ago

Changed title and added note in the 1st post. Closing.

aleqx commented 5 years ago

Hmm, what happens when issuing this API call while it's searching? Does it randomize the next search only, or restart searching when the API call is made? I guess I'll need to get my hands dirty in the end ...

AndreaLanfranchi commented 5 years ago

It will use the new nonce on the next job.