Closed: rplant8 closed this issue 3 years ago
It seems to occur when 2 jobs are received in quick succession: the second job is not detected. It looks like a mutex or synchronization problem. I will investigate.
yes, very similar to the mutex problem. still at times there is "share stats not available" and the latency time is spoiled. but this is rare and I can't diagnose it yet
it seems that it is necessary to show the time with milliseconds when debugging))
I'm considering millisecond timing. I may also add "job not found" as stale instead of rejected.
In your session the stale share is submitted after the second mining.notify. I've never seen this before. In my test I got a couple of stales but the share was submitted before the new job. These can't be avoided.
But your data exposed a window between the miner thread copying new data and clearing the restart flag. If another stratum job is received in that window the miner thread clears the restart flag without knowing its newly copied data is already stale. It should be a simple fix, the restart flag should be cleared immediately after copying new data in case it gets set again before hashing is started. In fact it can be protected by mutex to fully close the window.
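The fix described above can be illustrated with a minimal sketch. This is not the actual cpuminer code; the names (`g_work`, `work_restart`, `publish_new_work`, `get_new_work`) and the single shared flag are assumptions for the example. The point is that the miner thread copies the new work and clears the restart flag inside the same critical section, so a flag set by a second job arriving right after the copy can never be lost.

```c
// Illustrative sketch of the race fix, not the real cpuminer source.
#include <pthread.h>
#include <stdbool.h>
#include <string.h>

typedef struct { unsigned char data[76]; } work_t;

static pthread_mutex_t g_work_lock = PTHREAD_MUTEX_INITIALIZER;
static work_t g_work;                // latest job published by the stratum thread
static bool   work_restart = false;  // set whenever g_work changes

// Miner thread: copy the latest job and clear the restart flag as one
// critical section, closing the window in which a second mining.notify
// could set the flag between the copy and the clear.
static void get_new_work( work_t *local_work )
{
   pthread_mutex_lock( &g_work_lock );
   memcpy( local_work, &g_work, sizeof *local_work );
   work_restart = false;   // cleared immediately after the copy, same lock
   pthread_mutex_unlock( &g_work_lock );
}

// Stratum thread: publish a new job and set the flag under the same lock.
static void publish_new_work( const work_t *new_work )
{
   pthread_mutex_lock( &g_work_lock );
   memcpy( &g_work, new_work, sizeof g_work );
   work_restart = true;
   pthread_mutex_unlock( &g_work_lock );
}
```

Because both sides take the same mutex, a job published after `get_new_work` returns leaves `work_restart` set, and the miner thread will abort its stale data on the next check.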
"Share stats not available" occurs when there are too many (8+) submitted shares pending achnowledgement. The queue overflows and data for unacknowledge shares is lost. This also causes the share counters to get out of sync. There is no synchronization between submitted and acknowledged shares so there's no way to associate an ack to a specific submit. I just match them up in the order they are received. One lost ack screws everything up, there's no way for the miner to detect it.
on the second point: I increased the window to 32, it does not help. but the stratum responds within 3 ms, and shares come fast. it occurs before a sufficient difficulty is set, but it spoils all further statistics.
but I didn't understand the logic there at all. it seems to be a queue, but all operations go only with the last share, even if there is an unprocessed previous one?
Separate put & get pointers point to the last share submitted or last share acked, respectively. If s_put_ptr races ahead and laps s_get_ptr the stats are lost.
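A minimal sketch of that put/get arrangement may make the failure mode clearer. This is illustrative only, not the actual cpuminer code; `QUEUE_SIZE`, the field names, and the `pending` counter are assumptions for the example.

```c
// Illustrative ring of pending share stats with separate put/get indices.
#include <stdbool.h>

#define QUEUE_SIZE 8

typedef struct { int share_num; } share_stats_t;

static share_stats_t queue[QUEUE_SIZE];
static int s_put_ptr = 0;   // next slot for a submitted share
static int s_get_ptr = 0;   // next slot expected to be acked
static int pending   = 0;   // submits not yet acked

// Record stats for a submitted share; returns false when the ring is
// full, i.e. s_put_ptr would lap s_get_ptr and overwrite unacked stats.
// This is the "share stats not available" case.
static bool submit_share( int share_num )
{
   if ( pending == QUEUE_SIZE )
      return false;
   queue[ s_put_ptr ].share_num = share_num;
   s_put_ptr = ( s_put_ptr + 1 ) % QUEUE_SIZE;
   pending++;
   return true;
}

// Acks carry no id, so they are matched purely in arrival order (FIFO).
static bool ack_share( int *share_num )
{
   if ( pending == 0 )
      return false;
   *share_num = queue[ s_get_ptr ].share_num;
   s_get_ptr = ( s_get_ptr + 1 ) % QUEUE_SIZE;
   pending--;
   return true;
}
```

Once a single submit is dropped by the overflow, every later ack pairs with the wrong submit, which is why one lost entry corrupts all subsequent statistics.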
I can't reproduce the problem but I have seen it in the past. That synchronization window has existed since before I forked cpuminer. The sequence is as follows:
The miner threads never see job 2 and continue hashing job 1 until job 3 is received.
The fix is to combine steps 3 & 5 as one atomic operation. This will ensure the restart flags are always synchronized with the state of the miner threads current job. If a second new job is received before hashing is started on the first the miner threads will each abort after the first hash to get the latest job.
The new sequence is:
I have the fix coded but testing is inconclusive because I can't reproduce the problem at will.
I have a stable reproduction within the first hundred shares. the screenshots are from version 3.15.4. can I test the fix?
Wow, new job and submitted share within 1 ms. As long as the mining.notify is before the share is submitted it's a valid reproduction.
Here's a diff of cpu-miner.c, it's good enough for testing most algos. It just moves the clearing of the restart flag into the mutex region for get_new_data.
I'm trying to find way to make it readable, maybe an email?
markup warps everything
Did you get it? It's effectively moving 3 lines of code using comments for the deletions.
testing
it didn't help
Thanks. I need to investigate more.
Is it possible 2 jobs have the same ntime? If so the stratum thread wouldn't recognize the second one as new work. I can provide debug code to test.
yes, ntime is the same. a second is too short for such a fast coin)
you turned off the clean job check a long time ago, this looks like a leftover:

```c
// Safer than testing the job id
bool new_job = *get_stratum_job_ntime() != g_work->data[ algo_gate.ntime_index ];
```
maybe go back to clean job? ntime just does not play a role
Ok, it now makes sense. I don't like using the jobid to test for new work, it doesn't work properly with some algos. I thought ntime was good enough, but it isn't.
It's always been a clumsy test. It needs a rewrite to be triggered anytime a mining.notify is received.
clean_job only applies to a new job id. If clean_job not set the miner may continue mining the old job. All jobs are assumed clean, ie start using the new job data immediately.
I think we should check more. first clean job, if stratum does not ask for a restart, then we probably continue. but it doesn't hurt to check at least the block number, bits, and ntime. many stratums do not know how to properly clean job. I don't know what else
if clean job == false, then check the block number and nbits just in case. if not changed then continue the old job. it is possible to check ntime too
if clean job == true, then restart is better. what if stratum is buggy and doesn't accept old jobids even if nothing has changed
I don't know what you are trying to say. I think you might have things backward. This is how the clean job check works:
```
if ( new jobid && clean job )
   get new work
else   // same job id or unclean new job
   continue mining old work
```
One question I have is if the stratum server can send 2 mining.notify using the same jobid? The protocol logs suggest not, I don't see any.
With tight coupling between mining.notify and new jobs I may have a reliable solution.
you describe it correctly, only right now there is no clean job check in the miner. I haven't seen any jobid duplicates. it would have to be a very wrong stratum, there is no place for such in this world
a preliminary summary. I put a check on stratum.job.clean, no more missing blocks
for good reason. in any case, you need to check all input parameters except the jobid. you can optimize, since comparing everything is too expensive: restart if something has changed, otherwise just change the jobid. clean job can truly lie, for example in many versions of NOMP. the stratum will accept the same share since the jobid has changed, but we calculate the same thing several times
Unclean jobs were never implemented AFAIK, so they're all clean. That's why I removed it. Testing the data is indirect; I want a guaranteed way to know that a mining.notify was received. I tested the following and got 7 stales over 100 shares but they were all due to network latency (new job received after share submitted). All mining.notify were followed by "threads restarted".
miner.h line 448: add a new_job member to the stratum context struct

```c
struct stratum_ctx {
   char *url;

   CURL *curl;
   char *curl_url;
   char curl_err_str[CURL_ERROR_SIZE];
   curl_socket_t sock;
   size_t sockbuf_size;
   char *sockbuf;
   pthread_mutex_t sock_lock;

   double next_diff;
   double sharediff;

   char *session_id;
   size_t xnonce1_size;
   unsigned char *xnonce1;
   size_t xnonce2_size;
   struct stratum_job job;
   struct work work __attribute__ ((aligned (64)));
   pthread_mutex_t work_lock;

   int block_height;
   bool new_job;
} __attribute__ ((aligned (64)));
```
util.c line 2175: set new_job when handling mining.notify

```c
if ( !strcasecmp( method, "mining.notify" ) )
{
   ret = stratum_notify( sctx, params );
   sctx->new_job = true;
   goto out;
}
```
cpuminer.c line 1993: reset new_job when copying to g_work

```c
bool new_job;

// Safer than testing the job id
// bool new_job = *get_stratum_job_ntime()
//             != g_work->data[ algo_gate.ntime_index ];

pthread_rwlock_wrlock( &g_work_lock );
pthread_mutex_lock( &sctx->work_lock );

new_job = sctx->new_job;
sctx->new_job = false;
```
cpuminer.c line 2770: test new_job to call stratum_gen_work

```c
// if ( stratum.job.job_id && ( !g_work_time
//    || ( *get_stratum_job_ntime()
//         != g_work.data[ algo_gate.ntime_index ] ) ) )
if ( stratum.new_job )
   stratum_gen_work( &stratum, &g_work );
```
I did this. a crutch, but it helped. it is necessary to think carefully about how to do it correctly and universally. stratums can't always be trusted. Thanks for the help
That if expression is always true because every job is clean. You're calling stratum_gen_work every loop iteration for any method. It should only be called after handling a mining.notify method. My code above does that.
yes, the hack turned out to be dirty. we need to redo it, but it's still better than it was
What don't you like about my proposal?
my pool has clean implemented) I didn't know that everyone ignored the standard
I still don't know what you're talking about. Clean jobs is not an issue. The miner can ignore the feature and assume all jobs are clean, ie switch to new jobs immediately. There's nothing wrong with that and it has nothing to do with the problem reported in this issue.
I'm just trying to fix the problem, not rewrite the entire stratum client.
The problem originated when I made a change to focus on ntime as the trigger instead of jobid. That was wrong but I made the change because checking jobid has problems on some algos.
I think I have a solution but you seem to be ignoring it and are obsessed with clean jobs.
In testing my latest solution there was one instance of 2 mining.notify messages timestamped to the same second followed by 2 quick share submissions. Both used the correct jobid and were accepted. This is as close as I've come to reproducing the conditions that caused the second job to be ignored.
Technically the jobid would be preferred to detect new work but that involves clumsy string compares. Using mining.notify is a more reliable way to ensure that no jobs are missed.
Both fixes will be in the new release. The lack of mutex protecting the restart flags may not have been the cause of this issue but still had the potential to ignore a new job if it arrived immediately after the previous one.
Do you have any questions before I submit?
IMHO if clean == false, we continue to dig. if the pool does not support clean job, then it does not matter
if clean == true
we can simply restart; in any case it will work. this is the simplest option. cons: if the pool does not implement clean job properly and the block has not changed, then we recalculate the same thing several times. the accepted shares on the pool will be the same, but there will be fewer blocks found (thinking about it, am I right?)
an easy option: check the block number, bits and ntime. if not changed, then just switch to the new jobid and keep counting. is that possible? will there be no blocking?
heavy version: check all parameters from notify. if not changed, then switch to the new jobid and continue counting.
something between options 2 and 3 is possible; it still seems to me that 2 should work
in any case, you need to submit with the latest jobid, otherwise the stratum will not accept it
sugarchain is an extreme option, there will always be 1-2 options on it
option 2.5. compare block number, ntime, bits, string lengths, and transaction count
although the block number and ntime will probably be enough. when nbits changes, ntime should also change
If we get a new unclean job it should have different data in the block header, otherwise it would be a duplicate. The old way was to check the entire block header except for the nonce, a 76-byte memcmp. I tried many things to make it faster, including checking for clean jobs to reduce the number of thread restarts. My decision about the clean job bit came down to risk: if I continue hashing the old job id when I should have switched, I will submit rejects. If I switch to the new job when I could have continued, I restart the miner threads for nothing, but no harm is done and the time cost is less than one hash. In my testing when I was playing with the clean job bit I never saw an instance where I could ignore a job. I don't see any useful purpose in it.
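The "old way" above can be sketched in a few lines. This is illustrative, not the actual cpuminer code: the 80-byte header layout with the 4-byte nonce at the end is an assumption for the example (the real nonce offset varies by algo), and `work_t`/`is_new_work` are hypothetical names.

```c
// Illustrative sketch: detect new work by comparing everything in the
// block header except the trailing nonce.
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

typedef struct { uint8_t header[80]; } work_t;   // 76 header bytes + 4-byte nonce

static bool is_new_work( const work_t *cur, const work_t *incoming )
{
   // Compare bytes 0..75 only; the nonce changes every hash attempt
   // and must be excluded from the new-work test.
   return memcmp( cur->header, incoming->header, 76 ) != 0;
}
```

The cost of this test is one 76-byte compare per job, which is why the thread discusses cheaper single-field indicators like ntime or the block number.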
Block number only changes when there is a new block; most coins are engineered for a 5-minute average block time. nbits only changes when the network difficulty changes, which is less often than the block number. I thought ntime was the most reliable single indicator of new data, but Sugarchain proved me wrong.
My reasoning for the chosen solution is based on the assumption the server never sends duplicate jobs so every mining.notify represents a new job that must be used immediately. Is that assumption incorrect?
i'll probably agree. i think the gain will be minimal in the case of a clever analysis of jobs
Did some testing of sugar at zergpool, got three blocks in the same second followed by an accepted share. This increases my confidence in the fix.
I also noticed that sugar doesn't produce multiple jobs for the same block, all jobs are new blocks. This is not the case for most coins which is why using the block number isn't reliable for new job detection.
```
[2020-12-20 14:11:01] New Block 8375953, Job 2b08b
                      Diff: Net 0.011881, Stratum 0.5064, Target 7.7271e-06
                      TTF @ 2322.80 h/s: Block 6h06m, Share 0m14s
                      Net hash rate (est) 12.76 Mh/s
[2020-12-20 14:11:01] New Block 8375954, Job 2b08c
                      Diff: Net 0.011868, Stratum 0.5064, Target 7.7271e-06
                      TTF @ 2322.80 h/s: Block 6h05m, Share 0m14s
                      Net hash rate (est) 12.74 Mh/s
[2020-12-20 14:11:01] New Block 8375955, Job 2b08d
                      Diff: Net 0.011863, Stratum 0.5064, Target 7.7271e-06
                      TTF @ 2316.76 h/s: Block 6h06m, Share 0m14s
                      Net hash rate (est) 12.74 Mh/s
[2020-12-20 14:11:06] 24 Submitted Diff 2.6607e-05, Block 8375955, Job 2b08d
[2020-12-20 14:11:06] 24 Accepted 24 S0 R0 B0, 45.794 sec (552ms)
```
v3.15.5 is released with 2 fixes discussed above. Let me know if there are any more problems with stale shares.
No problems reported, closing.
broken in version 3.14.2 and up to now
approximately 10% stale shares in version 3.14.2 versus approximately 2% in version 3.14.1. it still shows up well on a very fast coin like sugarchain. how to repeat:

```
cpuminer-sse2.exe -a yespower -o stratum+tcp://stratum-eu.rplant.xyz:7042 -u d.jaydee -D -P -K "Satoshi Nakamoto 31/Oct/2008 Proof-of-work is essentially one-CPU-one-vote" -N 2048 -R 32
```