JayDDee / cpuminer-opt

Optimized multi algo CPU miner

at times the work is not restarted when getting a new block, resulting in "job not found" shares #282

Closed rplant8 closed 3 years ago

rplant8 commented 3 years ago

broken in version 3.14.2 and up to now

Approximately 10% of late shares in version 3.14.2 versus approximately 2% in version 3.14.1. It still shows up well on the very fast sugarchain coin. How to repeat:

cpuminer-sse2.exe -a yespower -o stratum+tcp://stratum-eu.rplant.xyz:7042 -u d.jaydee -D -P -K "Satoshi Nakamoto 31/Oct/2008 Proof-of-work is essentially one-CPU-one-vote" -N 2048 -R 32

[Screenshot: 2020-12-18 00:50:25]
JayDDee commented 3 years ago

It seems to occur when 2 jobs are received in quick succession; the second job is not detected. It looks like a mutex or synchronization problem. I will investigate.

rplant8 commented 3 years ago

Yes, it looks very much like a mutex problem. There are also occasional "share stats not available" messages and the latency time is spoiled, but this is rare and I can't diagnose it yet.

rplant8 commented 3 years ago

It seems it would be necessary to show timestamps with milliseconds when debugging.

JayDDee commented 3 years ago

I'm considering millisecond timing. I may also count "job not found" as stale instead of rejected.

In your session the stale share is submitted after the second mining.notify. I've never seen this before. In my test I got a couple of stales, but those shares were submitted before the new job arrived; those can't be avoided.

But your data exposed a window between the miner thread copying new data and clearing the restart flag. If another stratum job is received in that window, the miner thread clears the restart flag without knowing its newly copied data is already stale. It should be a simple fix: clear the restart flag immediately after copying new data, in case it gets set again before hashing starts. In fact the whole operation can be protected by a mutex to fully close the window.
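A minimal sketch of the idea, assuming simplified stand-ins for cpuminer's structures (the names g_work, work_lock, and restart here are illustrative, not the actual patch):

```c
#include <pthread.h>
#include <stdbool.h>
#include <string.h>

struct work { unsigned char data[128]; };

static struct work g_work;                 /* latest job from the stratum thread */
static pthread_mutex_t work_lock = PTHREAD_MUTEX_INITIALIZER;
static volatile bool restart;              /* set by the stratum thread on a new job */

/* Miner thread: copy the new job and clear the restart flag inside the
 * same critical section, so a mining.notify that arrives between the
 * copy and the clear cannot be silently lost. */
static void get_new_work(struct work *local)
{
    pthread_mutex_lock(&work_lock);
    memcpy(local, &g_work, sizeof *local);
    restart = false;                       /* cleared atomically with the copy */
    pthread_mutex_unlock(&work_lock);
}
```

If the stratum thread takes the same lock when it writes g_work and sets restart, a second job arriving after the copy leaves the flag set, and the miner thread aborts after its first hash to fetch the newer job.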

"Share stats not available" occurs when there are too many (8+) submitted shares pending achnowledgement. The queue overflows and data for unacknowledge shares is lost. This also causes the share counters to get out of sync. There is no synchronization between submitted and acknowledged shares so there's no way to associate an ack to a specific submit. I just match them up in the order they are received. One lost ack screws everything up, there's no way for the miner to detect it.

rplant8 commented 3 years ago

On the second point: I increased the window to 32, it does not help. But the stratum round trip is within 3 ms, and the shares come fast. It occurs before a sufficient difficulty is set, but it spoils all further statistics.

rplant8 commented 3 years ago

But I didn't understand the logic there at all. It seems to be a queue, but all operations go only through the last share, even if there is an unprocessed previous one?

JayDDee commented 3 years ago

Separate put & get pointers point to the last share submitted and the last share acked, respectively. If s_put_ptr races ahead and laps s_get_ptr the stats are lost.
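A minimal sketch of such a ring buffer (the struct fields and queue size are assumptions for illustration; only s_put_ptr and s_get_ptr are named in the discussion). When the put pointer laps the get pointer, the oldest unacknowledged entry is silently overwritten:

```c
#define S_QUEUE_SIZE 8                      /* pending-share capacity */

struct share_stats { double diff; long submit_ms; };

static struct share_stats s_queue[S_QUEUE_SIZE];
static int s_put_ptr, s_get_ptr;            /* indexes into s_queue */

/* Record a submitted share; overwrites silently if the queue is full. */
static void share_put(struct share_stats s)
{
    s_queue[s_put_ptr] = s;
    s_put_ptr = (s_put_ptr + 1) % S_QUEUE_SIZE;
}

/* Match the next acknowledgement to the oldest recorded share.
 * Acks are matched purely by arrival order, so one lost ack shifts
 * every later match off by one. */
static struct share_stats share_get(void)
{
    struct share_stats s = s_queue[s_get_ptr];
    s_get_ptr = (s_get_ptr + 1) % S_QUEUE_SIZE;
    return s;
}
```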

JayDDee commented 3 years ago

I can't reproduce the problem but I have seen it in the past. That synchronization window has existed since before I forked cpuminer. The sequence is as follows:

  1. stratum thread receives new job 1, sets restart flags for all miner threads
  2. miner threads detect restart flag and abort current work
  3. miner threads copy new work for job 1
  4. stratum thread receives new job 2, sets restart flags for all miner threads
  5. miner threads clear their own restart flag
  6. miner threads start hashing job 1
  7. restart flag is clear until job 3 is received so keep hashing

The miner threads never see job 2 and continue hashing job 1 until job 3 is received.

The fix is to combine steps 3 & 5 into one atomic operation. This ensures the restart flags are always synchronized with the state of the miner threads' current job. If a second new job is received before hashing is started on the first, the miner threads will each abort after the first hash to get the latest job.

The new sequence is:

  1. stratum thread receives new job 1, sets restart flags for all miner threads
  2. miner threads detect restart flag and abort current work
  3. miner threads copy new work for job 1 and clear restart flag
  4. stratum thread receives new job 2, sets restart flags for all miner threads
  5. miner threads start hashing job 1
  6. miner threads detect restart flag and abort current work
  7. miner threads copy new work for job 2 and clear restart flag
  8. miner threads start hashing job 2

I have the fix coded but testing is inconclusive because I can't reproduce the problem at will.

rplant8 commented 3 years ago

I have a stable reproduction in the first hundred shares. in the screenshots, version 3.15.4. can I test the fix?

[Screenshots: 2020-12-19 00:21:57, 2020-12-19 00:19:13]
JayDDee commented 3 years ago

Wow, new job and submitted share within 1 ms. As long as the mining.notify arrives before the share is submitted, it's a valid reproduction.

Here's a diff of cpu-miner.c, it's good enough for testing most algos. It just moves the clearing of the restart flag into the mutex region for get_new_data.

I'm trying to find a way to make it readable, maybe an email?

rplant8 commented 3 years ago

markup warps everything

JayDDee commented 3 years ago

Did you get it? It's effectively moving 3 lines of code using comments for the deletions.

rplant8 commented 3 years ago

testing

rplant8 commented 3 years ago

it didn't help

[Screenshot: 2020-12-19 02:06:28]
JayDDee commented 3 years ago

Thanks. I need to investigate more.

JayDDee commented 3 years ago

Is it possible 2 jobs have the same ntime? If so the stratum thread wouldn't recognize the second one as new work. I can provide debug code to test.

rplant8 commented 3 years ago

yes, ntime is the same. a second is too short for such a fast coin)

rplant8 commented 3 years ago

You turned off the clean job check a long time ago; this looks like a remnant:

    // Safer than testing the job id
    bool new_job = *get_stratum_job_ntime() != g_work->data[ algo_gate.ntime_index ];

maybe go back to clean job? ntime just does not play a role

JayDDee commented 3 years ago

Ok, it now makes sense. I don't like using the jobid to test for new work, it doesn't work properly with some algos. I thought ntime was good enough, but it isn't.

It's always been a clumsy test. It needs a rewrite to be triggered anytime a mining.notify is received.

JayDDee commented 3 years ago

clean_job only applies to a new job id. If clean_job is not set the miner may continue mining the old job. All jobs are assumed clean, ie start using the new job data immediately.

rplant8 commented 3 years ago

I think we should check more: first clean job; if the stratum does not ask for a restart then we probably continue. But it doesn't hurt to check at least the block number, bits, and ntime. Many stratums do not implement clean job properly. I don't know what else.

rplant8 commented 3 years ago

If clean job == false, then check the block number and nbits just in case. If they have not changed, continue the old job. It is possible to check ntime too.

rplant8 commented 3 years ago

If clean job == true, then a restart is better. What if the stratum is buggy and doesn't accept old jobids even when nothing has changed?

JayDDee commented 3 years ago

I don't know what you are trying to say. I think you might have things backward. This is how the clean job check works:

    if ( new jobid && clean job )
        get new work
    else
        // same job id or unclean new job, continue mining old work
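The same check as a C sketch (the field names job_id and clean are assumptions modeled on a typical stratum job struct, not the miner's exact code):

```c
#include <stdbool.h>
#include <string.h>

struct stratum_job { char job_id[64]; bool clean; };

/* Clean-job check as described: only fetch new work when the job id
 * has changed AND the server flagged the job as clean. Otherwise
 * (same job id, or unclean new job) continue mining the old work. */
static bool should_get_new_work(const struct stratum_job *job,
                                const char *current_job_id)
{
    bool new_jobid = strcmp(job->job_id, current_job_id) != 0;
    return new_jobid && job->clean;
}
```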

One question I have is whether the stratum server can send 2 mining.notify messages using the same jobid. The protocol logs suggest not; I don't see any.

With tight coupling between mining.notify and new jobs I may have a reliable solution.

rplant8 commented 3 years ago

You describe it correctly, only there is currently no check for clean job in the miner. I haven't seen any jobid duplicates; that would be a very broken stratum, there is no place for such in this world.

rplant8 commented 3 years ago

A preliminary summary: I put in a check on stratum.job.clean, no more missed blocks.

To do it properly you would need to check all input parameters except jobid. You can optimize; comparing everything is too expensive. Restart if something has changed, otherwise just switch to the new jobid. The clean job flag can also lie, for example in many versions of NOMP: the stratum will accept the same share since the jobid has changed, but we calculate the same thing several times.

JayDDee commented 3 years ago

Unclean jobs were never implemented AFAIK, so they're all clean. That's why I removed it. Testing the data is indirect; I want a guaranteed way to know that a mining.notify was received. I tested the following and got 7 stales over 100 shares, but they were all due to network latency (new job received after the share was submitted). All mining.notify messages were followed by "threads restarted".

miner.h line 448: add a new_job member to the stratum context struct.

    struct stratum_ctx {
        char *url;

        CURL *curl;
        char *curl_url;
        char curl_err_str[CURL_ERROR_SIZE];
        curl_socket_t sock;
        size_t sockbuf_size;
        char *sockbuf;
        pthread_mutex_t sock_lock;

        double next_diff;
        double sharediff;

        char *session_id;
        size_t xnonce1_size;
        unsigned char *xnonce1;
        size_t xnonce2_size;
        struct stratum_job job;
        struct work work __attribute__ ((aligned (64)));
        pthread_mutex_t work_lock;

        int block_height;
        bool new_job;
    } __attribute__ ((aligned (64)));

util.c line 2175: set new_job when handling mining.notify.

    if ( !strcasecmp( method, "mining.notify" ) )
    {
        ret = stratum_notify( sctx, params );
        sctx->new_job = true;
        goto out;
    }

cpuminer.c line 1993: reset new_job when copying to g_work.

    bool new_job;

    // Safer than testing the job id
    // bool new_job = *get_stratum_job_ntime()
    //                != g_work->data[ algo_gate.ntime_index ];

    pthread_rwlock_wrlock( &g_work_lock );
    pthread_mutex_lock( &sctx->work_lock );

    new_job = sctx->new_job;
    sctx->new_job = false;

cpuminer.c line 2770: test new_job to call stratum_gen_work.

    // if ( stratum.job.job_id && ( !g_work_time
    //      || ( *get_stratum_job_ntime()
    //           != g_work.data[ algo_gate.ntime_index ] ) ) )
    if ( stratum.new_job )
        stratum_gen_work( &stratum, &g_work );

rplant8 commented 3 years ago

I did this. A crutch, but it helped. It needs careful thought on how to do it correctly and universally; stratums can't always be trusted. Thanks for the help.

[Screenshot: 2020-12-19 05:53:38]
JayDDee commented 3 years ago

That if expression is always true because every job is clean. You're calling stratum_gen_work on every loop iteration for any method; it should only be called after handling a mining.notify method. My code above does that.

rplant8 commented 3 years ago

yes, the hack turned out to be dirty. we need to redo it, but it's still better than it was

JayDDee commented 3 years ago

What don't you like about my proposal?

rplant8 commented 3 years ago

My pool has clean job implemented. I didn't know that everyone else ignored the standard.

JayDDee commented 3 years ago

I still don't know what you're talking about. Clean jobs is not an issue. The miner can ignore the feature and assume all jobs are clean, ie switch to new jobs immediately. There's nothing wrong with that and it has nothing to do with the problem reported in this issue.

I'm just trying to fix the problem, not rewrite the entire stratum client.

The problem originated when I made a change to focus on ntime as the trigger instead of jobid. That was wrong but I made the change because checking jobid has problems on some algos.

I think I have a solution but you seem to be ignoring it and are obsessed with clean jobs.

JayDDee commented 3 years ago

In testing my latest solution there was one instance of 2 mining.notify messages timestamped to the same second followed by 2 quick share submissions. Both used the correct jobid and were accepted. This is as close as I've come to reproducing the conditions that caused the second job to be ignored.

Technically the jobid would be preferred to detect new work but that involves clumsy string compares. Using mining.notify is a more reliable way to ensure that no jobs are missed.

Both fixes will be in the new release. The lack of mutex protecting the restart flags may not have been the cause of this issue but still had the potential to ignore a new job if it arrived immediately after the previous one.

Do you have any questions before I submit?

rplant8 commented 3 years ago

IMHO, if clean == false we continue to dig; if the pool does not support clean job then it does not matter.

if clean == true:

  1. We can simply force a restart; in any case it will work. The simplest option. Cons: if the pool does not implement clean job properly and the block has not changed, we recalculate the same thing several times; the accepted shares on the pool will be the same, but fewer blocks will be found (am I right?)

  2. An easy option: check the block number, bits, and ntime. If they have not changed, just switch to the new jobid and keep counting. Is that possible? Will there be no blocking?

  3. The heavy version: check all parameters from the notify. If they have not changed, switch to the new jobid and continue counting.

Intermediate options between 2 and 3 are possible; it still seems to me that 2 should work.

In any case, shares must be submitted with the latest jobid, otherwise the stratum will not accept them.

rplant8 commented 3 years ago

sugarchain is an extreme option, there will always be 1-2 options on it

rplant8 commented 3 years ago

Option 2.5: compare the block number, ntime, bits, string lengths, and transaction count.

rplant8 commented 3 years ago

Although the block number and ntime will probably be enough. When nbits changes, ntime should also change.

JayDDee commented 3 years ago
  1. If we get a new unclean job it should have different data in the block header, otherwise it would be a duplicate. The old way was to check the entire block header except for the nonce, a 76 byte memcmp. I tried many things to make it faster, including checking for clean jobs to reduce the number of thread restarts. My decision about the clean bit came down to risk. If I continue hashing the old job id when I should have switched, I submit rejects. If I switch to the new job when I could have continued, I restart the miner threads for nothing, but no harm is done and the time cost is less than one hash. In my testing when I was playing with the clean job bit I never saw an instance where I could safely ignore a job. I don't see any useful purpose in it.

  2. The block number only changes when there is a new block. Most coins are engineered for a 5 minute average block time. nbits only changes when the network difficulty changes, which is less often than the block number. I thought ntime was the most reliable single indicator of new data, but Sugarchain proved me wrong.
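The old full-header memcmp mentioned in point 1 can be sketched as follows (offsets follow the standard 80-byte block header layout; the 4-byte nonce at offset 76 is excluded, hence the 76-byte compare):

```c
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

/* Standard 80-byte block header: version(4) prev_hash(32) merkle_root(32)
 * ntime(4) nbits(4) nonce(4). Comparing the first 76 bytes covers every
 * field except the nonce, which the miner changes itself. */
static bool is_new_work(const uint8_t old_hdr[80], const uint8_t new_hdr[80])
{
    return memcmp(old_hdr, new_hdr, 76) != 0;
}
```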

My reasoning for the chosen solution is based on the assumption the server never sends duplicate jobs so every mining.notify represents a new job that must be used immediately. Is that assumption incorrect?

rplant8 commented 3 years ago

I'll probably agree. I think the gain from a clever analysis of jobs would be minimal.

JayDDee commented 3 years ago

Did some testing of sugar at zergpool, got three blocks in the same second followed by an accepted share. This increases my confidence in the fix.

I also noticed that sugar doesn't produce multiple jobs for the same block, all jobs are new blocks. This is not the case for most coins which is why using the block number isn't reliable for new job detection.

    [2020-12-20 14:11:01] New Block 8375953, Job 2b08b Diff: Net 0.011881, Stratum 0.5064, Target 7.7271e-06 TTF @ 2322.80 h/s: Block 6h06m, Share 0m14s Net hash rate (est) 12.76 Mh/s
    [2020-12-20 14:11:01] New Block 8375954, Job 2b08c Diff: Net 0.011868, Stratum 0.5064, Target 7.7271e-06 TTF @ 2322.80 h/s: Block 6h05m, Share 0m14s Net hash rate (est) 12.74 Mh/s
    [2020-12-20 14:11:01] New Block 8375955, Job 2b08d Diff: Net 0.011863, Stratum 0.5064, Target 7.7271e-06 TTF @ 2316.76 h/s: Block 6h06m, Share 0m14s Net hash rate (est) 12.74 Mh/s
    [2020-12-20 14:11:06] 24 Submitted Diff 2.6607e-05, Block 8375955, Job 2b08d
    [2020-12-20 14:11:06] 24 Accepted 24 S0 R0 B0, 45.794 sec (552ms)

JayDDee commented 3 years ago

v3.15.5 is released with 2 fixes discussed above. Let me know if there are any more problems with stale shares.

JayDDee commented 3 years ago

No problems reported, closing.