Closed ed2k closed 6 years ago
Hmm.. Had not noticed, but indeed the 1st 48 bytes (of 64) for DAG entry 0 are all zeroe'd.
I'd expect many more invalid shares to be generated if this was a problem?
The chance of being used is slim if only 48 bytes out of 2.xGB is different (from cpu generated). Also very unlikely to also pass the verification from cpu code. Interesting problem though.
Sure, about 1 in 40 million. But it's accessed at least 64 times per hash so we should hit it about every 700,000 hashes. It would be counted as a failed hash by the eval function, or rejected at the pool if noeval is used.
It could cause a passing nonce to be rejected, or a failing nonce to be accepted. We'd only detect the latter.
Can you point me to the CPU implementation you're using as reference.
must be due to shfl_sync in the cuda code. after I remove them, i.e back to one thread generate the complete 64bytes, problem solved.
You going to submit a PR?
let me think about for few days whether there is a performance gain on shfl_sync usage. the change without shfl_sync is here, https://github.com/ed2k/ethminer/commit/6fa16929c8e3883529f4c02b8276c11291176daf
Works, but 30% slower.
It is the last generate pass that clobbers dag entry[0]. Not clear yet but I suspect this will happen whenever dagsize is not a multiple of 4???
Hacked a fix here by running your one thread generate for the last pass, sync_shuffle for everything else.
Seems to do the trick, but there's got to be a more elegant way!
the root cause is correct (not a multiple of 4). In this case last three indices with node_index 42336224, 42336225, 42336226 can not form a complete thread group (4 in total). When shfl_sync from the 4th (t==3), shuffle_index will become 0. So I guess the last 3 entries will also be wrong.
@ed2k I'm not happy with current fix... got any better ideas?
yes, idea is round up to 4 index, need time to do more testing.
+ if (((node_index/4)*4) > d_dag_size * 2) return;
+ if (shuffle_index <= d_dag_size * 2)
+ dag_nodes[shuffle_index].uint4s[thread_id] = s[thread_id];
Make that: if (((node_index/4)4) >= d_dag_size 2) return;
during progpow development I noticed the generated dag from current CUDA code seems having a sequence of zeros in the beginning. can someone check this.
cpu implementation from ethash doesn't have this problem. Haven't checked opencl part.