first node of dag generated from CUDA code are all zeros - Githubissues

ethereum-mining / ethminer

Ethereum miner with OpenCL, CUDA and stratum support

GNU General Public License v3.0

5.98k stars 2.29k forks source link

first node of dag generated from CUDA code are all zeros #1292

Closed ed2k closed 6 years ago

ed2k commented 6 years ago

during progpow development I noticed the generated dag from current CUDA code seems having a sequence of zeros in the beginning. can someone check this.

cpu implementation from ethash doesn't have this problem. Haven't checked opencl part.

jean-m-cyr commented 6 years ago

Hmm.. Had not noticed, but indeed the 1st 48 bytes (of 64) for DAG entry 0 are all zeroe'd.

I'd expect many more invalid shares to be generated if this was a problem?

ed2k commented 6 years ago

The chance of being used is slim if only 48 bytes out of 2.xGB is different (from cpu generated). Also very unlikely to also pass the verification from cpu code. Interesting problem though.

jean-m-cyr commented 6 years ago

Sure, about 1 in 40 million. But it's accessed at least 64 times per hash so we should hit it about every 700,000 hashes. It would be counted as a failed hash by the eval function, or rejected at the pool if noeval is used.

jean-m-cyr commented 6 years ago

It could cause a passing nonce to be rejected, or a failing nonce to be accepted. We'd only detect the latter.

jean-m-cyr commented 6 years ago

Can you point me to the CPU implementation you're using as reference.

ed2k commented 6 years ago

https://github.com/chfast/ethash

ed2k commented 6 years ago

must be due to shfl_sync in the cuda code. after I remove them, i.e back to one thread generate the complete 64bytes, problem solved.

jean-m-cyr commented 6 years ago

You going to submit a PR?

ed2k commented 6 years ago

let me think about for few days whether there is a performance gain on shfl_sync usage. the change without shfl_sync is here, https://github.com/ed2k/ethminer/commit/6fa16929c8e3883529f4c02b8276c11291176daf

jean-m-cyr commented 6 years ago

Works, but 30% slower.

jean-m-cyr commented 6 years ago

It is the last generate pass that clobbers dag entry[0]. Not clear yet but I suspect this will happen whenever dagsize is not a multiple of 4???

Hacked a fix here by running your one thread generate for the last pass, sync_shuffle for everything else.

Seems to do the trick, but there's got to be a more elegant way!

ed2k commented 6 years ago

the root cause is correct (not a multiple of 4). In this case last three indices with node_index 42336224, 42336225, 42336226 can not form a complete thread group (4 in total). When shfl_sync from the 4th (t==3), shuffle_index will become 0. So I guess the last 3 entries will also be wrong.

jean-m-cyr commented 6 years ago

@ed2k I'm not happy with current fix... got any better ideas?

ed2k commented 6 years ago

yes, idea is round up to 4 index, need time to do more testing. + if (((node_index/4)*4) > d_dag_size * 2) return;

+ if (shuffle_index <= d_dag_size * 2) + dag_nodes[shuffle_index].uint4s[thread_id] = s[thread_id];

jean-m-cyr commented 6 years ago

Make that: if (((node_index/4)4) >= d_dag_size 2) return;