JayDDee / cpuminer-opt

Optimized multi algo CPU miner
Other
763 stars 541 forks source link

Crash on startum share reject (v. 3.20.3+) #399

Closed YetAnotherRussian closed 9 months ago

YetAnotherRussian commented 10 months ago

Sad week full of bugs :(

v. 3.20.0 is working OK, sadly it seems to be the latest one which is unaffected v. 3.20.3+ got a crash

Earlier versions, e.g. from v3.17.1:

image

From 3.23.0:

image

I see nothing strange here, hmmm

I've tried to emulate rejects @ random third-party pool using wrong algo, but it works!

YetAnotherRussian commented 10 months ago

UPD:. Heh, it seems to be a bug in my sofware, as the job changes in between notify and submit (pool side). I call some kinda method that modifies the current job which miners are still working on... I'll have to fix this myself.

The question is - why cpuminer-opt is crashng in the latest versions?

I also use closed-source mining software, it doesn't crash.

JayDDee commented 10 months ago

It looks like it's crashing on the reject reason. That code was last changed in 3.20.1 and modified the logging of low difficulty shares. Immediately following the Rejected log the reason text is parsed for "difficulty" to report the share & target difficulty instead of the hash values. Everything points to that code but I don't see why it crashes. The reason text is clearly "low difficulty share" as displayed in the protocol log and there's no other apparently risky code.

Coincidentally I saw some low diff shares during testing this week, no crash.

Edit: for testing you can force low difficulty shares to be submited by modifying the diff_factor: "-m 1.2" should have around 20% rejected.

You can also use "--quiet" without "-D -P" so the reject reason won't be displayed and should prevent the crash or point in another direction.

Edit: also coincidentally the low diff rejects I saw were with sha256dt, yours are with sha256d. https://github.com/JayDDee/cpuminer-opt/issues/344#issuecomment-1704569305 The rejects I saw were when testing changes to to the early exit decision. They were also reported as low diff buf the diff was clearly higher than the target (hash value lower than target hash). They only occured with sha256dt, not sha256t and I haven't found an explanation.

JayDDee commented 10 months ago

This is getting weird. I can prevent the low difficulty shares by reverting the sha256 optimization but I still have no idea why they're happening. The crash trying to report a low diff reject is also a mystery, I don't see any crashes in my testing.

YetAnotherRussian commented 10 months ago

Please leave this one for a bit, I'll make gdb report in a day and a half.

JayDDee commented 10 months ago

I suggest crash testing with -t 1 to avoid any parallel output confusion.

I may have found the bug causing the low diff shares with sha256*. I need to do a lot more testing.

Edit: Your low diff share with sha256d isn't related to the bug I'm fixing, sha256d still uses the Pooler ASM code for AVX2. Something in your environment may have caused it and that may also be why the miner crashed. The associateion with the low diff log may just be coincidental. If you're comparing other miners you should also include TPruvot cpuminer-multi. No updates for a few years but the old algos are supported, has windows binaries and can also be built from source on Windows.

JayDDee commented 10 months ago

I'm ready with a fix for #344. The fix is better than I expected. Edit: except for SSE2/AVX.

If nothing comes up that addresses your current issues I'll go ahead with the release.

JayDDee commented 9 months ago

Unexpectedly I'm planning yet another release in the next couple of days. If you have anything that needs to be addressed let me know.

YetAnotherRussian commented 9 months ago

You've made this realease faster than I got home to gdb ;)

So I've started with cloning a fresh release from the repo and I can confirm there's no low diff shares for SHA256D anymore...

But, the first thing I've tried was using -a sha256dt for a sha256d stratum pool with the latest release. I got this:

image

So you could easily emulate a crash using wrong algo at a low stratum diff

JayDDee commented 9 months ago

At first glance it looks like yet another alignment issue, note "__strchr_avx2". Try compiling with "-O2" to see if it still crashes.

I've been doing a lot of testing with all sha256 d/t/dt but your crash is in protocol code so it's not relevant. Edit: my testing included a lot of low difficulty shares but none crashed. The reason pointer returned by JSON was likely not aligned and crashed due to the compiler's avx2 optimization. I'll have to search to see if it's an internal thing or if it was from a callback.

Editiorial: if the compiler can't guarantee alignment of a pointer it should not vectorize it.

Update:

As feared the mem is dynamically allocated which make it trickier. I found a callback that allocated a "all_data" buffer which looks like it may be the one of interest, I've changed it to use _mm_malloc instead. Even if this memory is misaligned aligning it may not solve the problem if the reason offset is not naturally not aligned relative to the start of the buffer. Edit: on second thought the compiler isn't stupid enough to vectorize an offset pointer, it should only try to vectorize head pointers so this may work.

I suspect the root cause may be build or Windows related. How did you build?

Update: nevermind the following, the code never gets hit. I tried another callback and it never got hit either. The data comes from json but I have no idea where the memory is allocated.

Here's the diff f you want to play with the code...

diff util.c ../cpuminer-opt-3.23.1/util.c 31d30 < #include <mm_malloc.h> 33d31 < 404,405c402 < if ( reqalloc > db->allocated ) < { --- > if (reqalloc > db->allocated) { 425,431c422,424 < // replace realloc with _mm_malloc < if ( ! ( newmem = _mm_malloc( newalloc, 64 ) ) ) return 0; < if ( db->buf ) < { < memcpy( newmem, db->buf, db->len ); // copy existing data < free( db->buf ); < } --- > newmem = realloc(db->buf, newalloc); > if (!newmem) > return 0;

Tip: misalignment is random, make sure you can reproduce the crash before & after testing the new code.

JayDDee commented 9 months ago

I'm stuck right now. The best I can come up with is a debug log to display the reason pointer to confirm it's alignment:

cpuminer:share_result if ( opt_debug ) applog(LOG_NOTICE,"reason: %p, %s", reason, reason ); if ( strstr( reason, "difficulty" ) ) applog2( LOG_MINR, "Share diff: %.5g, Target: %.5g", my_stats.share_diff, my_stats.target_diff );

Update:

cpuminer includes jansson code but doesn't use it, not when I compile it as I have the jansson packages installed. The main point is it does malloc of structs and in the included code there is no alignment directives nor any kind of aligned malloc. I reiterate my editorial from above, the compiler should never vectorize any accesses to memory referenced by these pointers.

I'll look into "#pragma"s that might be able to disable "-ftree-vectorize" for selected code segments. That may be the only way.

Edit: on the other hand I could copy the string to a local buffer that is alligned. That should solve it.

This might fix it:

cpu-miner.c:share_result // Silly workaround for ill advised compiler auto vectorization char s[64] __attribute__ ((aligned (32))); memcpy( s, reason, strlen( reason ) < 64 ? strlen( reason ) : 64 ); s[63] = '\0'; if ( strstr( s, "difficulty" ) ) applog2( LOG_MINR, "Share diff: %.5g, Target: %.5g", my_stats.share_diff, my_stats.target_diff );

Edit, even better: __attribute__ ((optimize(2))) static int share_result( int result, struct work *work, const char *reason )

YetAnotherRussian commented 9 months ago

Try compiling with "-O2" to see if it still crashes.

No changes, even with -O1

I'll look into "#pragma"s that might be able to disable "-ftree-vectorize" for selected code segments. That may be the only way.

As I see from GCC's docs "-ftree-vectorize" is enabled for "-O3" but disabled for "-O2"

JayDDee commented 9 months ago

The crash is in the string library so it may still use vectorized code even if the application is compiled without. The same thing occurs with Pooler ASM. it will use AVX2 if the CPU supports it even if compiled without it. (It would be nice if I could do that with cpuminer-opt, only one binary file would be required to support all architectures but it means rewriting all the hash code in ASM.)

I compiled cpu-miner.c to get an ASM listing using objdump. It's hard to read, the target names of function calls weren't displayed so I couldn't identify the call to strstr. There was some use of xmm registers but the only use of ymm was an unaligned load. All the vector code seemed related to floating point, nothing string related.

It's looking like solving the root problem is out of reach of cpuminer. I can remove the condition that's causing the crash and display the diff for every reject. Displaying useful info for rejects is still a work in progress but it looks like keying on the reject reason text to choose what to display is not going to work.

I could still try the local aligned buffer workaround but it's still trial and error. Can you try it so I know it works before I make it permanent? If not I'll remove the condition and display diff for all rejects.

edit: I did a quick test forcing low diff and I got a misaligned pointer (bit 4 == 1) but it didn't crash.

[2023-09-18 11:43:03] 1 Submitted Diff 0.49636, Block 398343, Job 216a4 [2023-09-18 11:43:04] 1 A0 S0 Rejected 1 B0, 4.978 sec (148ms) [2023-09-18 11:43:04] Reason: 0x7fec5c009530, Low difficulty share Reject reason: Low difficulty share Share diff: 0.49636, Target: 0.45

Maybe a compiler bug? What version of gcc do you use?

Here's the log I added if you want to try yourself: applog(LOG_NOTICE,"Reason: %p, %s", reason, reason );

Edit: Another thing you coud try is build without AVX2, just AVX & AES: CFLAGS="-O3 -march=corei7-avx -maes -Wall -fno-common" ./configure --with-curl

Edit: One question still bugs me: why has this not been seen before? I've never seen it and no one else has reported it. Something in your environment or something your are doing seems to be contributing to this.

Another edit: I did more testing with the windows build no crash, MSYS2 build mislagned pointer no crash. I can't reproduce it.

Yet another edit: I didn't see it mentioned earlier but I assume it crashes for you every time. If so it means it crashes even with an aligned pointer. It's looking more and more like something at your end.

One more thing: Try with --quitet, without --debug. That will skip the string test as well as displaying the reason log. I'm not sure I'll make any changes since the crash apparently only happens to you. Since I can't reproduce the crash with misaligned pointers the code seems to work as intended.

YetAnotherRussian commented 9 months ago

Another thing you coud try is build without AVX2, just AVX & AES: CFLAGS="-O3 -march=corei7-avx -maes -Wall -fno-common" ./configure --with-curl

No that doesn't work, as any of your prebuild binaries from sse2 to avx2-sha-vaes crash as well

Try with --quitet, without --debug. That will skip the string test as well as displaying the reason log.

That helped! And I finally got some info on a crash reason:

image

Seems that without --quiet option cpuminer-opt tries to show the reject reason and in that moment it gets "Accepted" (or empty reason) result for a share which was already rejected. This maybe leads to attempt to handle already null or empty reason or make a substring from that one?

That's from the server logs:

image

So any other mining software (wildrig, srbminer, cryptodredge - others untested) just some kinda kiks or skips those incoming messages and forces a share to be rejected.

I've already fixed that and I've left that old version. If you're interested, may fix and retest. If not, you can close :)

JayDDee commented 9 months ago

I think I see it now in the screenshot in the OP. The miner is getting 2 responses for every submitted share. That will mess things up pretty good. It's part of the lack of synchronization between submitted shares and their respective results response. They aren't evenhandled by the same thread, mining threads submit, stratum thread receives responses. If responses come out of order that will mess things up too. It's all very fragile, one lost message due to a network glitch and the stats queue gets corrupt. This lack of synchronization is also why I can't display the actual hash that corresponds to a share when it's rejected.

The crash looks like a mutex problem opening a window between checking the reason pointer for NULL and using it allowing it to be overwritten by the second response. I don't like mutex, it slows thing down, I don't want to add more for an externally created problem.

Since this is caused by receiveing bad data and the known problems this can cause, it's closing time.

Nevertheless the log will be changed to display difficulty for all rejects without scanning the reason text. I believe printf can handle NULL strings even though it seems strstr can't.