JayDDee / cpuminer-opt

Optimized multi algo CPU miner
Other
763 stars 543 forks source link

Segfault on v3.20.2 and Ryzen 5 5500U #379

Closed slightlyskepticalpotat closed 1 year ago

slightlyskepticalpotat commented 1 year ago

I tried to compile the latest version of cpuminer-opt on Ubuntu 22.04 x86_64 with GCC 11.2.0. -march=native -Wall -O3 -march=znver2 -mvaes -Wall -O3 --march=znver2 -Wall -O3 --march=znver1 -Wall -O3 --march=znver3 -Wall All of them gave the following output when run:

         **********  cpuminer-opt 3.20.2  *********** 
     A CPU miner with multi algo support and optimized for CPUs
     with AVX512, SHA and VAES extensions by JayDDee.
     BTC donation address: 12tdvfF7KmAsihBXQXynT6E6th2c2pByTT

[2022-08-27 12:13:13] Scrypt paramaters: N= 1024, R= 1
[2022-08-27 12:13:13] Throughput 8/thr, Buffer 256 kiB/thr, Total 3072 kiB

CPU: AMD Ryzen 5 5500U with Radeon Graphics         
SW built on Aug 27 2022 with GCC 11.2.0
CPU features:  AVX2    AES SHA
SW features:   AVX2    AES SHA
Algo features: AVX512

Starting miner with AVX2...

[2022-08-27 12:13:13] CPU affinity [!!!!!!!!!!!!]
Segmentation fault (core dumped)

Changing the thread count didn't help. I was trying to solo mine dogecoin as an experiment with --algo=scrypt. I later tried the same setup on a Ryzen 5 3500U, and everything worked.

JayDDee commented 1 year ago

Never mind, I can't to a full vector byte shuffle with _mm256_shuffle_epi8

-------ignore------ I just noticed an error in my bswap patch. I failed to notice that the destination target had the array elements reversed, counting down from 7 to 0.

That results in an invalid target. Utimately the loadu test was partially invalid. The segfault was successfuly prevented but the no blocks was due to a bug in the patch.

The test wasn't a big deal, more of a curiosity. However, it's easilly fixed if you want to repeat the test. A custom bswap macro is needed with the shuffle index changed to reverse the array elements as well as the bytes, effectively reversing the bytes accross the entire 256 bit vector.

Existing macro: #define mm256_bswap_32( v ) \ _mm256_shuffle_epi8( v, \ m256_const_64( 0x1c1d1e1f18191a1b, 0x1415161710111213, \ 0x0c0d0e0f08090a0b, 0x0405060700010203 ) )

New macro: #define mm256_bswap_32_test( v ) \ _mm256_shuffle_epi8( v, \ m256_const_64( 0x0001020304050607, 0x08090a0b0c0d0e0f, 0x1011121314151617, 0x18191a1b1c1d1e1f ) )

If you feel like repeating the test and have any questions, just ask. I don't see a lot of value in it, but like I said it's a curiosity. We already know loadu will prevent the segfault it's now only about finding blocks with the patch.

Edit: another change to the patch:

I'm concerned about using loadu for a local variable, explicit loads may interfere with the compiler's register optimizing. Since we know the segfault occured on the write we don't need to do a loadu, just let the compiler handle it.

_mm256_storeu_si256( (__m256i*)(work->target), mm256_bswap_32_test( *(__m256i*)target ) ) ); -------end ignore-----

JayDDee commented 1 year ago

That mess up with the byte order swapping made be realize the array reversal is unique in the miner. For stratum the target is calculated from the stratum difficulty so there's no byte swapping needed. That may explain some of the mystery.

Combining the array reversal with the byte swap is more complex to perform with vectors. I'm very surprised the compiller tried to do it, I'n not sure yet if I can do it. It will take more instructions. But I don't see how that would lead to a segfault.

slightlyskepticalpotat commented 1 year ago

Going to try this when I get home tonight.

On Mon., Aug. 29, 2022, 12:57 JayDDee, @.***> wrote:

That mess up with the byte order swapping made be realize the array reversal is unique in the miner. For stratum the target is calculated from the stratum difficulty so there's no byte swapping needed. That may explain some of the mystery.

Combining the array reversal with the byte swap is more complex to perform with vectors. I'm very surprised the compiller tried to do it, I'n not sure yet if I can do it. It will take more instructions. But I don't see how that would lead to a segfault.

— Reply to this email directly, view it on GitHub https://github.com/JayDDee/cpuminer-opt/issues/379#issuecomment-1230577771, or unsubscribe https://github.com/notifications/unsubscribe-auth/AHOXLYAEPGXKUYF24EJEZ3TV3TTWZANCNFSM57ZRU27Q . You are receiving this because you modified the open/close state.Message ID: @.***>

JayDDee commented 1 year ago

It won't work. I haven't figured out how to reverse the bytes accross the entire 256 bit vector. The target will be incorrect. I think it needs the new test bswap then another shuffle/permute to swap the 128 bit halves.

Edit: this might work

#define mm256_bswap_32_test( v ) \ _mm256_permute4x64_epi64( _mm256_shuffle_epi8( v, \ _mm256_set_epi64x( 0x0001020304050607, 0x08090a0b0c0d0e0f, 0x1011121314151617, 0x18191a1b1c1d1e1f ), 0x4e )

When operating at the byte level (assuming targets are byte arrays) the actual procedure becomes more obvious, no fancy byte swapping required:

for ( i = 0; i < 64; i++ ) work->target[ 63-i ] = target[ i ];

slightlyskepticalpotat commented 1 year ago

These pieces of code:

   __m256i x = mm256_bswap_32_test( _mm256_loadu_si256( (__m256i*)target ) );
   printf("ree\n");
   fflush(stdout);
   _mm256_storeu_si256( ( (__m256i*)(work->target)), x);
#define mm256_bswap_32_test( v ) \
   _mm256_permute4x64_epi64( _mm256_shuffle_epi8( v, \
         _mm256_set_epi64x( 0x0001020304050607, 0x08090a0b0c0d0e0f, \
                        0x1011121314151617, 0x18191a1b1c1d1e1f ), 0x4e )

Unfortunately only get me:

gcc -DHAVE_CONFIG_H -I.  -Iyes/include -fno-strict-aliasing  -I. -Iyes/include -Wno-pointer-sign -Wno-pointer-to-int-cast   -O3 -march=native -Wall  -Iyes/include -MT cpuminer-malloc-huge.o -MD -MP -MF .deps/cpuminer-malloc-huge.Tpo -c -o cpuminer-malloc-huge.o `test -f 'malloc-huge.c' || echo './'`malloc-huge.c
In file included from ./simd-utils.h:163,
                 from ./algo-gate-api.h:8,
                 from algo/sha/sha256d.h:1,
                 from cpu-miner.c:41:
cpu-miner.c: In function ‘gbt_work_decode’:
./simd-utils/simd-256.h:500:27: error: too many arguments to function ‘_mm256_shuffle_epi8’
  500 | _mm256_permute4x64_epi64( _mm256_shuffle_epi8( v, \
      |                           ^~~~~~~~~~~~~~~~~~~
cpu-miner.c:906:16: note: in expansion of macro ‘mm256_bswap_32_test’
  906 |    __m256i x = mm256_bswap_32_test( _mm256_loadu_si256( (__m256i*)target ) );

EDIT: Oops, that was obvious. Maybe I should only look at this in the mornings...

JayDDee commented 1 year ago

Missing closing bracket

define mm256_bswap_32_test( v ) \

_mm256_permute4x64_epi64( _mm256_shuffle_epi8( v, \ _mm256_set_epi64x( 0x0001020304050607, 0x08090a0b0c0d0e0f, \ 0x1011121314151617, 0x18191a1b1c1d1e1f ) ), 0x4e )

slightlyskepticalpotat commented 1 year ago

Interesting. Whether I put a printf in between the two lines of code in cpu-miner.c or not, this works perfectly.

JayDDee commented 1 year ago

That test has narrowed the problem to a misaligned access fault when writing the byte-swapped 256 bit vector back to memory. The only thing needed to fix it was to remove the alignment requirement by using the _mm256_storeu_si256 instead of _mm256_store_si256 or letting the compiler use an aligned store by doing a direct asignment..

You can confirm by removing the "u" to force an aligned store to see if the segfault comes back. Display the work->target pointer at the same time and it will prove the address was aligned and the misaligned fault is bogus.

You can easilly toggle back and forth and prove the CPU is improperly faulting an aligned access. You can do the same test on the 3500U and prove it doesn't fault.

BTW loadu/storeu just splits the memory access into multiple smaller chunks to avoid alignement issues, at significant performance penalty.

slightlyskepticalpotat commented 1 year ago
   __m256i x = mm256_bswap_32_test( _mm256_loadu_si256( (__m256i*)target ) );
   _mm256_storeu_si256( ( (__m256i*)(work->target)), x);

Gives 0x7ff550002150 0x7ff557021ce0 (no segfault)

   __m256i x = mm256_bswap_32_test( _mm256_loadu_si256( (__m256i*)target ) );
   _mm256_store_si256( ( (__m256i*)(work->target)), x);

Gives 0x7f4c88002150 0x7f4c8ffbcce0 (segfault)

You were right, the segfault returns when I remove the u despite the pointers apparently being aligned. I tried it on the 3500U and both of those do not segfault. Just wondering, how often do you hand-write vectorization code instead of letting the compiler optimise?

Incidentally, a warranty ticket for my 5500U laptop I put in a while ago has finally been processed. A usb port is busted, so they're going to replace the motherboard sometime. After that, I'm going to test to see if it also happens on another cpu of the same model.

JayDDee commented 1 year ago

Hash function vectorization operates on multiple parallel data streams so each lane is like a seperate thread. Compiler is limited to simpler stuff like fixed iteration loops with no dependencies and data copying. I was surprised the compiler was able to vectorize the bswap loop, but I guess it was looking specifically for inverting arrays.

JayDDee commented 1 year ago

I was reading a bit about AMD64 architecure, they actual had 64 bit before Intel, and how alignment actually works. It is indeed a processor exception rather than an MMU fault. Align Checking (AC) is programmable but different programming doesn't explain faulting a properly aligned access. For the programming to be different it would have to be assumed that the same OS would program the 3500U and 5500U differently. That seem very unlikely. It would also have to be assumed the compiler was oblivious to the AC setting and generated an aligned access without guarantying the address would be aligned when AC checking was being fullly enforced. And that would also have to assume the address was in fact misaligned.

Editted to remove the reference to zen2 architecture.

JayDDee commented 1 year ago

Incidentally, a warranty ticket for my 5500U laptop I put in a while ago has finally been processed. A usb port is busted, so they're going to replace the motherboard sometime. After that, I'm going to test to see if it also happens on another cpu of the same model.

This also gives AMD an opportunity to reproduce this problem on the very same CPU. BTW I opened a ticket with customer care for this segfault: 8201225820 You might want to link it to your ticket.

slightlyskepticalpotat commented 1 year ago

Just a correction, the 3500U is actually based on Zen+, not Zen 2. AMD naming conventions will never cease to surprise me 😅. I'm going to try to link your ticket to mine—you opened it with AMD, right?

JayDDee commented 1 year ago

I used the online support to fire off a question as a teaser to see if a human would pick it up. They sent me an email with a ticket number but no link to it. I'll let you know if I hear anything back.

I think my bit counting of alignment was wrong, The source pointer (target) is not aligned to 32 bytes but the destination work->target is. It doesn't matter much because the fault is on the destination pointer.

JayDDee commented 1 year ago

Got a reply from AMD, it's being escalated to an "expert".

JayDDee commented 1 year ago

Just a correction, the 3500U is actually based on Zen+, not Zen 2. AMD naming conventions will never cease to surprise me sweat_smile. I'm going to try to link your ticket to mine—you opened it with AMD, right?

I just had a thought about this. AFAIK Zen+ has a different AVX2 implementation, the same as Zen (1). AVX2 (256 bit ) operations are executed as two AVX (128 bit) operations. Zen2 implemented full 256 bit wide execution units. This could effectively reduce the required data alignment for AVX2 on Zen+ and could partially explain the different behaviour on the two CPUs. This is just speculation, I look forward to the AMD experts explaining what's really happening.

slightlyskepticalpotat commented 1 year ago

Interesting...I don't have any experience with AMD's support system but I hope their "experts" are better than Apple's "geniuses".

JayDDee commented 1 year ago

Reply from AMD. They want a service request from you. Let me know if you want any help with the information requested. You can also keep me in the loop as I will be able to better answer their questions about cpuminer-opt.


Dear Jay,

Your service request : SR #{ticketno:[8201225820]} has been reviewed and updated.

Response and Service Request History:

Thank you for your email.

We'd be happy to investigate this issue, however it would be easier to work with the user affected directly.

Please could you ask the user to open a service request here: https://www.amd.com/en/support/contact-email-form

Please provide the following information in the service request:

Description of the issue and a link to the Github page
Full System Specs - Including BIOS version
OS/Distribution Version/Kernel etc
System Name/Model (eg if a laptop what is the model and where was it purchased from)
dmesg log and similar logs from OS

Once we have that information, we will work with the user directly and investigate the issue that is seen with a segfault.

In order to update this service request, please respond without deleting or modifying the service request reference number in the email subject or in the email correspondence below.

Please Note: This service request will automatically close if we do not receive a response within 10 days and cannot be reopened.

If it is not feasible to respond within 10 days, feel free to open a new service request and reference this ticket for continued support.

Best regards,

Matt

AMD Global Customer Care

slightlyskepticalpotat commented 1 year ago

Thanks for letting me know, I'll put one in! Do you know how I should describe the issue more technically? I don't think "segfault" is going to cut it.

On Mon, Sep 5, 2022 at 8:05 AM JayDDee @.***> wrote:

Reply from AMD. They want a service request from you. Let me know if you want any help with the information requested. You can also keep me in the loop as I will be able to better answer their questions about cpuminer-opt.

_Dear Jay,

Your service request : SR #{ticketno:[8201225820]} has been reviewed and updated.

Response and Service Request History:

Thank you for your email.

We'd be happy to investigate this issue, however it would be easier to work with the user affected directly.

Please could you ask the user to open a service request here: https://www.amd.com/en/support/contact-email-form

Please provide the following information in the service request:

Description of the issue and a link to the Github page Full System Specs - Including BIOS version OS/Distribution Version/Kernel etc System Name/Model (eg if a laptop what is the model and where was it purchased from) dmesg log and similar logs from OS

Once we have that information, we will work with the user directly and investigate the issue that is seen with a segfault.

In order to update this service request, please respond without deleting or modifying the service request reference number in the email subject or in the email correspondence below.

Please Note: This service request will automatically close if we do not receive a response within 10 days and cannot be reopened.

If it is not feasible to respond within 10 days, feel free to open a new service request and reference this ticket for continued support.

Best regards,

Matt

AMD Global Customer Care_

— Reply to this email directly, view it on GitHub https://github.com/JayDDee/cpuminer-opt/issues/379#issuecomment-1236915627, or unsubscribe https://github.com/notifications/unsubscribe-auth/AHOXLYCVILP7JUEFDSVOYKLV4XO2LANCNFSM57ZRU27Q . You are receiving this because you modified the open/close state.Message ID: @.***>

JayDDee commented 1 year ago

"Segfault" is a good start because that is how the OS is reporting it. You can expand by explaining where the fault is occurring and how it was determined to actually be a misaligned fault and that the faulting address is in fact aligned to 32 bytes

slightlyskepticalpotat commented 1 year ago

Submitted—could you please let them know that my ticket is 8201227170?

On Mon, Sep 5, 2022 at 11:16 AM JayDDee @.***> wrote:

"Segfault" is a good start because that is how the OS is reporting it. You can expand by explaining where the fault is occurring and how it was determined to actually be a misaligned fault and that the faulting address is in fact aligned to 32 bytes

— Reply to this email directly, view it on GitHub https://github.com/JayDDee/cpuminer-opt/issues/379#issuecomment-1237194053, or unsubscribe https://github.com/notifications/unsubscribe-auth/AHOXLYG664PQUHFDBT44LSLV4YFGLANCNFSM57ZRU27Q . You are receiving this because you modified the open/close state.Message ID: @.***>

JayDDee commented 1 year ago

Done. That should close the loop so we are all informed. I'm reopening this issue since it's still active.

JayDDee commented 1 year ago

For reference here is a summary of the main points as I understand them at this point.

slightlyskepticalpotat commented 1 year ago

Update on the warranty situation: the entire motherboard was replaced, but the same error occurs.

On Mon, Sep 5, 2022 at 11:17 PM JayDDee @.***> wrote:

For reference here is a summary of the main points as I understand them at this point.

  • Two test laptop PCs, similar except for CPU generation. Target has 5500U, control has 3500U.
  • Testing uses same OS, Ubuntu-22.04, same compiler version, same compile options, same application source code, same application options.
  • Subject source code is a looped copy and byte order reversal of a 256 bit array composed of 8 32 bit integers.
  • Control never crashes.
  • Target crashes when compiled with auto-vectorization, otherwise works correctly.
  • Target crashes when array byte swap source code is replaced with AV2 intrinsics using aligned store _mm256_store_si256
  • Target does not crash and works correctly when using AVX2 intrinsincs with unaligned store _mm256_storeu_si256.
  • Displaying the faulting pointer with printf or gdb shows it always aligned to 32 bytes as required by AVX2.

— Reply to this email directly, view it on GitHub https://github.com/JayDDee/cpuminer-opt/issues/379#issuecomment-1237613627, or unsubscribe https://github.com/notifications/unsubscribe-auth/AHOXLYCEAU3NEIGSD6YDX3DV42ZWJANCNFSM57ZRU27Q . You are receiving this because you modified the open/close state.Message ID: @.***>

JayDDee commented 1 year ago

AMD is closing my ticket saying issue is resolved but wil work with you to find root cause of your issue. Pleas let me know what they find, if they find anything.

slightlyskepticalpotat commented 1 year ago

Just to let you know, they haven't responded to my ticket (8201227170) since they sent me an automated email saying it had been opened. Are you able to ask them what the status is from your closed ticket?

JayDDee commented 1 year ago

They told me to open a new ticket if I wanted further support so they'd likely ignore any queries about the old one.

I think no news is good news. That your ticket hasn't been closed yet is a good sign. There is always pressure to close tickets quickly to improve metrics. AMD techs are probably waiting to get their hands on the laptop. I expect a reply soon after because it will be a hot potato. What's in that reply will be the interesting part.

If you don't have a contact for your ticket you could use TECH.SUPPORT@amd.com. That was used by the "experts" for my ticket and I was able to reply.

slightlyskepticalpotat commented 1 year ago

Dear Anthony,

Your service request : SR #{ticketno:[8201227170]} has been reviewed and updated.

Response and Service Request History:

Sorry for the late response.

I have checked and there are no open issues with Segfault.

I cannot comment on opensource software however, I would recommend that the developer register on the AMD Developer site https://developer.amd.com/ and work directly with AMD developers or post details of the issue on the AMD Developer Communit https://community.amd.com/t5/newcomers-start-here/bd-p/newcomer-forumy (you will need to ask to be whitelisted).

On Sat, Sep 17, 2022 at 12:39 PM JayDDee @.***> wrote:

They told me to open a new ticket if I wanted further support so they'd likely ignore any queries about the old one.

I think no news is good news. That your ticket hasn't been closed yet is a good sign. There is always pressure to close tickets quickly to improve metrics. AMD techs are probably waiting to get their hands on the laptop. I expect a reply soon after because it will be a hot potato. What's in that reply will be the interesting part.

If you don't have a contact for your ticket you could use @.*** That was used by the "experts" for my ticket and I was able to reply.

— Reply to this email directly, view it on GitHub https://github.com/JayDDee/cpuminer-opt/issues/379#issuecomment-1250102314, or unsubscribe https://github.com/notifications/unsubscribe-auth/AHOXLYEKF5CLW7ISJCJT2OTV6XX2PANCNFSM57ZRU27Q . You are receiving this because you modified the open/close state.Message ID: @.***>

JayDDee commented 1 year ago

Disappointing but not entirely unexpected. Unfortunately AMD took the easy way out and blamed the software, ignoring the evidence to the contrary. There's nothing I can do because I don't own the CPU, or type of CPU, and can't reproduce the problem.

JayDDee commented 1 year ago

There has been another report of the same problem, #389, this time with an Intel CPU. This eliminates the CPU as the problem. Both users were using Ubuntu-22.04 and GCC-11.2 with points to a possible compiler problem.

slightlyskepticalpotat commented 1 year ago

Interesting. I will see if I can test it on a newer version of GCC sometime this week to see if they've fixed it since then.

JayDDee commented 1 year ago

I'm thinking of adding some debug code just for this issue. The code will be inserted just before the loop that crashes and will test the alignment of the target pointers before the crash. It will be compiled whenever AVX2 is present regardless of compiler optimization and is activated at run time with the --debug option.

I suggest adding it for testing. Feel free to make any modifications.

#if defined(__AVX2__) if ( opt_debug ) { if ( (uint64_t)target % 32 ) applog( LOG_ERR, "Misaligned target %p", target ); if ( (uint64_t)(work->target) % 32 ) applog( LOG_ERR, "Misaligned work->target %p", work->target ); } #endif

JayDDee commented 1 year ago

Some statistical observations:

There is a 50% random chance that any address will be aligned to 32 bytes or better. The absence of a crash is not conclusive. The crash seems to be consistent, so far, for a given environment (OS, compiler, CPU). Changing any variable could flip the random result, for example the CPU that now does not crash could crash when compiled with a different GCC version.

Test results need to be interpreted carefully. I hated statistics in school, so much uncertainty.