Segfault on v3.20.2 and Ryzen 5 5500U

slightlyskepticalpotat commented 1 year ago

I tried to compile the latest version of cpuminer-opt on Ubuntu 22.04 x86_64 with GCC 11.2.0, using the following flag sets:

-march=native -Wall
-O3 -march=znver2 -mvaes -Wall
-O3 --march=znver2 -Wall
-O3 --march=znver1 -Wall
-O3 --march=znver3 -Wall

All of them gave the following output when run:

         **********  cpuminer-opt 3.20.2  *********** 
     A CPU miner with multi algo support and optimized for CPUs
     with AVX512, SHA and VAES extensions by JayDDee.
     BTC donation address: 12tdvfF7KmAsihBXQXynT6E6th2c2pByTT

[2022-08-27 12:13:13] Scrypt paramaters: N= 1024, R= 1
[2022-08-27 12:13:13] Throughput 8/thr, Buffer 256 kiB/thr, Total 3072 kiB

CPU: AMD Ryzen 5 5500U with Radeon Graphics         
SW built on Aug 27 2022 with GCC 11.2.0
CPU features:  AVX2    AES SHA
SW features:   AVX2    AES SHA
Algo features: AVX512

Starting miner with AVX2...

[2022-08-27 12:13:13] CPU affinity [!!!!!!!!!!!!]
Segmentation fault (core dumped)

Changing the thread count didn't help. I was trying to solo mine dogecoin as an experiment with --algo=scrypt. I later tried the same setup on a Ryzen 5 3500U, and everything worked.

JayDDee commented 1 year ago

I'll need to know where in the code it's crashing. Please add --debug, and if you are familiar with gdb a backtrace would be helpful. I'm concerned that it crashes on a more capable CPU. This is not typical of a SW issue or incompatible build. It's also crashing very early, so it might not even be in the hash code yet. Try to reproduce with different algos. Scryptn2 should be included; it shares much code with the smaller scrypt but has a different memory profile. This will help identify if it's an algo issue.

All testing should be done using the default build, and please provide some more details about the faulting system, like the amount of RAM and any differences from the working system.

Edit: also since the issue is not thread related testing would be better with only one miner thread.

slightlyskepticalpotat commented 1 year ago

Alright, I think I'm going to exclusively use build.sh from now on. The faulting and working systems should have near-identical software (clean installs of 22.04), but the faulting system has 16gb of ram and the working system has 8. Secure boot is also enabled on the working system, but I don't think that's relevant.

Output with --debug:

$ ./cpuminer --algo=scrypt --url=http://127.0.0.1:44555 --user=user --pass=pass --coinbase-addr=[address] --debug --threads=1

         **********  cpuminer-opt 3.20.2  *********** 
     A CPU miner with multi algo support and optimized for CPUs
     with AVX512, SHA and VAES extensions by JayDDee.
     BTC donation address: 12tdvfF7KmAsihBXQXynT6E6th2c2pByTT

[2022-08-27 13:15:15] Scrypt paramaters: N= 1024, R= 1
[2022-08-27 13:15:15] Throughput 8/thr, Buffer 256 kiB/thr, Total 256 kiB

CPU: AMD Ryzen 5 5500U with Radeon Graphics         
SW built on Aug 27 2022 with GCC 11.2.0
CPU features:  AVX2    AES SHA
SW features:   AVX2    AES SHA
Algo features: AVX512

Starting miner with AVX2...

[2022-08-27 13:15:15] Coinbase address uses B58 coding
[2022-08-27 13:15:15] CPU affinity [!!!!!!!!!!!!]
[2022-08-27 13:15:15] 1 of 12 miner threads started using 'scrypt' algorithm
[2022-08-27 13:15:15] Default miner thread priority 0 (nice 19)
[2022-08-27 13:15:15] Binding thread 0 to cpu 0
Segmentation fault (core dumped)

gdb output of the same:

[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".

         **********  cpuminer-opt 3.20.2  *********** 
     A CPU miner with multi algo support and optimized for CPUs
     with AVX512, SHA and VAES extensions by JayDDee.
     BTC donation address: 12tdvfF7KmAsihBXQXynT6E6th2c2pByTT

[2022-08-27 13:19:21] Scrypt paramaters: N= 1024, R= 1
[2022-08-27 13:19:21] Throughput 8/thr, Buffer 256 kiB/thr, Total 256 kiB

CPU: AMD Ryzen 5 5500U with Radeon Graphics         
SW built on Aug 27 2022 with GCC 11.2.0
CPU features:  AVX2    AES SHA
SW features:   AVX2    AES SHA
Algo features: AVX512

Starting miner with AVX2...

[2022-08-27 13:19:21] Coinbase address uses B58 coding
[2022-08-27 13:19:21] CPU affinity [!!!!!!!!!!!!]
[New Thread 0x7ffff6870600 (LWP 50901)]
[New Thread 0x7ffff606f600 (LWP 50902)]
[2022-08-27 13:19:21] Default miner thread priority 0 (nice 19)
[2022-08-27 13:19:21] Binding thread 0 to cpu 0
[2022-08-27 13:19:21] 1 of 12 miner threads started using 'scrypt' algorithm

Thread 2 "cpuminer" received signal SIGSEGV, Segmentation fault.
[Switching to Thread 0x7ffff6870600 (LWP 50901)]
0x000055555555f231 in ?? ()
(gdb) bt 10
#0  0x000055555555f231 in ?? ()
#1  0x0000555555564a0e in ?? ()
#2  0x00007ffff7565b43 in start_thread (arg=<optimized out>)
    at ./nptl/pthread_create.c:442
#3  0x00007ffff75f7a00 in clone3 ()
    at ../sysdeps/unix/sysv/linux/x86_64/clone3.S:81

I don't think this is an algo issue—allium, x11, neoscrypt, scryptn2, and any other algorithm I try gives the same output.

slightlyskepticalpotat commented 1 year ago

Just remembered benchmark mode existed and tested with it, doesn't seem to be an algo issue:

         **********  cpuminer-opt 3.20.2  *********** 
     A CPU miner with multi algo support and optimized for CPUs
     with AVX512, SHA and VAES extensions by JayDDee.
     BTC donation address: 12tdvfF7KmAsihBXQXynT6E6th2c2pByTT

[2022-08-27 14:00:12] Scrypt paramaters: N= 1024, R= 1
[2022-08-27 14:00:12] Throughput 8/thr, Buffer 256 kiB/thr, Total 3072 kiB

CPU: AMD Ryzen 5 5500U with Radeon Graphics         
SW built on Aug 27 2022 with GCC 11.2.0
CPU features:  AVX2    AES SHA
SW features:   AVX2    AES SHA
Algo features: AVX512

Starting miner with AVX2...

[2022-08-27 14:00:12] CPU affinity [!!!!!!!!!!!!]
[2022-08-27 14:00:12] 12 of 12 miner threads started using 'scrypt' algorithm
[2022-08-27 14:00:16] Total: 48.95 kH/s, Temp: 39C, Freq: 3.650/3.701 GHz
[2022-08-27 14:00:22] Total: 88.88 kH/s, Temp: 39C, Freq: 3.381/3.451 GHz

JayDDee commented 1 year ago

You're solo mining. I don't think it's the issue if it works on the 3500U but it gives another opportunity to narrow the crash location.

The only thing after the last message is the call to thread_init, which does nothing for most algos; the thread then enters the loop and starts looking for work to hash. The next expected log is a new block report from GBT. Stratum generates a different report, and benchmark doesn't look for work, it just makes up its own.

If you could test with stratum & benchmark the code would take different paths looking for work and might change the symptoms. Beyond that some additional debug messages can be added to zoom in on the exact code that's causing the crash.

However, from a higher level, the fact that it works on the other system indicates a problem specific to the one system. Possibly a corrupt miner or even the OS. I suggest downloading a fresh copy of cpuminer, or use the copy from the working system. Reinstalling the OS is another option.

Let me know if you're comfortable enough with code to add more debug messages with some coaching.

Edit: adding -P will produce protocol logs and may tell us if it's even trying to connect to the server.

Edit: I was starting to think it's an issue with solo mining. It's not well tested or maintained. You could try Tpruvot cpuminer-multi and/or Pooler cpuminer to see if either of them works. But that theory is shot down by the fact that cpuminer-opt works on another system.

slightlyskepticalpotat commented 1 year ago

I think I'm going to try stratum as a starting point since that's better maintained. I think I could stumble through adding some debug messages to the code if you point me in the right direction, but would prefer not to start with that. As for the miner and the os, I've tried downloading cpuminer-opt several times (even the version before the latest version), and they showed similar issues. Going to also try it on a new Live USB to see if it's the OS.

Adding -P produces this. It looks normal to me up to the segfault, but hopefully you can make more sense of this than I can.

         **********  cpuminer-opt 3.20.2  *********** 
     A CPU miner with multi algo support and optimized for CPUs
     with AVX512, SHA and VAES extensions by JayDDee.
     BTC donation address: 12tdvfF7KmAsihBXQXynT6E6th2c2pByTT

[2022-08-27 14:57:36] Scrypt paramaters: N= 1024, R= 1
[2022-08-27 14:57:36] Throughput 8/thr, Buffer 256 kiB/thr, Total 256 kiB

CPU: AMD Ryzen 5 5500U with Radeon Graphics         
SW built on Aug 27 2022 with GCC 11.2.0
CPU features:  AVX2    AES SHA
SW features:   AVX2    AES SHA
Algo features: AVX512

Starting miner with AVX2...

[2022-08-27 14:57:36] Coinbase address uses B58 coding
[2022-08-27 14:57:36] CPU affinity [!!!!!!!!!!!!]
[2022-08-27 14:57:36] 1 of 12 miner threads started using 'scrypt' algorithm
[2022-08-27 14:57:36] Default miner thread priority 0 (nice 19)
[2022-08-27 14:57:36] Binding thread 0 to cpu 0
[2022-08-27 14:57:36] JSON protocol request:
{"method": "getblocktemplate", "params": [{"capabilities": ["coinbasetxn", "coinbasevalue", "longpoll", "workid"], "rules": ["segwit"]}], "id":0}

*   Trying 127.0.0.1:44555...
* Connected to 127.0.0.1 (127.0.0.1) port 44555 (#0)
* Server auth using Basic with user 'user'
> POST / HTTP/1.1
Host: 127.0.0.1:44555
Authorization: Basic dXNlcjpwYXNz
Accept: */*
Accept-Encoding: deflate, gzip, br, zstd
Transfer-Encoding: chunked
Content-Type: application/json
Content-Length: 147
User-Agent: cpuminer-opt/3.20.2
X-Mining-Extensions: longpoll reject-reason
Expect: 100-continue

* Mark bundle as not supporting multiuse
< HTTP/1.1 100 Continue
* Signaling end of chunked upload via terminating chunk.
* Mark bundle as not supporting multiuse
< HTTP/1.1 200 OK
< Content-Type: application/json
< Date: Sat, 27 Aug 2022 18:57:36 GMT
< Content-Length: 635
< 
* Connection #0 to host 127.0.0.1 left intact
[2022-08-27 14:57:36] JSON protocol response:
{
   "result": {
      "capabilities": [
         "proposal"
      ],
      "version": 6422532,
      "rules": [],
      "vbavailable": {},
      "vbrequired": 0,
      "previousblockhash": "bbe9e18b65c42a0a4e7773fdb2ce7af303d42e86d8b79a4b4180c9ae47266372",
      "transactions": [],
      "coinbaseaux": {
         "flags": ""
      },
      "coinbasevalue": 1000000000000,
      "longpollid": "bbe9e18b65c42a0a4e7773fdb2ce7af303d42e86d8b79a4b4180c9ae472663722",
      "target": "00000fffff000000000000000000000000000000000000000000000000000000",
      "mintime": 1661625752,
      "mutable": [
         "time",
         "transactions",
         "prevblock"
      ],
      "noncerange": "00000000ffffffff",
      "sigoplimit": 20000,
      "sizelimit": 1000000,
      "curtime": 1661626656,
      "bits": "1e0fffff",
      "height": 4013362
   },
   "error": null,
   "id": 0
}
Segmentation fault (core dumped)

Thanks for all the help so far!

JayDDee commented 1 year ago

From the protocol logs I can tell that the server sent work and the miner crashed trying to decode it. I have no idea why that would happen on one system but not another. It's also crashing in GBT code so your stratum test might produce different results.

The focus for the GBT crash is on cpu-miner.c:get_upstream_work. That function sends the getblocktemplate request and processes the result by calling gbt_work_decode, then produces the new block log. This is the window where it crashes, and the place to put some debug printf calls as checkpoints to help narrow it down further.
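
One way to place such checkpoints is a small helper that prints a numbered marker and flushes immediately, so the last marker printed before the segfault brackets the failing code. This is only a sketch: `checkpoint` and its counter are hypothetical names, not part of cpuminer-opt, and any added I/O can change how the compiler optimizes the surrounding code.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical debugging helper, not part of cpuminer-opt. */
static int checkpoint_count = 0;

/* Print a numbered marker to stderr and flush it right away, so the
   message is not lost in a buffer if the process crashes afterwards.
   Returns the marker number that was printed. */
static int checkpoint(const char *where)
{
    assert(where != NULL);
    checkpoint_count++;
    fprintf(stderr, "checkpoint %d: %s\n", checkpoint_count, where);
    fflush(stderr);
    return checkpoint_count;
}
```

Calls such as `checkpoint("before gbt_work_decode")` and `checkpoint("after gbt_work_decode")` inside get_upstream_work would show on which side of the decode the crash falls.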

I'll wait for the stratum test results. If it's reproducible using stratum it will make troubleshooting easier.

slightlyskepticalpotat commented 1 year ago

Stratum seems to work, I let it run for a while and it was stable:

         **********  cpuminer-opt 3.20.2  *********** 
     A CPU miner with multi algo support and optimized for CPUs
     with AVX512, SHA and VAES extensions by JayDDee.
     BTC donation address: 12tdvfF7KmAsihBXQXynT6E6th2c2pByTT

[2022-08-27 19:12:52] Scrypt paramaters: N= 1024, R= 1
[2022-08-27 19:12:52] Throughput 8/thr, Buffer 256 kiB/thr, Total 3072 kiB

CPU: AMD Ryzen 5 5500U with Radeon Graphics         
SW built on Aug 27 2022 with GCC 11.2.0
CPU features:  AVX2    AES SHA
SW features:   AVX2    AES SHA
Algo features: AVX512

Starting miner with AVX2...

[2022-08-27 19:12:52] CPU affinity [!!!!!!!!!!!!]
[2022-08-27 19:12:52] Creating stratum thread
[2022-08-27 19:12:52] Stratum connect stratum+tcp://stratum.aikapool.com:7915
[2022-08-27 19:12:52] Threads restarted for new work.
[2022-08-27 19:12:52] Default miner thread priority 0 (nice 19)
[2022-08-27 19:12:52] Binding thread 0 to cpu 0
[2022-08-27 19:12:52] Thread 0 waiting for first job
[2022-08-27 19:12:52] Binding thread 1 to cpu 1
[2022-08-27 19:12:52] Thread 1 waiting for first job
[2022-08-27 19:12:52] Binding thread 2 to cpu 2
[2022-08-27 19:12:52] Thread 2 waiting for first job
[2022-08-27 19:12:52] Binding thread 3 to cpu 3
[2022-08-27 19:12:52] Thread 3 waiting for first job
[2022-08-27 19:12:52] Binding thread 4 to cpu 4
[2022-08-27 19:12:52] Thread 4 waiting for first job
[2022-08-27 19:12:52] Binding thread 5 to cpu 5
[2022-08-27 19:12:52] Thread 5 waiting for first job
[2022-08-27 19:12:52] Binding thread 6 to cpu 6
[2022-08-27 19:12:52] Thread 6 waiting for first job
[2022-08-27 19:12:52] Binding thread 7 to cpu 7
[2022-08-27 19:12:52] Thread 7 waiting for first job
[2022-08-27 19:12:52] Binding thread 8 to cpu 8
[2022-08-27 19:12:52] Binding thread 9 to cpu 9
[2022-08-27 19:12:52] Thread 9 waiting for first job
[2022-08-27 19:12:52] Thread 8 waiting for first job
[2022-08-27 19:12:52] Binding thread 10 to cpu 10
[2022-08-27 19:12:52] Thread 10 waiting for first job
[2022-08-27 19:12:52] 12 of 12 miner threads started using 'scrypt' algorithm
[2022-08-27 19:12:52] Binding thread 11 to cpu 11
[2022-08-27 19:12:52] Thread 11 waiting for first job
*   Trying 84.234.52.190:7915...
* Connected to stratum.aikapool.com (84.234.52.190) port 7915 (#0)
* Connection #0 to host stratum.aikapool.com left intact
[2022-08-27 19:12:52] > {"id": 1, "method": "mining.subscribe", "params": ["cpuminer-opt/3.20.2"]}
[2022-08-27 19:12:52] < {"id":1,"result":[[["mining.set_difficulty","deadbeefcafebabe747c130000000000"],["mining.notify","deadbeefcafebabe747c130000000000"]],"780195aa",4],"error":null}
[2022-08-27 19:12:52] Stratum session id: deadbeefcafebabe747c130000000000
[2022-08-27 19:12:52] Stratum extranonce1 0x780195aa, extranonce2 size 4
[2022-08-27 19:12:52] > {"id": 2, "method": "mining.authorize", "params": ["user", "pass"]}
[2022-08-27 19:12:53] < {"id":null,"method":"mining.set_difficulty","params":[16384]}
[2022-08-27 19:12:53] < {"id":null,"method":"mining.notify","params":["5187","3838d1c26496b014b8928cb8f6d2e881fe7cd962067f377ab3496310b0c37b0f","01000000010000000000000000000000000000000000000000000000000000000000000000ffffffff20032ea04204eba40a6308","0d2f6e6f64655374726174756d2f00000000010010a5d4e80000001976a914f6c7f1c2cd06849dd836bb2f40244741dbc0c4fd88ac00000000",[],"00620004","1a03a131","630aa4eb",true]}
[2022-08-27 19:12:53] < {"id":2,"result":true,"error":null}
[2022-08-27 19:12:53] > {"id": 3, "method": "mining.extranonce.subscribe", "params": []}
[2022-08-27 19:12:53] Thread 0 waiting for first job
[2022-08-27 19:12:53] Thread 1 waiting for first job
[2022-08-27 19:12:53] Thread 2 waiting for first job
[2022-08-27 19:12:53] Thread 3 waiting for first job
[2022-08-27 19:12:53] Thread 4 waiting for first job
[2022-08-27 19:12:53] Thread 5 waiting for first job
[2022-08-27 19:12:53] Thread 6 waiting for first job
[2022-08-27 19:12:53] Thread 7 waiting for first job
[2022-08-27 19:12:53] Thread 9 waiting for first job
[2022-08-27 19:12:53] Thread 8 waiting for first job
[2022-08-27 19:12:53] Thread 10 waiting for first job
[2022-08-27 19:12:53] Thread 11 waiting for first job
[2022-08-27 19:12:54] Thread 0 waiting for first job
[2022-08-27 19:12:54] Thread 1 waiting for first job
[2022-08-27 19:12:54] Thread 2 waiting for first job
[2022-08-27 19:12:54] Thread 3 waiting for first job
[2022-08-27 19:12:54] Thread 4 waiting for first job
[2022-08-27 19:12:54] Thread 5 waiting for first job
[2022-08-27 19:12:54] Thread 6 waiting for first job
[2022-08-27 19:12:54] Thread 7 waiting for first job
[2022-08-27 19:12:54] Thread 9 waiting for first job
[2022-08-27 19:12:54] Thread 8 waiting for first job
[2022-08-27 19:12:54] Thread 10 waiting for first job
[2022-08-27 19:12:54] Thread 11 waiting for first job
[2022-08-27 19:12:55] Thread 0 waiting for first job
[2022-08-27 19:12:55] Thread 1 waiting for first job
[2022-08-27 19:12:55] Thread 2 waiting for first job
[2022-08-27 19:12:55] Thread 3 waiting for first job
[2022-08-27 19:12:55] Thread 4 waiting for first job
[2022-08-27 19:12:55] Thread 5 waiting for first job
[2022-08-27 19:12:55] Thread 6 waiting for first job
[2022-08-27 19:12:55] Thread 7 waiting for first job
[2022-08-27 19:12:55] Thread 10 waiting for first job
[2022-08-27 19:12:55] Thread 8 waiting for first job
[2022-08-27 19:12:55] Thread 9 waiting for first job
[2022-08-27 19:12:55] Thread 11 waiting for first job
[2022-08-27 19:12:56] Extranonce disabled, subscribe timed out
[2022-08-27 19:12:56] Stratum connection established
[2022-08-27 19:12:56] Threads restarted for new work.
[2022-08-27 19:12:56] New Stratum Diff 16384, Block 4366382, Job 5187
                      Diff: Net 4.6222e+06, Stratum 16384, Target 0.25
[2022-08-27 19:13:04] < {"id":null,"method":"mining.notify","params":["5188","a5bea714b19a2490f7aacde03277812a3c74193a6bba2d588478374e4812284c","01000000010000000000000000000000000000000000000000000000000000000000000000ffffffff20032fa0420400a50a6308","0d2f6e6f64655374726174756d2f00000000012780cffde80000001976a914f6c7f1c2cd06849dd836bb2f40244741dbc0c4fd88ac00000000",["43ff4bbcc7526c375f6f22b7a816b6b2cbc699f7afd87b154a137e04ec37c5c2","cfefdbffb5c18a48a57ac2de56c2ddd6e53e98165afafd10f219df7956053cdf","41c53c48499cdc1dccc7a2f19ef6fc7fd87775bf63619982951ff665d373eba7"],"00620004","1a034445","630aa500",true]}
[2022-08-27 19:13:05] CPU temp: curr 41 C max 0, Freq: 3.211/3.272 GHz
[2022-08-27 19:13:05] Threads restarted for new work.
[2022-08-27 19:13:05] New Block 4366383, Net diff 5.1358e+06, Job 5188
                      Diff: Net 5.1358e+06, Stratum 16384, Target 0.25
                      TTF @ 72.32 kh/s: Block 9671y181d, Share 4h07m
                      Net hash rate (est) 1838.17 Th/s
[2022-08-27 19:13:39] < {"id":null,"method":"mining.notify","params":["5189","752dc8a3330703e8f89a125bb58aac4e3113e0467b0c9ba0ac41eff721f1e42a","01000000010000000000000000000000000000000000000000000000000000000000000000ffffffff200330a0420423a50a6308","0d2f6e6f64655374726174756d2f0000000001386934dce80000001976a914f6c7f1c2cd06849dd836bb2f40244741dbc0c4fd88ac00000000",["57f6b91a72bfa999cbfc5bf9334555bb18a7b339cdfa54f65baa35f41825507f","f9e883f0f65cb06725c521566f470ca44e763c389691e584cda4db0c0e30a58f"],"00620004","1a02fe94","630aa523",true]}
[2022-08-27 19:13:39] Threads restarted for new work.
[2022-08-27 19:13:39] New Block 4366384, Net diff 5.6027e+06, Job 5189
                      Diff: Net 5.6027e+06, Stratum 16384, Target 0.25
                      TTF @ 83.53 kh/s: Block 9134y9d, Share 3h34m
                      Net hash rate (est) 1046.23 Th/s

Going to try to narrow down the point where it crashes now.

JayDDee commented 1 year ago

I'm starting to suspect an issue with the wallet, do both systems have their own wallets? If they're on the same network try mining on the other's wallet.

slightlyskepticalpotat commented 1 year ago

They were originally on different wallets. I tried mining with the 3500u wallet, 5500u wallet, and a newly created wallet on both systems, but the 3500u system always worked and the 5500u system always gave a segfault. Now trying to narrow down the point of the crash.

slightlyskepticalpotat commented 1 year ago

As you mentioned, I was able to confirm that it first crashes here. Going further into the code, it crashes here. I was able to track it to this for loop, where it looked like the program looped through it a few times, then crashed.

This is where the mystery deepens.

I changed the for loop (with no other changes to the code) to:

for ( i = 0; i < ARRAY_SIZE( work->target ); i++ )
{
    applog( LOG_INFO, "working");
    work->target[7 - i] = be32dec( target + i );
}

And it began solving blocks. The hashrate seems to match what I was seeing in benchmarks, and the miner was indistinguishable from the working system apart from the junk output. Could there be some sort of race condition here?

$ ./cpuminer --algo=scrypt --url=http://127.0.0.1:44555 --user=user --pass=pass --coinbase-addr=nfPAPyGGjsuyqRyxFfCmnA4C9cH5smSi6g
         **********  cpuminer-opt 3.20.2  *********** 
     A CPU miner with multi algo support and optimized for CPUs
     with AVX512, SHA and VAES extensions by JayDDee.
     BTC donation address: 12tdvfF7KmAsihBXQXynT6E6th2c2pByTT

[2022-08-27 21:30:40] Scrypt paramaters: N= 1024, R= 1
[2022-08-27 21:30:40] Throughput 8/thr, Buffer 256 kiB/thr, Total 3072 kiB

CPU: AMD Ryzen 5 5500U with Radeon Graphics         
SW built on Aug 27 2022 with GCC 11.2.0
CPU features:  AVX2    AES SHA
SW features:   AVX2    AES SHA
Algo features: AVX512

Starting miner with AVX2...

[2022-08-27 21:30:40] CPU affinity [!!!!!!!!!!!!]
[2022-08-27 21:30:40] working
[2022-08-27 21:30:40] working
[2022-08-27 21:30:40] working
[2022-08-27 21:30:40] working
[2022-08-27 21:30:40] working
[2022-08-27 21:30:40] working
[2022-08-27 21:30:40] working
[2022-08-27 21:30:40] working
[2022-08-27 21:30:40] 12 of 12 miner threads started using 'scrypt' algorithm
[2022-08-27 21:30:40] CPU temp: curr 42 C max 0, Freq: 0.997/1.812 GHz
[2022-08-27 21:30:40] scrypt: http://127.0.0.1:44555
                      Periodic Report     584942417355y130d        0m00s
                      Share rate        -0.00/min     0.00/min
                      Hash rate         -0.00h/s      0.00h/s   (0.00h/s)
                      Submitted             0            0
                      Accepted              0            0        0.0%
                      Hi/Lo Share Diff  0 /  9e+99
[2022-08-27 21:30:40] New Block 4013576, Net Diff 0.00024414, Ntime 40c50a63
                      Miner TTF @ 240.00 h/s 1h12m, Net TTF @ 9922.63 h/s 1m45s
[2022-08-27 21:30:47] working
[2022-08-27 21:30:47] working
[2022-08-27 21:30:47] working
[2022-08-27 21:30:47] working
[2022-08-27 21:30:47] working
[2022-08-27 21:30:47] working
[2022-08-27 21:30:47] working
[2022-08-27 21:30:47] working
[2022-08-27 21:30:52] working
[2022-08-27 21:30:52] working
[2022-08-27 21:30:52] working
[2022-08-27 21:30:52] working
[2022-08-27 21:30:52] working
[2022-08-27 21:30:52] working
[2022-08-27 21:30:52] working
[2022-08-27 21:30:52] working
[2022-08-27 21:30:56] 1 Submitted Diff 0.00039597, Block 4013576, Ntime 4cc50a63
[2022-08-27 21:30:56] 1 A1 S0 R0 BLOCK SOLVED 1, 15.620 sec (1ms)
[2022-08-27 21:30:56] working
[2022-08-27 21:30:56] working
[2022-08-27 21:30:56] working
[2022-08-27 21:30:56] working
[2022-08-27 21:30:56] working
[2022-08-27 21:30:56] working
[2022-08-27 21:30:56] working
[2022-08-27 21:30:56] working
[2022-08-27 21:30:56] New Block 4013577, Net Diff 0.00026158, Ntime 50c50a63
                      Miner TTF @ 83.73 kh/s 0m13s, Net TTF @ 9970.20 h/s 1m52s
[2022-08-27 21:30:58] 2 Submitted Diff 0.00047437, Block 4013577, Ntime 50c50a63
[2022-08-27 21:30:58] 2 A2 S0 R0 BLOCK SOLVED 2, 2.305 sec (1ms)

JayDDee commented 1 year ago

Holy shit, good work. Can you get the loop counter and ARRAY_SIZE?

Edit: It's silly code, ARRAY_SIZE controls the loop but hard coded 7 is used inside, but they should match. The target is just the 256 bit hash expressed as a uint32 array. I don't like that ARRAY_SIZE macro, might as well hard code it to 8 since the array's size is assumed inside the loop anyway.

Edit2: I realize the stupidity of my first question. Capturing the loop counter makes the problem go away so it will always be 8. Maybe ARRAY_SIZE can be captured before entering the loop without changing the behaviour.

I suspect the compiler is building that section of code differently when you add the printf. The loop is more likely to be unrolled, or even vectorized, without the printf. Try compiling with lower optimization to see if that makes a difference.

I'm not sure we'll get to the root cause but getting rid of the ARRAY_SIZE macro might be a good start. I'm not a C expert so I'm not sure if its implementation is correct. On the surface I don't see a problem with it.
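
For reference, a macro like this is usually written with the sizeof idiom shown below. This is an assumption about its general shape, not a copy of the cpuminer-opt definition, and `work_sketch` is a hypothetical stand-in for the real work struct. With a true array (not a pointer) the result is a compile-time constant, so for a uint32_t[8] member it evaluates to 8.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Common sizeof idiom; assumed here, not copied from cpuminer-opt. */
#define ARRAY_SIZE(arr) (sizeof(arr) / sizeof((arr)[0]))

/* Hypothetical stand-in for the miner's work struct. */
struct work_sketch {
    uint32_t target[8];   /* 256-bit target as eight 32-bit words */
};

/* Zero the target and return how many elements were written. */
static size_t clear_target(struct work_sketch *w)
{
    size_t n = ARRAY_SIZE(w->target);
    assert(n == 8);       /* a hard-coded index of 7 relies on this */
    for (size_t i = 0; i < n; i++)
        w->target[i] = 0;
    return n;
}
```

The idiom only misbehaves when it is handed a pointer instead of an array, in which case sizeof measures the pointer rather than the data.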

slightlyskepticalpotat commented 1 year ago

It's a bit of a challenge as placing a printf or a file write there also seems to fix it. The code I am using is

   for ( i = 0; i < ARRAY_SIZE( work->target ); i++ )
   {
      // applog( LOG_INFO, "working");
      printf ("%d %d\n", i, (int) ARRAY_SIZE( work->target ));
      work->target[7 - i] = be32dec( target + i );
   }
   fflush(stdout);

It generates output like this:

         **********  cpuminer-opt 3.20.2  *********** 
     A CPU miner with multi algo support and optimized for CPUs
     with AVX512, SHA and VAES extensions by JayDDee.
     BTC donation address: 12tdvfF7KmAsihBXQXynT6E6th2c2pByTT

[2022-08-27 21:47:32] Scrypt paramaters: N= 1024, R= 1
[2022-08-27 21:47:32] Throughput 8/thr, Buffer 256 kiB/thr, Total 3072 kiB

CPU: AMD Ryzen 5 5500U with Radeon Graphics         
SW built on Aug 27 2022 with GCC 11.2.0
CPU features:  AVX2    AES SHA
SW features:   AVX2    AES SHA
Algo features: AVX512

Starting miner with AVX2...

[2022-08-27 21:47:32] CPU affinity [!!!!!!!!!!!!]
0 8
1 8
2 8
3 8
4 8
5 8
6 8
7 8
[2022-08-27 21:47:32] 12 of 12 miner threads started using 'scrypt' algorithm
[2022-08-27 21:47:32] CPU temp: curr 43 C max 0, Freq: 1.032/2.298 GHz
[2022-08-27 21:47:32] scrypt: http://127.0.0.1:44555
                      Periodic Report     584942417355y130d        0m00s
                      Share rate        -0.00/min     0.00/min
                      Hash rate         -0.00h/s      0.00h/s   (0.00h/s)
                      Submitted             0            0
                      Accepted              0            0        0.0%
                      Hi/Lo Share Diff  0 /  9e+99
[2022-08-27 21:47:32] New Block 4013605, Net Diff 0.00071352, Ntime 34c90a63
                      Miner TTF @ 240.00 h/s 3h32m, Net TTF @ 13.12 kh/s 3m53s
0 8
1 8
2 8
3 8
4 8
5 8
6 8
7 8
0 8
1 8
2 8
3 8
4 8
5 8
6 8
7 8
0 8
1 8
2 8
3 8
4 8
5 8
6 8
7 8
0 8
1 8
2 8
3 8
4 8
5 8
6 8
7 8
0 8
1 8
2 8
3 8
4 8
5 8
6 8
7 8
0 8
1 8
2 8
3 8
4 8
5 8
6 8
7 8
0 8
1 8
2 8
3 8
4 8
5 8
6 8
7 8
0 8
1 8
2 8
3 8
4 8
5 8
6 8
7 8
0 8
1 8
2 8
3 8
4 8
5 8
6 8
7 8
0 8
1 8
2 8
3 8
4 8
5 8
6 8
7 8
0 8
1 8
2 8
3 8
4 8
5 8
6 8
7 8
0 8
1 8
2 8
3 8
4 8
5 8
6 8
7 8
0 8
1 8
2 8
3 8
4 8
5 8
6 8
7 8
[2022-08-27 21:48:40] 1 Submitted Diff 0.00083736, Block 4013605, Ntime 75c90a63
[2022-08-27 21:48:40] 1 A1 S0 R0 BLOCK SOLVED 1, 67.927 sec (2ms)
0 8
1 8
2 8
3 8
4 8
5 8
6 8
7 8
[2022-08-27 21:48:40] New Block 4013606, Net Diff 0.00065864, Ntime 78c90a63
                      Miner TTF @ 80.72 kh/s 0m35s, Net TTF @ 13.44 kh/s 3m30s

Any ideas on how I could output i while not writing to stdout or a file? Also, what does be32dec do? I think the problem may be inside there.

JayDDee commented 1 year ago

We're getting some crosstalk, I'm getting caught up with what you just wrote, I made some further comments above.

Edit: as I suspected might happen. Compiler optimization is playing a factor, but hard coding the array's size might solve the problem (meaning the crash).

Edit: be32dec is a byte swap function used to convert from Little Endian to Big Endian. It's written to be agnostic, it will return big endian data regardless of the current byte order. Intel (I mean x86 including Ryzen of course, duh) CPUs are Little Endian so it always does a byte swap.

slightlyskepticalpotat commented 1 year ago

I'm even less of a C expert, but I gave it a shot. I started every try with a fresh clone of the repo. Just curious, how did you guess that compiler optimisation was playing a factor in this? Past compiler horror stories?

-O0 -march=native -Wall: errors out during compilation
-O1 -march=native -Wall: works normally, appears slightly slower
-O2 -march=native -Wall: works normally, appears slightly faster
-O3 -march=native -Wall: segfaults
-Os -march=native -Wall: also errors during compilation

JayDDee commented 1 year ago

I'm not sure at what level loop unrolling occurs but vectorization is possible on a fixed sized loop and needs -O3. The entire array can be byte swapped in one shot using AVX2. With a printf in the loop vectorization isn't possible but loop unrolling still is.

slightlyskepticalpotat commented 1 year ago

It's definitely vectorization. -O3 -fno-tree-vectorize -march=native -Wall builds and works properly.

JayDDee commented 1 year ago

The only possibilities are wrong loop size, bad target pointer, or data is misaligned for AVX2.

I've seen misaligned data before in hand coded vector instructions, but here the compiler is deciding to vectorize so it should check alignment before doing so. Also the data is defined with 64 bit alignment, which is more than enough for AVX2. I'm dismissing this as a possibility.

An array size error seems more likely, especially if it looped a couple of times before crashing. That's a classic buffer overflow. A bad pointer would be expected to segfault on the first loop iteration.

Capturing ARRAY_SIZE( work->target ) is critical. Displaying it before the for loop should still allow the loop to be vectorized and crash. Or just hard code the loop to 8 and see if the crash goes away.

I think I'll get rid of the macro. It's used mostly for target and hash, whose size is fixed. Using the macro is unnecessary.

Edit: I think I've found part of the problem, misalignment is a possibility for the source target, I was only thinking of the destination work->target. This still involves a compiler bug because it should have detected the misalignment before vectorizing.

Here's a look at the definitions with alignment added where necessary:

static bool gbt_work_decode( const json_t *val, struct work *work )
{
   int i, n;
   uint32_t version, curtime, bits;
   uint32_t prevhash[8] __attribute__ ((aligned (32)));
   uint32_t target[8] __attribute__ ((aligned (32)));
   unsigned char final_sapling_hash[32] __attribute__ ((aligned (32)));
   int cbtx_size;
   uchar *cbtx = NULL;
   int tx_count, tx_size;
   uchar txc_vi[9];
   uchar(*merkle_tree)[32] = NULL;
   bool coinbase_append = false;
   bool submit_coinbase = false;
   bool version_force = false;
   bool version_reduce = false;
   json_t *tmp, *txa;
   bool rc = false;

slightlyskepticalpotat commented 1 year ago

   printf("%d\n", (int) ARRAY_SIZE( work->target ));
   fflush(stdout);
   for ( i = 0; i < ARRAY_SIZE( work->target ); i++ )
   {
      work->target[7 - i] = be32dec( target + i );
   }

Gives array size as 8 before the loop starts. Additionally, if I hardcode i < 8 it still segfaults.

especially if it looped a couple of times before crashing

Unfortunately, I later realised that I wasn't sure if it looped before crashing. I originally had it print the iteration count at the end of each iteration and saw it increase, but that was before I realised it would fix the issue. Hence, I'm not sure now if the loop completes any iterations before crashing.

JayDDee commented 1 year ago

Stay tuned I think I've found it!!!

JayDDee commented 1 year ago

Damn, code formatting never works for me.

I think I've found part of the problem, misalignment is a possibility for the source target, I was only thinking of the destination work->target. This still involves a compiler bug because it should have detected the misalignment before vectorizing.

Here's a look at the definitions with alignment added where necessary:

static bool gbt_work_decode( const json_t val, struct work work ) { int i, n; uint32_t version, curtime, bits; uint32_t prevhash[8] attribute ((aligned (32))); uint32_t target[8] attribute ((aligned (32))); unsigned char final_sapling_hash[32] attribute ((aligned (32))); int cbtx_size; uchar cbtx = NULL; int tx_count, tx_size; uchar txc_vi[9]; uchar(merkle_tree)[32] = NULL; bool coinbase_append = false; bool submit_coinbase = false; bool version_force = false; bool version_reduce = false; json_t tmp, txa; bool rc = false;

I don't know why attribute was in bold but it helps identify the three lines that need to be changed. That int i being the first local variable guarantees that the following arrays are misaligned. Always define arrays first. I need to do a code review to look for other similar situations.
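
The effect of the attribute can be checked at run time. What follows is a minimal sketch, not code from cpuminer-opt: the helper names `is_aligned` and `locals_are_aligned` are made up for illustration. It declares scalars ahead of the arrays, as in the definitions above, and reports whether the arrays that request 32-byte alignment actually land on a 32-byte boundary.

```c
#include <assert.h>
#include <stdint.h>

/* Nonzero when p sits on an `align`-byte boundary.
   `align` must be a power of two. Hypothetical helper. */
static int is_aligned(const void *p, uintptr_t align)
{
    assert(align != 0 && (align & (align - 1)) == 0);
    return ((uintptr_t)p & (align - 1)) == 0;
}

/* Scalars first, then arrays carrying the GCC aligned attribute.
   Returns 1 when both arrays ended up on 32-byte boundaries. */
static int locals_are_aligned(void)
{
    int i = 0, n = 0;
    uint32_t prevhash[8] __attribute__((aligned(32))) = {0};
    uint32_t target[8]   __attribute__((aligned(32))) = {0};

    (void)i;
    (void)n;
    return is_aligned(prevhash, 32) && is_aligned(target, 32);
}
```

With GCC the attribute is what requests the boundary, so a check like this is a quick way to confirm a patched declaration took effect before testing for the crash again.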

slightlyskepticalpotat commented 1 year ago

I'm probably going to go to sleep soon, but do let me know if you need any help testing! I don't understand vectorization enough to guess—do you have a guess as to why this problem only shows up on some systems?

JayDDee commented 1 year ago

Can you do a quick test with alignment added since you can reproduce the crash, I can't. I'm pretty confident now but confirmation would be nice.

Agree on the sleep, we must be in the same time zone. If this works I'll sleep well tonight.

slightlyskepticalpotat commented 1 year ago

static bool gbt_work_decode( const json_t val, struct work work ) { int i, n; uint32_t version, curtime, bits; uint32_t prevhash[8] attribute ((aligned (32))); uint32_t target[8] attribute ((aligned (32))); unsigned char final_sapling_hash[32] attribute ((aligned (32))); int cbtx_size; uchar cbtx = NULL; int tx_count, tx_size; uchar txc_vi[9]; uchar(merkle_tree)[32] = NULL; bool coinbase_append = false; bool submit_coinbase = false; bool version_force = false; bool version_reduce = false; json_t tmp, txa; bool rc = false;

Are you able to put this up on pastebin so I can download and test it? I think GitHub may have removed some of the underscores.

Edit: nevermind, I got it.

JayDDee commented 1 year ago

You're right. Two leading and two trailing underscores in attribute. There are many examples in the code if you grep -r attribute.

slightlyskepticalpotat commented 1 year ago

Oops. I did it with this and it segfaulted again.

static bool gbt_work_decode( const json_t *val, struct work *work )
{
   int i, n;
   uint32_t version, curtime, bits;
   uint32_t prevhash[8] __attribute__(( aligned(32)));
   uint32_t target[8] __attribute__(( aligned(32)));
   unsigned char final_sapling_hash[32] __attribute__(( aligned(32)));
   int cbtx_size;
   uchar *cbtx = NULL;
   int tx_count, tx_size;
   uchar txc_vi[9];
   uchar(*merkle_tree)[32] = NULL;
   bool coinbase_append = false;
   bool submit_coinbase = false;
   bool version_force = false;
   bool version_reduce = false;
   json_t *tmp, *txa;
   bool rc = false;

Edit: I did figure out the minor mystery of attribute being in bold though. Turns out when you type __this__ on GitHub it shows as this.

JayDDee commented 1 year ago

Oh well, maybe have to sleep on it.

JayDDee commented 1 year ago

Some thoughts to sleep on...

The crash is indeed reported as a segfault. A misaligned address should throw a processor exception in the same way a divide by zero or invalid instruction does. A segfault should always be an invalid pointer address. At least that's the way it works on some processor architectures I'm more familiar with.

Counterpoint, it only crashes when greater-than-default data alignment is required, that is when the loop is vectorized.

We need to see the work->target & target pointers.

slightlyskepticalpotat commented 1 year ago

Some last tests before sleep.

Code:

   printf("%p %p\n", work->target, target);
   for ( i = 0; i < ARRAY_SIZE( work->target ); i++ )
      work->target[7 - i] = be32dec( target + i );

With the __attribute__(( aligned(32))) patch:

0x7f677c002150 0x7f6782a40ce0
Segmentation fault (core dumped)

Without the patch:

0x7f88b8002150 0x7f88beabfce0
Segmentation fault (core dumped)

Unfortunately, I don't have a very good understanding of pointers so I'm mostly lost here.

JayDDee commented 1 year ago

You're one up on me, I wasn't aware of %p.

Both those pointers are properly aligned. The low 6 address bits are zero which provides 64 byte alignment, more than requested. Both pointers also look good. I don't know the memory mapping but both are within 4 GB of each other.
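
The low-bit reading can be double-checked mechanically. Here is a minimal sketch (the helper name `address_alignment` is hypothetical) that isolates the lowest set bit of an address, which is the largest power-of-two boundary the address sits on.

```c
#include <assert.h>
#include <stdint.h>

/* Largest power of two that divides addr, i.e. the alignment the
   address actually provides. Hypothetical helper for illustration. */
static uint64_t address_alignment(uint64_t addr)
{
    assert(addr != 0);
    return addr & (~addr + 1);   /* keeps only the lowest set bit */
}
```

Applied to the values printed with the patch, this gives 16 for 0x7f677c002150 (work->target, the first %p argument) and 32 for 0x7f6782a40ce0 (target). The addresses printed without the patch end in the same digits and give the same results. An aligned 256-bit access such as vmovdqa on a ymm register needs a 32-byte boundary, and on x86-64 Linux the resulting general-protection fault is reported as SIGSEGV, so the first address may deserve a second look.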

There's something else going on that seems to be specific to your CPU. This copy loop is used frequently in stratum code and has never crashed before. It also doesn't crash on your other CPU with every other controllable variable the same.

I assume the same vectorization occurs on that CPU.

The only difference architecturally is the addition of VAES in Ryzen 5000. That will result in kernel changes as well as any AES related code. Much of the affected code is in cpuminer-opt but only in the hashing code and definitely not anywhere near where it's crashing.

At this point it looks like one or both properly aligned and apparently valid pointers is causing a segfault when the optimiser auto-vectorizes a loop. But if auto-vectorization is disabled in the compiler there is no segfault. It occurs persistently on one particular CPU and never on a very similar CPU with identical OS, compiler and source code.

The crash itself is a mystery, from all the data available it shouldn't crash. That it doesn't crash on the Ryzen 3500U, or apparently anywhere else, makes it even more mysterious. That it only crashes when the code is auto-vectorized, well...

I'm stumped.

slightlyskepticalpotat commented 1 year ago

Finally did some testing from a live usb of 22.04. Still segfaults, so I don't think it's the system. -O2 worked as expected.

With the patch: 0x7f1878002150 0x7f187f5dece0

Without the patch: 0x7fba58002150 0x7fba5f1d7ce0

JayDDee commented 1 year ago

I don't really know where to go from here. My first suspect is the CPU itself but I'm not confident enough to point the finger. I assume your CPU is up to date with any fixes from AMD.

The only thing I can think of is tinkering around the edges, change the GCC version or OS version to try to provoke different behaviour. It's essentially poking around the black box to try to get a reaction from inside. Opening the black box would be better but I don't know how and if I did I wouldn't know what to look for once inside.

Since you are the only one known to have experienced this crash I'll leave it up to you if you want to try to dig deeper. I'm still curious so the issue should remain open but in a dormant state awaiting new data that might provide a lead I can pursue.

Edit: Depending on your comfort level you could hand code the vectored byte swap of 256 bits of data. I just happen to have a macro to do exactly that. It's called "mm256_bswap_32" and it's defined in simd-utils/simd-256.h. This should isolate the compiler's optimizer from the process and talk directly to the CPU.

Edit2: GDB could also be used to display the assembly code to confirm the byte swap was correctly coded. The vector memory accesses should use a VMOVDQA instruction and the byte swap itself should be a VPSHUFB.
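
For reference, a hand-coded per-word byte swap of 256 bits might look like the sketch below. It is not the project's mm256_bswap_32 macro: the function names are made up, the unaligned load/store intrinsics are used so the sketch does not depend on how the buffers were allocated, and a scalar fallback is included so it also builds and runs where AVX2 is unavailable. It assumes GCC or Clang.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#if defined(__x86_64__) || defined(__i386__)
#include <immintrin.h>
#define SKETCH_HAVE_X86 1
#endif

/* Scalar reference: swap the bytes of each 32-bit word, keep word order. */
static void bswap32x8_scalar(uint32_t dst[8], const uint32_t src[8])
{
    for (int i = 0; i < 8; i++)
        dst[i] = __builtin_bswap32(src[i]);
}

#ifdef SKETCH_HAVE_X86
/* AVX2 version: one VPSHUFB with a per-lane index that reverses
   the four bytes inside every 32-bit word. */
static __attribute__((target("avx2")))
void bswap32x8_avx2(uint32_t dst[8], const uint32_t src[8])
{
    const __m256i idx = _mm256_setr_epi8(
        3, 2, 1, 0, 7, 6, 5, 4, 11, 10, 9, 8, 15, 14, 13, 12,
        3, 2, 1, 0, 7, 6, 5, 4, 11, 10, 9, 8, 15, 14, 13, 12);
    __m256i v = _mm256_loadu_si256((const __m256i *)src);
    _mm256_storeu_si256((__m256i *)dst, _mm256_shuffle_epi8(v, idx));
}
#endif

/* Dispatcher: use AVX2 when the running CPU reports it, else scalar. */
static void bswap32x8(uint32_t dst[8], const uint32_t src[8])
{
    assert(dst && src);
#ifdef SKETCH_HAVE_X86
    if (__builtin_cpu_supports("avx2")) {
        bswap32x8_avx2(dst, src);
        return;
    }
#endif
    bswap32x8_scalar(dst, src);
}
```

Comparing the output of `bswap32x8` with `bswap32x8_scalar` on the same input is a cheap way to confirm the shuffle index is right before looking at the disassembly.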

slightlyskepticalpotat commented 1 year ago

I've installed all the updates, so it should have all the fixes. I've learned a lot from this so far, so I think I'm going to try and poke at it a bit more with the versions and vectored byte swap over the next few days. I'll keep you updated if I find anything interesting!

JayDDee commented 1 year ago

Learning is always good. Here's a resource to help you understand the vector instructions, I couldn't have done any of this without it.

https://www.intel.com/content/www/us/en/docs/intrinsics-guide/index.html#

JayDDee commented 1 year ago

I will address a couple of peripheral issues discovered during the investigation. I will stop using ARRAY_SIZE for fixed-size arrays like hash and target. Instead I'll define a constant that will be used to define the array's size and control any for loops. ARRAY_SIZE can still be used for dynamically sized arrays.
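
A minimal sketch of that idea follows. The names `TARGET_WORDS`, `work_sketch` and `copy_target` are invented for illustration and are not taken from cpuminer-opt.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical constant: one definition sizes the array
   and bounds every loop that walks it. */
#define TARGET_WORDS 8

struct work_sketch {
    uint32_t target[TARGET_WORDS];
};

/* Copy a target word by word. The loop bound comes from the
   constant rather than from ARRAY_SIZE( dst->target ). */
static void copy_target(struct work_sketch *dst,
                        const uint32_t src[TARGET_WORDS])
{
    assert(dst && src);
    for (int i = 0; i < TARGET_WORDS; i++)
        dst->target[i] = src[i];
}
```

The array length and the loop bound cannot drift apart this way, and the bound stays a plain compile-time constant for the optimizer.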

I will also do the code review to look for potential data misalignment issues. I still run into alignment problems when adding new vectorized code, most often when accessing legacy data structures in hashing code. Core code should be immune because GCC should enforce necessary alignment when auto-vectorizing (data seems to support that). But just to be safe I'll force alignment on any data structures that relate to hash or may be accessed with vector instructions.

I have no timetable for a release, I don't have enough other material to warrant a new release yet and I'd like to see how this issue progresses in the chance something interesting is found in cpuminer-opt.

slightlyskepticalpotat commented 1 year ago

I don't think I understand enough C to hand-code it right now. If I'm understanding the following code correctly, it should build a vector with the values, swap the bytes, and copy it to the desired location. However, it gives a compile error. Do you see anywhere I've gone wrong?

   memcpy(work->target, mm256_bswap_32(_mm256_set_epi32(target + 7, target + 6, target + 5, target + 4, target + 3, target + 2, target + 1, target)), 256);
   /*for ( i = 0; i < ARRAY_SIZE( work->target ); i++ )
      work->target[7 - i] = be32dec( target + i );*/

./simd-utils/simd-256.h:495:4: error: incompatible type for argument 2 of ‘memcpy’
  495 |    _mm256_shuffle_epi8( v, \
      |    ^~~~~~~~~~~~~~~~~~~~~~~~~
      |    |
      |    __m256i
  496 |          m256_const_64( 0x1c1d1e1f18191a1b, 0x1415161710111213, \
      |          ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
  497 |                         0x0c0d0e0f08090a0b, 0x0405060700010203 ) )
      |                         ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
cpu-miner.c:902:25: note: in expansion of macro ‘mm256_bswap_32’
  902 |    memcpy(work->target, mm256_bswap_32(_mm256_set_epi32(target + 7, target + 6, target + 5, target + 4, target + 3, target + 2, target + 1, target)), 256);
      |                         ^~~~~~~~~~~~~~
In file included from /usr/include/features.h:486,
                 from /usr/include/x86_64-linux-gnu/bits/libc-header-start.h:33,
                 from /usr/include/stdio.h:27,
                 from cpu-miner.c:27:
/usr/include/x86_64-linux-gnu/bits/string_fortified.h:26:1: note: expected ‘const void * restrict’ but argument is of type ‘__m256i’
   26 | __NTH (memcpy (void *__restrict __dest, const void *__restrict __src,
      | ^~~~~
make[2]: *** [Makefile:2824: cpuminer-cpu-miner.o] Error 1

JayDDee commented 1 year ago

Don't use memcpy, just *( (__m256i*)( work->target ) ) = mm256_bswap_32( *( (__m256i*)target ) );

edit: added pointer cast, tripped over github formatting again, trying again.

slightlyskepticalpotat commented 1 year ago

My bad, I tried something like that before but missed a bracket. It still segfaults, but this is the gdb output. I see VMOVDQA, but scrolled back a bit and couldn't find VPSHUFB. How far back would it be?

Just to be clear, the code I'm using right now is:

   *( (__m256i*)( work->target ) ) = mm256_bswap_32( *( (__m256i*)target ) );
   /*for ( i = 0; i < ARRAY_SIZE( work->target ); i++ )
      work->target[7 - i] = be32dec( target + i );*/

         **********  cpuminer-opt 3.20.2  *********** 
     A CPU miner with multi algo support and optimized for CPUs
     with AVX512, SHA and VAES extensions by JayDDee.
     BTC donation address: 12tdvfF7KmAsihBXQXynT6E6th2c2pByTT

[2022-08-28 14:48:17] Scrypt paramaters: N= 1024, R= 1
[2022-08-28 14:48:17] Throughput 8/thr, Buffer 256 kiB/thr, Total 256 kiB

CPU: AMD Ryzen 5 5500U with Radeon Graphics         
SW built on Aug 28 2022 with GCC 11.2.0
CPU features:  AVX2    AES SHA
SW features:   AVX2    AES SHA
Algo features: AVX512

Starting miner with AVX2...

[2022-08-28 14:48:17] Coinbase address uses B58 coding
[2022-08-28 14:48:17] CPU affinity [!!!!!!!!!!!!]
[New Thread 0x7ffff6870600 (LWP 93487)]
[New Thread 0x7fffeffff600 (LWP 93488)]
[2022-08-28 14:48:17] 1 of 12 miner threads started using 'scrypt' algorithm
[2022-08-28 14:48:17] Default miner thread priority 0 (nice 19)
[2022-08-28 14:48:17] Binding thread 0 to cpu 0

Thread 2 "cpuminer" received signal SIGSEGV, Segmentation fault.
[Switching to Thread 0x7ffff6870600 (LWP 93487)]
0x000055555555efe2 in ?? ()
(gdb) display/32i $pc
1: x/32i $pc
=> 0x55555555efe2:  vmovdqa %ymm0,(%r14)
   0x55555555efe7:  vzeroupper 
   0x55555555efea:  call   0x55555556c550
   0x55555555efef:  mov    %r15,%rdi
   0x55555555eff2:  vmovsd %xmm0,0x100(%r14)
   0x55555555effb:  lea    0x22e28f(%rip),%rsi        # 0x55555578d291
   0x55555555f002:  vmovsd %xmm0,0x2a150e(%rip)        # 0x555555800518
   0x55555555f00a:  call   0x55555555ca90 <json_object_get@plt>
   0x55555555f00f:  mov    %rax,%rdi
   0x55555555f012:  test   %rax,%rax
   0x55555555f015:  je     0x55555555e0fc
   0x55555555f01b:  cmpl   $0x2,(%rax)
   0x55555555f01e:  je     0x55555555f0b3
   0x55555555f024:  lea    0x22e259(%rip),%rsi        # 0x55555578d284
   0x55555555f02b:  mov    $0x3,%edi
   0x55555555f030:  xor    %eax,%eax
   0x55555555f032:  xor    %r13d,%r13d
   0x55555555f035:  call   0x555555568550
   0x55555555f03a:  jmp    0x55555555e0fc
   0x55555555f03f:  movslq %r14d,%rax
   0x55555555f042:  inc    %r14d
   0x55555555f045:  shl    $0x5,%rax
   0x55555555f049:  lea    (%rbx,%rax,1),%rdx
   0x55555555f04d:  lea    -0x20(%rbx,%rax,1),%rax
   0x55555555f052:  vmovdqu (%rax),%xmm3
   0x55555555f056:  vmovdqu %xmm3,(%rdx)
   0x55555555f05a:  vmovdqu 0x10(%rax),%xmm4
   0x55555555f05f:  vmovdqu %xmm4,0x10(%rdx)
   0x55555555f064:  jmp    0x55555555eeea
   0x55555555f069:  lea    0x22e1f6(%rip),%rsi        # 0x55555578d266
   0x55555555f070:  mov    $0x3,%edi
   0x55555555f075:  xor    %eax,%eax
(gdb)

JayDDee commented 1 year ago

If that is where it crashed it looks like it crashed on the store to the destination. Note the following call to json_object_get to get a point of reference in the source code. You need to back up several instructions to catch the source load and the shuffle. There will also be some other code to generate the shuffle index.

However I think this test helped eliminate the compiler as the culprit. You captured the crash and you can display the contents of the vector pointer in r14. If you can reproduce the segfault with an older OS and compiler with both the compiled vector bswap and the hand coded vector bswap it eliminates everything except the CPU as the problem.

Edit: corrected pointer register reference. A vector pointer is in an R register, the vector data is in YMM.

slightlyskepticalpotat commented 1 year ago

Here are the last few instructions for reference. I think I'm going to try it on 20.04 on the same machine now to see if I can confirm the CPU as the problem.

(gdb) x/16i $pc-32
   0x55555555efc2:  test   %bh,%dh
   0x55555555efc4:  add    (%rax),%eax
   0x55555555efc6:  add    %al,%ch
   0x55555555efc8:  std    
   0x55555555efc9:  outsl  %ds:(%rsi),(%dx)
   0x55555555efca:  lods   %ds:(%rsi),%eax
   0x55555555efcb:  mov    $0xfe,%al
   0x55555555efcd:  (bad)  
   0x55555555efce:  decl   -0x4b(%rbx,%rcx,4)
   0x55555555efd2:  jo     0x55555555efd2
   0x55555555efd4:  (bad)  
   0x55555555efd5:  decl   -0x9(%rcx,%rcx,4)
   0x55555555efd9:  
    vpshufb 0x2322de(%rip),%ymm5,%ymm0        # 0x5555557912c0
=> 0x55555555efe2:  vmovdqa %ymm0,(%r14)
   0x55555555efe7:  vzeroupper 
   0x55555555efea:  call   0x55555556c550

And here's the data in the registers—can't make any sense of it though.

(gdb) i r r14
r14            0x7ffff0002150      140737219928400
(gdb) i r ymm0
ymm0           {v16_bfloat16 = {0xfff, 0x0, 0x0, 0xff00, 0x0 <repeats 12 times>}, v16_half = {0xfff, 0x0, 0x0, 0xff00, 0x0 <repeats 12 times>}, v8_float = {0xfff, 0xff000000, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0}, v4_double = {0xff00000000000fff, 0x0, 0x0, 0x0}, v32_int8 = {0xff, 0xf, 0x0, 0x0, 0x0, 0x0, 0x0, 0xff, 0x0 <repeats 24 times>}, v16_int16 = {0xfff, 0x0, 0x0, 0xff00, 0x0 <repeats 12 times>}, v8_int32 = {0xfff, 0xff000000, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0}, v4_int64 = {0xff00000000000fff, 0x0, 0x0, 0x0}, v2_int128 = {0xff00000000000fff, 0x0}}
(gdb) i r ymm5
ymm5           {v16_bfloat16 = {0x0, 0xff0f, 0xff, 0x0 <repeats 13 times>}, v16_half = {0x0, 0xff0f, 0xff, 0x0 <repeats 13 times>}, v8_float = {0xff0f0000, 0xff, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0}, v4_double = {0xffff0f0000, 0x0, 0x0, 0x0}, v32_int8 = {0x0, 0x0, 0xf, 0xff, 0xff, 0x0 <repeats 27 times>}, v16_int16 = {0x0, 0xff0f, 0xff, 0x0 <repeats 13 times>}, v8_int32 = {0xff0f0000, 0xff, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0}, v4_int64 = {0xffff0f0000, 0x0, 0x0, 0x0}, v2_int128 = {0xffff0f0000, 0x0}}

JayDDee commented 1 year ago

The data isn't really important, it's the pointer. I'm not sure about the upper bits of the pointer but the lower bits look the same as previous tests. I don't see the load of the source which should be in ymm5. The optimizer may have moved it earlier so it would be available when needed.

I don't know what (bad) means. It didn't crash so it's not an invalid instruction. Maybe it's just an artifact from the compiler optimization. I don't think it changes my opinion that it's a CPU problem.

If it's reproducible with different GCC/OS versions my opinion will be stronger.

Edit: Somewhat speculative but the different high bits in the address could be due to different ways of referencing virtual memory. Virtual addresses are always made up of two parts: the segment, and the segment offset. If the source address looks the same and didn't crash we can probably dismiss it.

slightlyskepticalpotat commented 1 year ago

Interesting...on gcc 9.4.0 and Ubuntu 20.04, the compiler-optimised version worked, but the hand-coded version segfaulted. On gcc 11.1.0 and 20.04, the hand-coded version and compiler-optimised version both segfaulted. I assume the automatic optimisation method changed sometime between gcc 9 and 11.

To me, it seems to be a particular cpu (or cpu architecture, I don't see why my particular cpu would be different) and compiler combination.

JayDDee commented 1 year ago

I tend to agree. Most likely a bad part, maybe a bad batch of 5500U, maybe a mobile only issue, desktops not likely affected. Mobile CPUs are rarely used for mining (I don't recommend doing it) so it's less likely an issue like this affecting only mobile zen3 would be found.

slightlyskepticalpotat commented 1 year ago

I haven't noticed anything wrong with other programs and found a workaround for this one, so I'm happy. Just needed to mine some testnet dogecoin to test with :). Also, the 5500U is actually Zen 2—does that change anything architecture-wise?

JayDDee commented 1 year ago

It might make it more like the 3500U if it's also zen2 based. It also means the architecture has a lot more history and less likely to have undiscovered problems.

I was wondering about issues with other programs and assumed you'd have mentioned it. There may be CPU test programs available, I've also heard 7zip uses vectors extensively.

This was fun, it's unfortunate I couldn't take it to the end but it needs more expertise in x86 and AMD's implementation of it.

slightlyskepticalpotat commented 1 year ago

I extensively use 7z and haven't noticed a problem so far, so I assume everything else is fine. Going to close the issue now—you're right, this was fun!

JayDDee commented 1 year ago

One final idea that might narrow down the precise fault. In the handcoded patch you could separate the load & store and use loadu which is intended to safely access misaligned data:

__m256i x = mm256_bswap_32( _mm256_loadu_si256( (__m256i*)target ) ); _mm256_storeu_si256( ( (__m256i*)(work->target, x);

This should help remove the ambiguity of misaligned access fault (it only faults on vector access) and write protection fault (the load works but the store faults). I'm not suggesting that's really occurring but the CPU thinks it is.

slightlyskepticalpotat commented 1 year ago

I assume it's like this with the brackets (I got a compilation error when trying to compile that)?

   __m256i x = mm256_bswap_32( _mm256_loadu_si256( (__m256i*)target ) );
   _mm256_storeu_si256( ( (__m256i*)(work->target)), x);

Edit: For unknown reasons, when I use those two lines it doesn't crash but also doesn't find any blocks.

JayDDee commented 1 year ago

Yes I was adding another level then realized it wasn't needed but didn't remove the left bracket, you fixed it by adding the right bracket.

Interesting result but I'm not sure what it means exactly. It would appear the patch failed to properly bswap the data and store it in work->target. My best guess is it's another symptom of the same problem. The segfault was avoided, suggesting that the initial fault was alignment related and was avoided by using the "safe" accesses. But the operation still failed to execute properly.

It's still a mystery why it's only this small piece of code.

slightlyskepticalpotat commented 1 year ago

(Sorry, I don't think you saw the edit I made to my previous message.)

JayDDee commented 1 year ago

I think I did. I said the operation failed. The target is used to test the hash difficulty in deciding to submit it. No blocks were found because the target was invalid and nothing passed the test and got submitted. It could have also gone the other way, with an invalid hash submitted and promptly rejected. The target was likely all zeros which represents infinite difficulty, so nothing would ever pass the test.
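
Another possible reading of the "no blocks" result, offered as a hypothesis rather than a diagnosis: the original loop stores into work->target[7 - i], so it reverses the order of the eight words as well as swapping the bytes in each one, while the shuffle constants shown in the earlier compile error only swap bytes within each 32-bit word. The sketch below compares the two in plain C. `be32dec_sketch` is a stand-in for the project's be32dec, and the other names are made up.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Read a big-endian 32-bit value; stand-in for be32dec. On a
   little-endian host this equals a byte swap of the stored word. */
static uint32_t be32dec_sketch(const void *pp)
{
    const uint8_t *p = (const uint8_t *)pp;
    return ((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16) |
           ((uint32_t)p[2] << 8)  |  (uint32_t)p[3];
}

/* The original loop: decode each word AND reverse the word order. */
static void decode_loop(uint32_t dst[8], const uint32_t src[8])
{
    for (int i = 0; i < 8; i++)
        dst[7 - i] = be32dec_sketch(src + i);
}

/* A per-word swap alone: same decoding, word order unchanged. */
static void decode_swap_only(uint32_t dst[8], const uint32_t src[8])
{
    for (int i = 0; i < 8; i++)
        dst[i] = be32dec_sketch(src + i);
}

/* Returns 1 when both routines produce the same target for src. */
static int decoders_agree(const uint32_t src[8])
{
    uint32_t a[8], b[8];
    assert(src);
    decode_loop(a, src);
    decode_swap_only(b, src);
    return memcmp(a, b, sizeof a) == 0;
}
```

The two routines agree only when the input is symmetric in its word order (all zeros, for example). For a typical target they differ, which would leave work->target holding a wrong value without any fault being raised.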

JayDDee / cpuminer-opt

Segfault on v3.20.2 and Ryzen 5 5500U #379