JayDDee / cpuminer-opt

Optimized multi algo CPU miner

Segfault using scrypt:1048576 @ linux #228

Closed YetAnotherRussian closed 4 years ago

YetAnotherRussian commented 4 years ago

Tried to compile the latest version on both Ubuntu 19.10 x64 and Ubuntu 18.04.3 LTS x64, with both GCC 9.2.1 and GCC 8.3, using each of these flag sets:

-O2 -march=native -Wall
-O3 -march=native -Wall
-Ofast -march=native -Wall

Tried -march=znver2 as well

Using algo scrypt:1048576, CPU is Ryzen 9 3900X, 16 GB RAM

** cpuminer-opt 3.11.3 ***
A CPU miner with multi algo support and optimized for CPUs with AVX512, SHA and VAES extensions.
BTC donation address: 12tdvfF7KmAsihBXQXynT6E6th2c2pByTT

[2020-01-14 04:51:06] Scrypt paramaters: N= 1048576, R= 1.
CPU: AMD Ryzen 9 3900X 12-Core Processor .
SW built on Jan 14 2020 with GCC 9.2.1.
CPU features: AVX2 AES SHA
SW features: AVX2 AES SHA
Algo features: AVX2

Starting miner with AVX2...

[2020-01-14 04:51:06] 16 CPU cores available, 1 miner threads selected.
[2020-01-14 04:51:06] Starting Stratum on stratum+tcp://london.blockbucket.net:3003
[2020-01-14 04:51:06] 1 miner threads started, using 'scrypt' algorithm.
[2020-01-14 04:51:15] New stratum diff 0.01, block 424884, job 8b98
london.blockbucket.net:3003 scrypt block 424884
Diff: net 0.00259958, stratum 0.01, target 1.52588e-07
Segmentation fault (core dumped)

Changing thread count to odd or even values does not help.

There are no issues on Windows (paging file size is 32 GB on NVMe), but when I use more than 6 threads, the miner quits:

[2020-01-14 16:03:28] 24 CPU cores available, 12 miner threads selected.
[2020-01-14 16:03:28] Starting Stratum on stratum+tcp://london.blockbucket.net:3003
[2020-01-14 16:03:28] 12 miner threads started, using 'scrypt' algorithm.
[2020-01-14 16:03:28] Thread 7: Scrypt buffer allocation failed
[2020-01-14 16:03:28] Thread 11: Scrypt buffer allocation failed
[2020-01-14 16:03:28] FAIL: thread 11 failed to initialize
[2020-01-14 16:03:28] Thread 10: Scrypt buffer allocation failed
[2020-01-14 16:03:28] FAIL: thread 10 failed to initialize
[2020-01-14 16:03:28] Thread 9: Scrypt buffer allocation failed

Pool: blockbucket.net
Coin: Verium (https://miningpoolstats.stream/veriumreserve)
To get a test address, visit coinexchange.io (or just use VVbHMJHbfK4RCxbfpayryR7SXs4mek5wJq, it's a test one)

So it's the Scrypt algo with N=2^20, i.e. scrypt:1048576 (the syntax for your version). No issues with rejected shares or anything like that, btw. I'm unsure, but this one should get SHA_OPT as well...

Don't know why the memory requirement is so huge; it should be around 128 MB per thread or similar, I guess.

Thanks for any possible assistance anyway.

Btw, there's something like a reference version, https://github.com/fireworm71/veriumMiner, which does not crash on RAM usage (I've managed to compile that one with MSVC, but I can't do that with cpuminer-opt, as you know).

JayDDee commented 4 years ago

Not sure if -N works for scrypt, I may have got the math wrong. Try -a scrypt:1048576 without -N and -R (-R 1 is default). That used to work AFAIK. It was smart enough to figure out whether it was N or Nfactor.
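As an illustration of the "N or Nfactor" idea, here is a minimal sketch of how a single numeric parameter could be interpreted either way. It assumes the common scrypt-N convention N = 2^(Nfactor + 1); the function names and the `NFACTOR_MAX` threshold are hypothetical and are not taken from cpuminer-opt.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical cut-off: values up to this are read as Nfactor, larger ones as N. */
#define NFACTOR_MAX 30u

/* True when v has exactly one bit set. */
static bool is_power_of_two(uint64_t v)
{
    return v != 0 && (v & (v - 1)) == 0;
}

/* Resolve a user-supplied scrypt parameter to N.
 * Small values are treated as Nfactor, with N = 2^(Nfactor + 1).
 * Larger values are treated as N itself and must be a power of two.
 * Returns 0 when the parameter is not usable. */
static uint64_t scrypt_resolve_n(uint64_t param)
{
    if (param == 0)
        return 0;
    if (param <= NFACTOR_MAX)
        return UINT64_C(1) << (param + 1);
    return is_power_of_two(param) ? param : 0;
}
```

Under these assumptions, `scrypt_resolve_n(1048576)` and `scrypt_resolve_n(19)` both resolve to 1048576, and `scrypt_resolve_n(9)` resolves to 1024, the usual scrypt default.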

Might work for Windows, not likely for Linux. It's interesting it crashes on Linux, but not on Win. Usually it's the other way around.

I can't dig into it at the moment, but if you're familiar with gdb it can find where it's crashing. That would give me a head start.

Excellent problem description BTW, good data.

JayDDee commented 4 years ago

I have a question. How long after the block info is displayed does it crash? Is it near instantaneous or does it run for a noticeable time before crashing?

YetAnotherRussian commented 4 years ago

I have not passed the -R cli arg, just used "... -a scrypt:1048576 -o ... -u ..." like you said; the R logging is hardcoded:

if ( !opt_param_n )
{
   opt_param_n = 1024;
   scratchbuf_size = 1024;
}
else
   scratchbuf_size = opt_param_n;
applog( LOG_INFO, "Scrypt paramaters: N= %d, R= 1.", opt_param_n );

I've looked into the code to figure out how this value is set and how to set it myself before. Somewhere it's set from the predefined array (like https://github.com/nicehash/sgminer-gm/blob/master/kernel/zuikkis.cl), and somewhere just as a direct value.

The crash is immediate, <500ms after logging the stratum-received job.

Tried to force to SSE2 (-O1 -march=x86-64), but that doesn't help.

It wouldn't be me (qa side) if I did not perform this:

3.8.8.1 - works
...
3.9.5.4 - works
3.9.9 - works
3.9.9.1 - segfault! Introduced here!
3.9.10 - segfault
3.9.11 - segfault
...
3.11.3 - segfault

Something was broken in 3.9.9.1... Please note that regular scrypt (-a scrypt -o ...) is broken as well.

JayDDee commented 4 years ago

My mistake, I was reading it wrong. It looks like it's reporting the parameters correctly. By my math it's 1 GB per thread. How much physical mem do you have? It looks like it's only using 8 GB (6 threads plus OS) before it runs out. Maybe it's failing to use the page file. It's not the same as a GPU miner where VM is mapped but is never actually used. With cpuminer the VM gets mapped, allocated and used. It should theoretically use the page file transparently but maybe that's the issue.
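For reference, the two per-thread estimates in this thread (about 128 MB and about 1 GB) can be related with a little arithmetic. Standard scrypt needs 128 * r * N bytes of scratchpad per hash; the sketch below computes that, plus a per-thread total for a hypothetical number of hashes computed in parallel. The function names and the `lanes` parameter are illustrative assumptions, not cpuminer-opt code.

```c
#include <stdint.h>

/* Bytes needed for the scrypt scratchpad (the V array) of ONE hash:
 * N blocks of 128 * r bytes each. */
static uint64_t scrypt_v_bytes(uint64_t n, uint64_t r)
{
    return 128u * r * n;
}

/* Hypothetical per-thread estimate. `lanes` is the number of hashes a
 * thread computes in parallel; it is an assumption for illustration,
 * not a value read from cpuminer-opt. */
static uint64_t scrypt_thread_bytes(uint64_t n, uint64_t r, unsigned lanes)
{
    return scrypt_v_bytes(n, r) * lanes;
}
```

With N = 1048576 and r = 1, `scrypt_v_bytes` gives 134217728 bytes (128 MiB). `scrypt_thread_bytes(1048576, 1, 8)` would give 1073741824 bytes (1 GiB), but whether the miner really keeps 8 hashes in flight per thread is an assumption.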

Using the page file isn't practical because the disk swapping would kill performance. Unless there's a math error and it's allocating more than it needs, I suggest you stick with 6 threads with alternating affinity, i.e. 0xaaa or 0x11111111.
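To see which logical CPUs a given affinity mask selects, a small decoder helps. This sketch assumes the usual convention that bit i of the mask corresponds to logical CPU i; the function name is hypothetical, and how logical CPU numbers map to physical cores and SMT siblings depends on the OS and platform.

```c
#include <stddef.h>
#include <stdint.h>

/* Write the indices of the set bits of `mask` into `cpus`, lowest first,
 * and return how many were written (at most `max_cpus`).
 * Bit i set is taken to mean "logical CPU i is allowed". */
static size_t affinity_mask_to_cpus(uint64_t mask, int *cpus, size_t max_cpus)
{
    size_t count = 0;
    for (int bit = 0; bit < 64 && count < max_cpus; bit++) {
        if (mask & (UINT64_C(1) << bit))
            cpus[count++] = bit;
    }
    return count;
}
```

Under that convention, 0xaaa decodes to the six CPUs 1, 3, 5, 7, 9 and 11, while 0x11111111 decodes to eight CPUs spaced four apart (0, 4, 8, ..., 28).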

The Linux crash will require some in depth troubleshooting. I'm not sure when I'll get to it.

YetAnotherRussian commented 4 years ago

Well, regarding the Linux problem, I've found the version that introduced it, see above (I've edited my comment). I do not have an AVX-512 CPU to test whether the problem is due to the AVX-512 fallback or the modified intrinsics =(

JayDDee commented 4 years ago

I can't find the stratum url for blockbucket to do a test.

I tried a benchmark test on both versions and get different results. Both versions work with scrypt default but get a kill signal with scrypt:1048576. Same with current version.

There was a code change to scrypt.c in v3.9.9.1 but it was benign and applied to all algos. Scrypt doesn't use any shared code so it's unlikely it was the victim of a change elsewhere.

I'm not confident v3.9.9.1 is significant. I also found an issue with benchmark for scrypt so the benchmark test isn't valid.

JayDDee commented 4 years ago

Another question, I can probably assume the answer but I ask just in case. Is there anything unusual about your system or procedures and are there any problems with other algos?

JayDDee commented 4 years ago

Another update. I fixed the benchmark issue locally and it now works with 6 threads but gets killed with more. Ryzen 1700, 8 GB RAM, Ubuntu 18.04LTS, gcc 7.4.0.

I'm beginning to suspect an issue with your system. The 2 key data points are my ability to reproduce it and your observation of different behaviour between v3.9.9 & v3.9.9.1 with no significant code change. Maybe you can retest that for consistent results.

When I have the recipe for the pool I can test myself and that should answer a few questions.

YetAnotherRussian commented 4 years ago

Well, I've tested several algos (the ones I do not have to look for coin wallets, pools etc.):

yespower lyra2z lyra2z330 yescryptr16 yescryptr32 m7m

No issues here. I do not see anything special on the system; I always compile your version and use it w/o any issues, and I do such things regularly. Ofc I've checked "free -m" to make sure I'm not running low on memory, but 16 gigs should be enough for a single thread :-)

To test scrypt:1048576 on pool, use these:

-a scrypt:1048576 -o stratum+tcp://london.blockbucket.net:3003 -t 1 -u VVbHMJHbfK4RCxbfpayryR7SXs4mek5wJq -p x

I've even tried to disable SMT, precision boost and some cores for this test, but...

[screenshot]

UPD.: oops, benchmark mode does not cause segfault on any version! Neither for scrypt:1048576 nor scrypt. I do not see any output for a long time, but I see the appropriate CPU load for any version.

Benchmark mode "mines" at least: -t 1, -t 2, -t 4, -t 6, -t 8 - all these work (produce appropriate load)

Random algo example on version 3.9.9.1 @ stratum:

[screenshot]

Works!

JayDDee commented 4 years ago

OK, I can reproduce it in 3.9.9.1 but not in v3.9.9.

The change to scrypt in v3.9.9.1 was simply how to time the first scan. verium is the slowest algo I've ever seen. That would explain the long time with no logs but not why it now crashes.

Now that I can reproduce it I can dig in further.

JayDDee commented 4 years ago

It's not crashing in scrypt code but stratum code, the same stratum code used by every other algo without problems. And also used by verium on Windows without problems.

Why it happens only on Linux and only on verium is a mystery. Still digging.

YetAnotherRussian commented 4 years ago

Yep, setting up the wrong algo:

-a m7m -o stratum+tcp://london.blockbucket.net:3003 -t 1 -u VVbHMJHbfK4RCxbfpayryR7SXs4mek5wJq -p x

on this pool, crashes the miner too. I thought it's a given pool issue, so I've tried another pool:

-a scrypt:1048576 -o stratum+tcp://mining.xpoolx.com:2100 -u VVbHMJHbfK4RCxbfpayryR7SXs4mek5wJq -p c=VRM -t 1

and got a crash there, too (with any algo set)

Below is a regular scrypt (ASIC) stratum pool with working credentials, in case you need them (to compare what is received from stratum: scrypt vs scrypt:1048576):

-a scrypt -o stratum+tcp://scrypt.mine.zergpool.com:3433 -u DP5F6znAugFiNSpWxLSiPnLbSkZ2qS6ui9 -p c=DGB

JayDDee commented 4 years ago

I'm getting inconsistent results in my testing. I suspect a race condition at startup.

The crash occurs because of a null pointer in work.job_id. This is part of a test for refreshing work by looking for a new job sent by the pool. This test has never encountered a null pointer before.

I tried to find the code change in v3.9.9.1 that triggered it and narrowed it down some. It's in cpuminer.c but not in functions stratum_thread or std_get_new_work.

There was a change to a startup job_id check in the miner_threads. The miner threads wait for the first job before they start hashing. When I reverted this change it stopped the segfault. But the idle CPU problem, fixed in v3.9.9.1, returned. The miner wouldn't start hashing until the second job.

That wasn't good enough. The idle CPU problem existed prior to v3.9.9.1 but isn't always obvious unless you're watching for it.

The fix ended up being to add a null pointer test before dereferencing it to compare job_ids. A null pointer is treated as a positive test result and forces new work.
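A minimal sketch of that kind of guard, assuming job ids are C strings that may still be NULL before the first job arrives. The function name and parameters are hypothetical; this is not the actual cpuminer-opt patch.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Return true when the miner should fetch new work: either one of the
 * job ids is missing (NULL) or the two ids differ. */
static bool job_id_changed(const char *work_job_id, const char *pool_job_id)
{
    /* A missing id counts as "changed", so strcmp never sees NULL. */
    if (work_job_id == NULL || pool_job_id == NULL)
        return true;
    return strcmp(work_job_id, pool_job_id) != 0;
}
```

With this shape, `job_id_changed(NULL, "8b98")` returns true instead of crashing inside `strcmp`, and two equal ids return false so the thread keeps hashing its current work.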

I don't know why this null pointer issue came up but it seems to produce inconsistent results indicating a possible timing issue or race condition.

The issue of threads and memory is also present on Linux. However, I never saw a clean failure; it always results in an OOM kill. This could be a boundary issue where the buffer gets allocated but some other app triggers the OOM killer and the biggest memory user gets killed. This is somewhat speculative. I don't think I'll pursue this because there's no benefit. If the system runs out of physical memory you need to reduce threads. It's as simple as that.

I need to do more testing to make sure there are no side effects to the new fix but it should be in the next release.

JayDDee commented 4 years ago

cpuminer-opt-3.11.4 is released. Please retest.

YetAnotherRussian commented 4 years ago

Seems to be fixed, thank you! Tested all the stratum algos I currently have wallets/pools for (algo coverage is around 20-25%).

Btw, do you have any plans on uplifting GCC from 7.3 to... 9.2 at least (I highlight 9.2 rather than 9.2.1, as I see 9.2 is used in mingw-w64 for now)? Plenty of changes over 2 years, incl. zen2 support (I haven't looked into the code to see if they truly accounted for the changes in zen2 vs zen1/zen+, e.g. 2x wider FMAs, as some new instructions are not used in cpuminer-opt). Zen2 could also get a separate release binary build.

Thanks anyway!

JayDDee commented 4 years ago

I plan on upgrading my build system to Ubuntu 20.04LTS when it comes out which includes gcc-9.2.1 I believe. Someone has already done some testing with zen2, with no observable change over zen1. I was hoping Ryzen 3000 series would have improved AVX2 but apparently not.

My general philosophy for Windows binaries is minimalist, only one binary per CPU feature set. The only reason for the zen build is because Ryzen includes SHA, which no other major CPU currently has. Maybe zen3 will improve AVX2 (currently no better than AVX) or add AVX512 which will require a new build.

For now the next planned binary is SHA-AVX512-VAES to support Intel Icelake which should be available for mainstream CPUs sometime this year.

Thanks for finding this. It was a subtle error I still don't fully understand. Due to the inconsistent symptoms it looks like a race condition. This variation of scrypt has the lowest hashrate of any algo. Being on the fringe it would be most susceptible to marginal timing issues. Ironically the change that broke verium fixed a timing/synchronization issue that affected the fastest algos.

YetAnotherRussian commented 4 years ago

Well, my tests on the mining side (I had a 2700X some time ago, now replaced with a 3900X) show that there definitely is an improvement on the AVX2 side, but there's still no real benefit from SHA. I've already tested march=goldmont-plus builds of cpuminer-opt, and that crappy CPU benefits fantastically from it, while the Ryzens gain several percent in the best case (and the worst part is that the SSE2 build is even faster, or the same). Zen3 should add an L3 cache shared across the whole CCD, so the typical mining usage ("mine+work", "mine+play games", "mine+use virtualization") would suffer a lot because of cache poisoning. But yeah, 100% gamers should get somewhat better fps due to lowered latencies. We're not among 'em =(