JayDDee / cpuminer-opt

Optimized multi algo CPU miner

[v. 3.18 only] First periodic report may not be correct #337

Closed YetAnotherRussian closed 2 years ago

YetAnotherRussian commented 2 years ago

Startup, then a bunch of low diff shares (using -t 8 and avx2 build):

image

then the first report with years and sub-zeroes:

image

This one is not 100% reproducible, but a simple batch "script" with a for loop and a 15s timeout (then kill the process) should help to reproduce it. It happens in roughly 1 launch out of 5.

JayDDee commented 2 years ago

The period was only 5 seconds, that's why the data is invalid. The early report was because stratum diff changed, and that changes all the calculations, so I have to start a new period. You can confirm this by observing when the problem occurs vs when it doesn't. The reference hash rate was zero, that may be significant.

This could be a race condition. I'll look into it. However any fix may be limited to avoiding negative values. If the reference hash rate is zero there is no sample yet so the data will still be invalid.

Edit: I'm sure 12 shares in 5 seconds also contributed to the wonky stats.

JayDDee commented 2 years ago

I think I've found the problem. The start timestamp was not initialized properly, which resulted in a negative time. That negative time was then used in many calculations and turned them negative as well. The "correct" display should have been all zero, which isn't any better.

All stats are very volatile at startup. Early activity like a couple of quick shares found, or an unusual delay until the first share is found, or a short stats period will increase the volatility. So the only thing I can really fix is the negative sign.

JayDDee commented 2 years ago

You seem to find the weirdest problems but this one exposed a legitimate bug. Very low starting difficulty caused a flood of shares to be submitted. This triggered stratum to raise the difficulty. The current stats period is always terminated when stratum diff changes, resulting in a 5 second period. The combination of a high share rate and a very short period, with no sample hash rate available yet, was never going to be pretty.
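To illustrate why a burst of shares in a 5 second period produces unreliable numbers, here is a minimal sketch in C. It assumes the common Bitcoin-style convention that one share at difficulty d represents about d * 2^32 hashes on average; some algos scale difficulty differently. The helper name `share_hashrate` and the constant are hypothetical and are not taken from cpuminer-opt.

```c
/* Expected hashes per share at difficulty 1, assuming the Bitcoin-style
   convention (an assumption; some algos scale difficulty differently). */
#define HASHES_PER_DIFF1 4294967296.0   /* 2^32 */

/* Hypothetical helper: hash rate implied by the shares found in one
   stats period. Returns 0 when the period length is not positive. */
static double share_hashrate(unsigned shares, double diff, double secs)
{
    if (secs <= 0.0)
        return 0.0;
    return shares * diff * HASHES_PER_DIFF1 / secs;
}
```

For the same share count, `share_hashrate(12, diff, 5.0)` is 60 times `share_hashrate(12, diff, 300.0)`, so a period cut short by a difficulty change inflates the estimate. If share arrivals are treated as roughly Poisson, 12 shares also carry a statistical scatter of about 1/sqrt(12), or roughly 29%.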

Before I release a fix, do you have any more?

YetAnotherRussian commented 2 years ago

Before I release a fix, do you have any more?

Not yet. I've just deployed your new version, so it will take some time to find issues, if any. The one above was found immediately... Feel free to release. If you need an env to test the fix on your side, use http://p2p-spb.xyz:6002/static/ as a source of low diff jobs (and any algo from cpu-mineable coins out there - third line of coin switching buttons). Coin addresses (connected miners) are non-trimmed plain text there, so you may use 'em.

JayDDee commented 2 years ago

If you're deploying in numbers that explains why you find the most obscure bugs.

I was also waiting for feedback on Ryzen SHA vs AVX2 for those algos that can do sha256 either way, such as scryptn2. I don't know what your favorite algo is but if it includes sha256 and you have any Ryzens you should try the AVX2 build to see if you get better performance than using the native build with SHA.

If AVX2 is consistently faster I'll change the default to use it instead of SHA on Ryzen CPUs.

YetAnotherRussian commented 2 years ago

If you're deploying in numbers that explains why you find the most obscure bugs.

Not only that, I've been on the QA side for years (but sadly I'm a newbie in C/C++, data encryption and strong math - I write in Go, shell script, Windows batch, AutoIt... and JS/CSS/SASS/HTML for personal projects). Those PCs belong to me (not some illegal stuff like botnets etc.).

I don't know what your favorite algo is but if it includes sha256 and you have any Ryzens you should try the AVX2 build to see if you get better performance than using the native build with SHA.

As for Ryzens, I have a 3400GE (Zen+) and a 5700GE (Zen3). I do not have any of the models with 32/64 MB of L3 cache. I will test if I can find a sha256d/sha256t pool with low difficulty (not for ASICs). Benchmark mode should work as well.

JayDDee commented 2 years ago

sha256d would not be a valid test, it still uses the old code because I haven't been able to test the new code. sha256t is good, I think it's still on zpool. Scrypt also uses sha256, and scryptn2 (N=1024**2) is a CPU algo available at a few popular pools. Sha256t only plays a small part in that algo but I saw a difference between SHA and AVX2 on an r7-1700.

I did all my professional work using a proprietary system, where the compiler, OS and application were tightly integrated. I did emergency recovery (24/7, 5 minute response), first level HW & SW support, second level SW support, third level SW design support and some pure design as well. I've seen a lot of bugs. I'm also new to C/C++, but much more comfortable on Linux than Windows.

With your QA background you may have heard of the IBM black team of testers back a few decades ago. These people were sadistic in their testing and were feared. The black lab coats made them even more sinister when everyone else wore white ones. Their attitude was to try to break the software any way they could. I try to take the same approach but without the drama.

If you haven't heard of them you can find lots about them on Google, you might find it interesting.

YetAnotherRussian commented 2 years ago

you may have heard of the IBM black team of testers back a few decades ago. These people were sadistic in their testing and were feared. The black lab coats made them even more sinister when everyone else wore white ones. Their attitude was to try to break the software any way they could. I try to take the same approach but without the drama

Yeah, it's a well-known story :) But in terms of mining software, the best targets are performance, portability aka cpu dispatching, error handling, usability and correct stats. I don't know why those GUIs exist, or why someone should try to paint a "GUI" in the console with ASCII symbols. That work is just useless, while those commercial miners have serious issues with network error handling (like t-rex), log fake stats to justify the fees taken, use brainscrewing CLI parameters and connect to suspicious third-party hosts. The only thing I do not like in cpuminer-opt is the lack of a working .sln. I've upgraded different cpuminer-multi versions to msvc2019 before, and I know what a pain it is to do that now (fix refs, integrate subrepos, rebuild all libs and fix issues in 'em, rework special gcc things to support msvc and/or icc, etc.), so I've forgotten about those thoughts. These days WSL works pretty well, as does virtualization. So I prefer to build in Linux rather than using old (mingw) or newer (tdm64) gcc versions in Windows.

Will test "avx2 vs sha" on those two Ryzen machines in 1-2 days. Sadly I don't have an Intel CPU with proper SHA support (12900K, 11700K, 7900X, 10900X, Silver/Gold/Platinum etc.) to test on. I only have this one with SHA:

image

But it is used for TV and is slow... from any side.

YetAnotherRussian commented 2 years ago

Bench mode is affected as well:

image


Ryzen 7 5700GE (cpuminer-opt v. 3.18.0, Win10 x64, mem 3800MHz CL16, dual-channel)

"cpuminer-avx2.exe -t 1 --cpu-affinity 1 -a scrypt:1048576 --benchmark" --> 7.92 H/s avg
"cpuminer-zen3.exe -t 1 --cpu-affinity 1 -a scrypt:1048576 --benchmark" --> 5.80 H/s avg

"cpuminer-avx2.exe -t 1 --cpu-affinity 1 -a sha256t --benchmark" --> 7020 kH/s avg
"cpuminer-zen3.exe -t 1 --cpu-affinity 1 -a sha256t --benchmark" --> 17.15 MH/s avg - yay! is this correct?

"cpuminer-avx2.exe -t 1 --cpu-affinity 1 -a x16r --benchmark" --> 322 kH/s avg
"cpuminer-zen3.exe -t 1 --cpu-affinity 1 -a x16r --benchmark" --> 334 kH/s avg

"cpuminer-avx2.exe -t 1 --cpu-affinity 1 -a yescryptr8g --benchmark" --> 1112 H/s avg
"cpuminer-zen3.exe -t 1 --cpu-affinity 1 -a yescryptr8g --benchmark" --> 1030 H/s avg

"cpuminer-avx2.exe -t 1 --cpu-affinity 1 -a yespower --benchmark" --> 350 H/s avg
"cpuminer-zen3.exe -t 1 --cpu-affinity 1 -a yespower --benchmark" --> 340 H/s avg

"cpuminer-avx2.exe -t 1 --cpu-affinity 1 -a myr-gr --benchmark" --> 1850 kH/s avg
"cpuminer-zen3.exe -t 1 --cpu-affinity 1 -a myr-gr --benchmark" --> 2050 kH/s avg

JayDDee commented 2 years ago

"cpuminer-zen3.exe -t 1 --cpu-affinity 1 -a sha256t --benchmark" --> 17.15 MH/s avg - yay! is this correct?

That's the speed I get with AVX512; 7 MH/s/thr is what I get with AVX2. I'm not sure what's going on here. Edit: AVX512 is more than 2x AVX2 because it benefits from the ternary logic instruction, a fascinating instruction for a software guy.
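For readers unfamiliar with the ternary logic instruction (`vpternlogd`, exposed as the `_mm512_ternarylogic_epi32` intrinsic): it evaluates any three-input boolean function in one instruction, selected by an 8-bit truth table. SHA-256's Ch and Maj functions are exactly such three-input functions, which otherwise take several AND/XOR operations each. The sketch below is portable C that only derives the truth-table immediates; it does not need AVX512 hardware. The helper names are hypothetical and this is an illustration of the idea, not code from cpuminer-opt.

```c
#include <stdint.h>

/* SHA-256 "choose" and "majority", evaluated on single bits. */
static unsigned ch_bit(unsigned x, unsigned y, unsigned z)
{
    return (x & y) ^ ((x ^ 1u) & z);
}

static unsigned maj_bit(unsigned x, unsigned y, unsigned z)
{
    return (x & y) ^ (x & z) ^ (y & z);
}

typedef unsigned (*bool3_fn)(unsigned, unsigned, unsigned);

/* Build the 8-bit truth table used as the immediate of a ternary logic
   instruction: bit number (a<<2 | b<<1 | c) holds f(a, b, c). */
static uint8_t ternlog_imm8(bool3_fn f)
{
    uint8_t imm = 0;
    for (unsigned i = 0; i < 8; i++) {
        unsigned a = (i >> 2) & 1u, b = (i >> 1) & 1u, c = i & 1u;
        if (f(a, b, c) & 1u)
            imm |= (uint8_t)(1u << i);
    }
    return imm;
}
```

`ternlog_imm8(ch_bit)` gives 0xCA and `ternlog_imm8(maj_bit)` gives 0xE8, so on AVX512 each of Ch and Maj can be a single `_mm512_ternarylogic_epi32(x, y, z, imm)` call.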

myr-gr test is not valid because zen3 also has VAES which is used by myr-gr. That's the only other one where AVX2 was slower.

There's another subtle difference. Sometimes SHA competes with AVX2 8-way, sometimes with 4-way; it depends on the structure of the algo. If it has to work with 512 bit hash functions that use 64 bit data (64 bit arith mostly), those functions are limited to 4-way with AVX2. Sha256t uses 8-way sha256 with AVX2, while chained algos like x16 would use 4-way for any 256 bit hash functions because most of the other functions are 512. Against 4-way, SHA would win.
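The 8-way versus 4-way distinction comes down to vector width divided by word size: a 256 bit AVX2 register holds eight 32 bit words (sha256 works on 32 bit words) but only four 64 bit words. A tiny sketch, with `simd_ways` as a hypothetical helper name:

```c
/* Number of independent hashes ("ways") that fit in one vector register
   when each hash operates on words of word_bits bits. */
static int simd_ways(int vector_bits, int word_bits)
{
    return vector_bits / word_bits;
}
```

So `simd_ways(256, 32)` is 8 for sha256 on AVX2, `simd_ways(256, 64)` is 4 for the 64 bit functions, and a chained algo that keeps all its data in the 4-way layout runs its sha256 step 4-way as well.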

Thanks for the test, it confirms my data and the default will be changed in the next release to prioritize 8-way sha256 over SHA. There are very few instances where 4-way sha256 is used with AVX2 (x22i & x25x), but they will continue to use SHA when available.

I'm not sure how cpuminer would work on a Goldmont with SHA, never tried with or without SHA. It would need to be properly compiled, and I might have to tweak the feature selection code at startup but I don't see any serious issues, other than it's a very small/weak CPU.

JayDDee commented 2 years ago

I found a performance issue with SSE2-AVX for scryptn2, it's now slower than before. It seems it's because of repeated bit rotations. They're fast on AVX512 but require 3 instructions before AVX512. The legacy code had the sequential rotates interleaved to reduce dependencies. I missed that. I need to do some more work before the next release.
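As a rough illustration of the three-instruction rotate and of interleaving, here is a minimal SSE2 sketch. It assumes an x86 target with `<emmintrin.h>`; the macro and function names are hypothetical, and the rotation counts 7 and 9 are simply two of the Salsa20 constants used inside scrypt. This is not the cpuminer-opt code, and a compiler is free to reschedule the instructions, so it only shows the idea.

```c
#include <emmintrin.h>   /* SSE2 intrinsics */
#include <stdint.h>

/* Rotate each 32-bit lane left by the constant n (0 < n < 32).
   Without a vector rotate instruction this takes three operations:
   shift left, shift right, OR. */
#define ROL32_SSE2(v, n) \
    _mm_or_si128(_mm_slli_epi32((v), (n)), _mm_srli_epi32((v), 32 - (n)))

/* Rotate two independent vectors, issuing all four shifts before the
   two ORs so that neither rotate waits on the other's result. */
static inline void rol32_pair_7_9(__m128i *a, __m128i *b)
{
    __m128i a_hi = _mm_slli_epi32(*a, 7);
    __m128i b_hi = _mm_slli_epi32(*b, 9);
    __m128i a_lo = _mm_srli_epi32(*a, 32 - 7);
    __m128i b_lo = _mm_srli_epi32(*b, 32 - 9);
    *a = _mm_or_si128(a_hi, a_lo);
    *b = _mm_or_si128(b_hi, b_lo);
}

/* Scalar reference for checking a single lane (0 < n < 32). */
static inline uint32_t rol32(uint32_t x, int n)
{
    return (x << n) | (x >> (32 - n));
}
```

With AVX512 the same rotate is a single instruction, which is why the missing interleaving only hurts the older instruction sets.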

The problem isn't noticed on AVX2 because the legacy AVX2 code is in fact slower than the legacy AVX code, so AVX2 still showed a net gain in v3.18.0. Another surprise discovery.

YetAnotherRussian commented 2 years ago

I don't see any serious issues, other than it's a very small/weak CPU.

Well, I don't think someone should use Atom (let's talk honestly) CPUs for mining, and that includes the E-cores in the new Alder Lake CPUs. Yes, they claim vector extension support in those, but with the x264 encoder in mind we know that some CPUs have fake, aka slow, instruction support, like the case mentioned @ line 139 here - https://github.com/mstorsjo/x264/blob/master/x264.h, and those instructions do not help much. Should be pretty interesting to see AVX2 performance @ Alder Lake in case of dropped AVX512F support.

JayDDee commented 2 years ago

Well, I don't think someone should use Atom (let's talk honestly) CPUs for mining

There's a case to be made for power efficiency by using many small power efficient CPUs instead of a large power hungry one. But scaling is a problem because you can't build a rig with a dozen CPUs like you can with GPUs.

JayDDee commented 2 years ago

cpuminer-opt-3.18.1 is released with a fix to the negative stats in a premature summary report. It was due to initializing the start time to zero instead of the current time. If the first summary report was generated before the proper start time was set, it would result in a negative time and negative values wherever that time was used in calculations. I also fixed a potential divide by zero under similar circumstances.

I should note the delay in setting the actual start time is to allow the stratum connection to be set up, the miner threads to be created, and the first job to be received before hashing actually starts. Including this dead time would result in inaccurate hash rates.
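The general shape of such a fix can be sketched as follows. This is a minimal illustration assuming a hypothetical `period_stats` record; the struct, field and function names are invented for the example and do not reflect the actual cpuminer-opt source or the exact path that produced the negative interval. The period starts at the current time, and the rate calculation refuses to divide by a zero or negative interval.

```c
#include <time.h>

/* Hypothetical stats record; the names are illustrative only. */
struct period_stats {
    time_t start;    /* start of the current reporting period */
    double hashes;   /* hashes counted since start */
};

/* Begin a period at the current time rather than at zero. */
static void period_begin(struct period_stats *p, time_t now)
{
    p->start  = now;
    p->hashes = 0.0;
}

/* Hash rate for the period ending at 'now'. A zero or negative interval
   yields 0 instead of a division by zero or a negative rate. */
static double period_hashrate(const struct period_stats *p, time_t now)
{
    double secs = difftime(now, p->start);
    if (secs <= 0.0)
        return 0.0;
    return p->hashes / secs;
}
```

A report generated before any time has elapsed then shows zero, which matches the earlier remark that the "correct" early display would have been all zero.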

JayDDee commented 2 years ago

Should be pretty interesting to see AVX2 performance @ Alder Lake in case of dropped AVX512F support.

That's really lame. They disabled AVX512 because the ecores don't have it and the OS can't deal with different architectures. I was wondering how they would solve that problem, obviously they didn't.

AVX512 has been a bit of a bust due to the many years of delays and poor performance. Some operations scale negatively with the bigger vectors. For example, creating a vector constant takes 2x+1 instructions for a 2x sized vector.

Now Intel has opened the door for AMD to jump ahead with a desktop with AVX512, SHA and VAES before Intel. That hasn't happened since AMD beat Intel with a 64 bit desktop CPU.

I think Alderlake is a loser architecture. Its only benefit seems to be to reduce lag under brief heavy loads. It's the kind of thing that annoys users. It might be useful on a laptop where you trade battery time for less annoyance. But during light loads the pcores are deadweight, and the ecores are useless during heavy loads. It can't seem to perform any kind of task particularly well. A poor compromise IMO. Give me all ecores or all pcores and I'll choose the one that's better for my needs.

YetAnotherRussian commented 2 years ago

I was wondering how they would solve that problem, obviously they didn't.

It seems this should be solved on a server platform (not sure about workstation CPUs though, as they are just desktop copies). We'll see.

It might be useful on a laptop where you trade battery time for less annoyance.

Well, as I'm not that bad with hardware, I tested a bunch of laptops before buying one... My tests show that even HT/SMT is totally useless in a 10-15W TDP package. It helps in benchmarks (to sell that hw, so to speak), but never in real life (that incl. some mad scenarios like CPU mining @ laptop). Extra cores and HT/SMT lead to lower base clocks (it takes time to go from base to boost clock, and this causes very bad performance in some cases like compilation of a big .NET project) and higher energy consumption. In case of HT/SMT, L1 and L2 are shared between cores as well. Cheap laptops do not have a decent cooling system for something like 6 cores/12 threads. After all, I went for a Ryzen 3 4300U laptop, as it has 4c/4t and a base clock of 2.7GHz. The system has pretty good performance and holds boost clocks for a long period, while being able to live up to 13 hours on battery. Newer ones (5300U) got SMT, which should help to drain the battery 20-25% faster, that's it (and benchmarks, of course).

cpuminer-opt-3.18.1 is released with a fix to the negative stats in a premature summary report

Will test soon.

YetAnotherRussian commented 2 years ago

Seems to be fixed, thanks.