Open madscientist159 opened 6 years ago
xmr-stak has never supported anything but amd64/x64
xmrig forked a while back and added x86 (32-bit), ppc64, and maybe NEON, but none of that ever got backported, mainly because there was nobody to support it and nowhere to test it.
@Spudz76 I do support ARM 64/32-bit (NEON), but I have never supported ppc64; as you rightly noted, there is nowhere to test it.
I'd be happy to provide long-term access to a free test box if it would mean official support. I'm going over the assembly files now to see what's required to port the v8 algorithm as well.
Assembly is not required; it is an optional part to increase performance. The algorithm is also implemented in C++.
@xmrig Looking into your build system now. With the hard fork coming up the forked xmr-stak is going to become useless for both myself and a bunch of other miners real quick, so if xmrig can be brought up with C++ first then tuned over time that's probably what I'll be doing.
Something interesting about ppc64 systems is that one almost never drops to raw assembly; gcc / Clang provide vector intrinsics that are sufficient (in fact, trying to hand tune assembly beyond the level of using / tuning the intrinsics normally causes a performance drop -- the compilers are quite good for ppc64 targets).
If I got you access to a ppc64 test box, is this something you'd be interested in supporting?
What is the hashrate of a Power8 or Power9 for cn_v7?
@psychocrypt For a typical dual 18 core server, hash rate is 4KH/s. For the higher end dual 22 core boxes, hash rate normally exceeds 5KH/s (depending on how much electrical power one wants to throw at mining, mostly).
Neat, thanks for the corrections. I thought I saw ppc64 CN code in something somewhere... maybe one of the reference C++ implementation repos
Unfortunately CNv2 is running at about half the speed of CNv1 on the POWER processors. Massive amounts of time are wasted in vec_sqrt; is there an alternative that is known to work better for non-x86 CPUs?
EDIT: Even the ARM fallback of non-vectorized sqrt() is yielding terrible results. Did Monero really just lock mining to arguably backdoored / user hostile ASICs (Intel / AMD CPUs)?
The hand modified ASM was an important piece of performance on those CPUs also
Normal sqrt is a bit heavier than what really needs to be done (shortcuts - integer sqrt with low accuracy is much faster than true float with real accuracy)
Same was a problem for CUDA and for that matter OpenCL. Everyone expects people to do important math with sqrt whilst this CNv2 can be close enough for hand grenades (rounded hard). Nobody has a builtin quick-dirty-int-sqrt.
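To illustrate the "quick integer sqrt" idea above, here is a minimal Python sketch of a square root computed with integer operations only, using Newton's iteration. It is not the CNv2 routine and not code from xmr-stak or xmrig; the function name `isqrt_newton` is made up for this example. It only shows that a floor square root can be had from shifts, adds and integer divides without touching the FPU.

```python
def isqrt_newton(n: int) -> int:
    """Floor of the square root of a non-negative integer, integer math only."""
    if n < 0:
        raise ValueError("n must be non-negative")
    if n < 2:
        return n
    # Start from a power of two that is guaranteed to be >= sqrt(n).
    x = 1 << ((n.bit_length() + 1) // 2)
    while True:
        y = (x + n // x) // 2  # Newton step using integer division only
        if y >= x:
            # The sequence stopped decreasing, so x is the floor square root.
            return x
        x = y


print(isqrt_newton(2**64 - 1))  # prints 4294967295
```

Whether something like this beats a hardware double-precision sqrt depends entirely on the target; the integer divide in each step is itself expensive on many CPUs.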
@Spudz76 I've been poking at the assembly since I figured something like that would be in play, but even with some initial tuned assembly (basically forcing xssqrtdp instead of the vectorized version) I'm only seeing ~800H/s on a machine that used to do 1400H/s. Perf shows the only hotspot as xssqrtdp.
For reference, xssqrtdp is described as "VSX Scalar Square Root Double-Precision". Looking at the x86 assembly for Ryzen, a very similar instruction is used (sqrtsd).
POWER does in general have a weaker FPU than x86 (and stronger I/O / AES), but this type of cut seems extreme IMO.
GCC says it has some intrinsics for a few other sqrt/div functions but they may not be much faster.
These built-in functions are available for the PowerPC family of processors:
float __builtin_recipdivf (float, float);
float __builtin_rsqrtf (float);
double __builtin_recipdiv (double, double);
double __builtin_rsqrt (double);
They still operate on float/double, while the other approach works on integers that fit in the registers and is thus fast. I think PPC still uses an ALU (math coproc), so the float calls probably stack up on that, and it can't do the sqrt and div in parallel. AMD/Intel get their speed back by being able to speculatively execute the div and sqrt simultaneously and without memory access. That penalizes ASICs permanently, as they can't do parallel speculation; maybe that hits PPC also. It should be even slower if it were hitting memory, so it must still be in cache but not in registers, and the div has to wait for the sqrt to complete (causing wait states that don't exist on x64).
@SChernykh would be much better able to help; I am just regurgitating my understanding of CNv2.
@Spudz76 I agree that we're not going out to main memory, otherwise the perf trace would be showing hotspots on the register load / store (been there, done that with CNv1). POWER does have speculative execution though, so I'm not quite ready to buy the lack of speculation argument.
Do we actually need all 64 bits of precision on the result?
Footnote: It's really a shame that, if we can't fix the performance, this is going to encourage locked / closed ASICs (AMD/Intel CPUs) to be used to mine and use a privacy-sensitive cryptocurrency instead of open / owner controlled CPUs like POWER. There were other ways to check for the presence of a proper FPU other than making CryptoNight into a sqrt() microbenchmark.
But the thing is, ASICs get hit with a 4x to 12x performance kill. You only have 2x, so it can probably be optimized back to the 0.15x loss or so that other stuff sees. Either way you still win over ASICs, just not by as much as Intel/AMD CPUs or GPUs do. And after the fork everyone loses 0.15x, so it's all the same.
Also check out the OpenCL implementation of the sqrt+div trick it may be clearer.
@Spudz76 Oh? I thought the other Intel / AMD stuff etc. was at 1x? If everyone else is at 1.7x that's a different story altogether.
In any case it does look like the fast inverse trick might be a way out of the mess. It looks like it provides enough precision for CNv2 but I still need to write a vectorized version to see what the actual speedup will be (currently running quick A/B tests with the scalar variants).
edited - had my math backward they lose 15% avg except some GPUs actually are 0% or slight speedup
overall the global hashrate will drop some but mostly it was a trade of 15% for 400%-1200% kick in the ASIC nuts (making them permanently unprofitable or at least on par with just running GPUs at worse cost)
also noticed this but it is not CNv2 yet? idk if maintainer still does things - it was updated for current CNv1 (aka "v7") at least, and lost 9% performance from the previous original CNv1 with no extra-tweak ("Monero Classic").
@Spudz76 The maintainer seems to be gone or lost interest, hence why I've been attempting a port directly. Problem is while the ASICs were kicked off the network, so were the only halfway decent non-vendor-controlled CPUs (assuming there isn't any way around the FPU performance problem). Not sure that was a great tradeoff, seems to me just exchanging one kind of ASIC for another.
The vector variants of the fast inverse don't have enough precision to make this work. The only way to speed things up at this point that I can see would be to somehow shove a second square root operation through at the same time as the first; as it stands, half the vector unit is doing nothing during the square root operation.
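The precision problem with estimate-based square roots can be sketched numerically. The Python below is an illustration under stated assumptions, not the POWER code path: `coarse_rsqrt` is a hypothetical stand-in for a low-precision reciprocal-square-root estimate (it simply rounds the true value to a given number of bits), and `refine` applies one Newton-Raphson step. Each step roughly doubles the number of correct bits, so a coarse estimate needs several dependent multiply steps before it gets close to full double precision.

```python
import math


def coarse_rsqrt(x: float, bits: int) -> float:
    """Hypothetical estimate: 1/sqrt(x) kept to `bits` significant bits."""
    exact = 1.0 / math.sqrt(x)
    mantissa, exponent = math.frexp(exact)  # exact == mantissa * 2**exponent
    scale = 1 << bits
    return math.ldexp(round(mantissa * scale) / scale, exponent)


def refine(x: float, y: float) -> float:
    """One Newton-Raphson step for y approximating 1/sqrt(x)."""
    return y * (1.5 - 0.5 * x * y * y)


def rel_error(x: float, y: float) -> float:
    """Relative error of y as an approximation of 1/sqrt(x)."""
    return abs(y * math.sqrt(x) - 1.0)


x = 2.0
y = coarse_rsqrt(x, 8)          # assumed 8-bit starting estimate
errors = [rel_error(x, y)]
for _ in range(2):
    y = refine(x, y)
    errors.append(rel_error(x, y))

print(errors)  # the error shrinks sharply with each refinement step
```

Multiplying the refined value by `x` gives an approximation of `sqrt(x)`. Whether the last bits then match a true IEEE square root is a separate question, and that is where an estimate-based path can fall short.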
@madscientist159, nioroso-x3 ported 2.5.0 to ppc64 yesterday - https://github.com/nioroso-x3/xmr-stak Performance is down for my 22 core Power8 from 4300H/s to 2600H/s for Monero7, that's a pity :(
@Balzhur How did you test it? On killallasics.moneroworld.com? Old algorithm should be just as fast as before, new algorithm is slower, that's normal.
@SChernykh, on Nicehash. No changes to config compared to 2.4.7, i.e.
2.4.7
{ "low_power_mode" : 2, "no_prefetch" : false, "affine_to_cpu" : 0 },
{ "low_power_mode" : 2, "no_prefetch" : false, "affine_to_cpu" : 1 },
2.5.0
{ "low_power_mode" : true, "no_prefetch" : false, "asm" : "off", "affine_to_cpu" : 0 },
{ "low_power_mode" : true, "no_prefetch" : false, "asm" : "off", "affine_to_cpu" : 1 },
@Balzhur 22 cores doing 2600 H/s is ~118 H/s per core - still much better than Intel/AMD can possibly do. Even AMD Ryzen @ 4 GHz does only 94 H/s per core.
@Balzhur The new algorithm was designed to remove the advantage of any hardware with fast on-chip memory. No surprise that it reduced an unbelievably fast 195 H/s per core to 118 H/s per core on Power8.
@SChernykh, well... before those same cores were doing 4300 for the same algo. And even before they were doing around 6000 for pre-fork monero, but that's a different story cause algo did change.
You should not compare Power to Intel/AMD without considering prices... Power is way way way more expensive.
@Balzhur I'm just trying to explain why Power got hit so hard by this change.
@SChernykh, yeah, I understand, but it's sad :(
@SChernykh, after the switch to V8 (nicehash) performance went down further to around 1587 H/s which is 72 H/s per core :(
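The per-core figures quoted in this exchange follow from dividing the reported totals by the 22 cores of the Power8 box. A small Python sketch of that arithmetic, using only numbers from the comments above (the dictionary labels are descriptive names chosen for this example):

```python
CORES = 22  # cores on the Power8 machine discussed above

# Total hashrates in H/s as reported in the thread.
reported = {
    "Monero7 before the 2.5.0 port": 4300,
    "Monero7 with the 2.5.0 port": 2600,
    "V8 on Nicehash": 1587,
}

per_core = {label: total / CORES for label, total in reported.items()}
for label, rate in per_core.items():
    print(f"{label}: {rate:.0f} H/s per core")
```

This gives about 195, 118 and 72 H/s per core respectively, matching the figures quoted in the comments.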
@Balzhur Try my xmrig fork -- it'll go back up to near CNv7 levels.
@madscientist159, gladly, tried to find it yesterday, but was not able....
@Balzhur sorry about that, link is https://github.com/madscientist159/xmrig
Be sure to read the instructions on performance -- the tuning is a bit different than CNv7. You should get over 100H/s per core.
@madscientist159, don't be sorry, it was not published yesterday :) What do you recommend for SMT and thread number per core? Right now I'm running SMT=2 and two xmr-stak threads per core, so 44 threads for an S824 machine (22 cores available for the Ubuntu LPAR). These numbers provide the best performance for my config with xmr-stak.
PS: let's move to your repo issues cause it's not relevant here anymore...
@Balzhur I was more apologizing for not having it up sooner -- was trying to eke out a few percent higher performance and didn't get it cleaned up and posted as soon as I wanted.
It sounds like your existing tuning might be basically right on track for my xmrig port, actually. I was using a slightly different tuning on CNv7 and for CNv8 my tuning (as mentioned in the README :wink: ) is SMT4 with 2 threads in powersave=2 mode.
EDIT: Actually, give me a sec...that might be wrong. Checking settings again now.
@madscientist159, could you please open "Issues" section for your repo? I'm not able to comment there :)
@Balzhur Yeah, opened. So I'm trying to figure out now why it's choking with all cores when it was working fine before on a smaller machine. Might have to tune a bit further still.
Just as a follow up, I'm seeing horrible numbers on the larger devices now too. I don't know what I did wrong yet, I'll need to go back over the code and see if I can pull out one of the intermediate development versions that was giving better (but not anywhere close to parity) performance.
Bottom line: the Monero developers tuned CNv8 to only work well on Intel and AMD locked processors. A strange choice for a privacy focused coin to say the least, but nothing that can be done about it at this point other than to mine something else.
I still don't get what you guys are on about with who makes what processor and how it can be any less secure as far as the blockchain is concerned. Any links to learn me?
@Spudz76 Not sure how much you know already, so here's an article that goes over things from a higher level perspective: https://chiefio.wordpress.com/2017/02/03/for-deep-security-use-arm-avoid-intel-amd-processors/
He recommends ARM, but a.) most ARM vendors have zero commitment to owner control and b.) ARM isn't really powerful enough to replace Intel / AMD machines for most use cases. That's where POWER comes in, but it seems that CNv8 does a pretty good job of discriminating against the POWER machines :cry:
My question to anyone that thinks this just hits mining is: do you (generic "you") have a separate computer running on a completely different architecture that handles all of your financial transactions on Monero, or are you more likely to use one of your already synced mining rigs? Or a spare machine that isn't mining per se but is still built on these CPUs? Remember that we're still dealing with people using Windows (!) for convenience....
I personally use Raspberry Pi 3 both for blockchain and hot wallet. Works perfect.
@SChernykh How did you handle the built-in wifi and the binary bootloader? Or are those not a concern in your application?
In any case, glad to see someone else that doesn't trust the mining hardware with their wallet :+1:
EDIT: This does bring up an interesting question though: If it is understood that the mining hardware is not safe for many uses outside of mining, how is it (philosophically) any different than requiring people to purchase an ASIC to mine? Part of the appeal of CPU / GPU mining was that you had access to powerful computing hardware that you could use for other things when not mining; locking mining to select CPUs and GPUs that I would never buy otherwise feels more like using an ASIC than before.
@madscientist159 I don't use wifi on it, it's connected via Ethernet cable. Wifi can be hacked.
Edit: there is no problem in using mining hardware for other things. You just need mining software and your everyday stuff on separate bootable disks.
@SChernykh The WiFi hardware is still present, and you have a binary bootloader that has access to it. How do you know it's not partially active? (FWIW on RPi v3 boards on this side, we literally remove the board-level antennas per policy due to that issue).
On the latter point, I don't want the everyday stuff running on a ME/PSP enabled machine. That's the whole problem here; we can no longer reuse the modern CPUs that we are using for other tasks to mine in the off hours. :smile:
Wow, sounds more like the whole "terrorists everywhere" panic has spread to digital landscapes too.
Literally avoiding a fly's sneeze among hurricanes.
@Spudz76 So you're OK with the fact that your system vendor has every right to control what you can and cannot do with the machine, and likely has left some nice ~~back doors~~ open bugs in place independent of the OS for which they have zero liability? I know some people are fine with this, as they have determined inexpensive hardware is more important for their use case than data security, but I also know that many security-conscious folks find this to be a major problem. Of course if one is on Windows anyway, then I agree -- why worry about the small back door when the main barn door is wide open... :wink:
Some additional reading of practical consequences of the AMD / Intel DRM-focused CPU design: https://www.theregister.co.uk/2018/08/29/intel_jtag_flaw/ https://www.blackhat.com/docs/eu-17/materials/eu-17-Goryachy-How-To-Hack-A-Turned-Off-Computer-Or-Running-Unsigned-Code-In-Intel-Management-Engine-wp.pdf https://www.tomshardware.com/news/intel-me-cpu-undocumented-manufacturing-mode,37883.html https://hardware.slashdot.org/story/18/01/07/2015226/after-intel-me-researchers-find-security-bug-in-amds-sps-secret-chip-on-chip https://thehackernews.com/2015/08/lenovo-rootkit-malware.html https://www.welivesecurity.com/2017/10/19/malware-firmware-exploit-sense-security/
Perfectly fine with it as a possibility because they can literally not get to the network because firewall.
Might as well remove the main security hole, and disconnect the power cord. That increases security to 100% whereas all this paranoia does nothing but scare people. Note 100% secure also means 100% useless. Always a trade-off and 0.01% chances are not worth worrying about. Exactly same as the PATRIOT act was in the first place, a bad idea against a highly unlikely event fueled by fear. What-if is not a game anyone should bother playing, you always lose. Remember, terrorism works better the more scared you are and the less anyone actually needs to attack anything - the WHAT-IF they DO is the terror, not people dying. They won the second we did the PATRIOT act...
The ME is not for spying and even if it were, nobody is in mine. Unless I also have a Huawei router with Chinese holes in it, too, but I don't. You would have to get backdoored like 4 ways and have them all line up, to have any actual leakage. Most of my miners have IPMI on top of ME and all that other tech, absolutely unconcerned 110%.
I also disable all Spectre and Meltdown mitigations, stay out of my way, I know nothing is sniffing me (or my memory) on my own sandboxed firewalled system that sits there and mines, and disabling side channels is overkill unless you're operating Amazon S3 or similar products (with hostile users galore and actual VM ops going on within the same memory and CPU).
I'm still interested in PPC64 getting improvements for the new algo, though. I agree on supporting your mission, just not that this avoidance of Intel/AMD is worth anything but tail-chasing.
@Spudz76 For what it's worth I did give some examples of how this has already been used, including the high profile Lenovo case where the malware was shipped, unremovable, inside the locked / signed UEFI firmware. It's not just about the ME / PSP, it's about the current state of the PC ecosystem that is steadily removing owner control. Part of what the firmware increasingly does these days is restrict what OS and software is able to be run, including a recent case from last year where a new Lenovo machine would only ever run Windows because of firmware checks (no way to add new keys, vendor said it was only allowed to run Windows).
Final thought: Windows already flags miners as malware and tries to block them running. For now Microsoft allows you to override this, but is this complete reliance on vendor largess healthy for cryptocurrency long-term? That's why I bother with owner-controlled computing. :grin:
I see. I heard of some Huawei style sniffer modchips being added to Supermicro boards somewhere between when they left Supermicro and when they get to a customer.
They literally can't stop me from running Win7 though other than by FUD. Heck I was on XP until games and drivers stopped supporting it and every other app started requiring dotNET garbage newer than what M$ allows on XP. I will never ever touch Win10, so I agree on the overreach and not supporting that. But I do not feel insecure, I always find a solution or way around whatever "the man" wants to force. I do what I want.
@Spudz76 On existing hardware, sure, you can't be stopped since the board did that from day one and the older boards don't have the same lockdown as the latest ones (efuses in the ME/PSP). The problem is going forward -- we're at the point where the technology is coming into place on the PC side to offer two different versions of a PC, one that is locked and cheap (where you can't run anything the vendor doesn't want aside from factoring a large cryptographic key), and one that is comparatively expensive and sold for "professional" work only.
The only thing preventing this switch right now is economics, since there are still competitors to the Wintel ecosystem waiting in the wings for just such an overstep. Once those competitors drop away, what does market economics say will happen? (Look at Google and the power of "free stuff" in exchange for lock-in and data mining for an example -- Microsoft has already started down that route.)
Basically, the technical foundation cryptocurrency was built on is shifting. If we all just go along for the ride with (currently) cheap and relatively unlocked x86 PCs, there will be a very rude awakening at some point down the line when (for instance) bans in repressive states are suddenly enforced at the hardware / firmware / OS level. At the very least branching out from the existing x86 duopoly is going to be vital to long term survival of cryptocurrency IMO. :smile:
I have never seen an undefeatable piece of technology, so I say BRING IT, it will be broken, or suffer very horrible adoption/sales rates. The market decides, not Intel or AMD.
Some PCs can run macOS, which was supposed to be impossible too. It takes some decent firmware hacking, but in the end it works, and it is "not allowed" as hard as Apple could not allow it. There will always be workable solutions, unless the manufacturer changes to a rental-only format where they legally retain ownership of your computer and you only borrow it. Or put their firmware in ROM that can not be desoldered? But they won't, because then it's not field upgradable: if they make ANY mistakes in design and release it, the entire product (and all the R&D and tooling investment) is garbage and they have to roll out brand new units with the patched ROM burned in (and then hope they did it perfectly this time). It will never work even if it's where they would like the market to go. It's technically infeasible, mostly thanks to humans not being perfect - there is always some vector someone forgot about, and there always will be.
Nobody is going to go along with a backward move like corporate-owned end-user devices (the Bell System and their original telephone units, or USPS and how they own every mailbox regardless of whether you bought it yourself at a store, etc.).
Also they already release tech artificially slowly and in tiny increments so that they can profit every time people upgrade. That is worse for everyone than any DRM, we should have 1THz stuff and be colonizing space by now, but instead we intentionally slow down progress so each step can be fully sucked dry of cash.
There is no official support for ppc64el systems. A port was already made for CNv7 (https://github.com/nioroso-x3/xmr-stak) but CNv8 appears to require brand new assembly support routines.