Open fxcoudert opened 3 years ago
Thanks for reporting this! I currently have no plans to implement ARM64 assembly (which would require a rewrite of the existing ARM assembly) so for now I'll try to disable the ARM code for ARM64.
Unfortunately I have no way to try this on Apple, so could you please verify that this works and close the issue if it does?
Cheers!
The target triplet is sometimes called arm64
, but most of the times referred to as aarch64
. I think it would be better to match both.
Yes, it might be, and I would certainly have to add that if I implemented 64-bit ARM code.
But for now the purpose is to just not trigger the A32
code, and aarch64
would not do that.
At first I tried to create a pattern that matched arm* AND NOT arm64*
but couldn't figure it out, so this hack would have to do for now.
Well, the plot thickens!
I just got a report that an Apple Silicon build failed because the host was reported as arm-apple-darwin20.2.0
. That same system had reported aarch64-apple-darwin20.2.0
previously and of course that worked (and always should have worked because there's no arm
in it at all).
@fxcoudert You obviously had a failure and posted that your host was aarch64-apple-darwin20
, but actually that should not have triggered the ARM assembly code. But then I assume my fix to ignore arm*
if arm64*
matched fixed your problem, right?
In any event, if the host CPU on Apple Silicon can sometimes be simply arm
and sometimes arm64
and sometimes aarch64
, and, who knows, maybe sometimes armv64
or armv8
, then I'm going to have to make my ARM assembly code trigger on NOT arm*64* AND armv6* OR armv7*
.
Being too picky on asm choices is better than being too loose because in the worst case we still build, and these days processors are so much faster than when I created that asm code that it probably makes no practical difference.
@dbry I stated what the usual/canonical triplet was for the architecture, i.e., what an up-to-date config.guess
will report. The config.guess
in wavpack 5.4.0 is dated 2018-02-24, so it is outdated (and the config.sub
too). If you update those config files, the arch triplet should be detected in a systematic manner.
We have fixed the issue in Homebrew by explicitly passing --disable-asm
to configure
.
Ah, that explains some mysteries! I wish I’d known I was going down the wrong path before my release, but unfortunately the whole autotools
component is a black box to me and other than a few trivial (and probably wrong) tweaks to configure.ac
, I rely on the kindness of strangers.
The latest autotools-dev
for my Debian-based distribution is 20180224
. This page suggests that just getting the new versions of config.guess
and config.sub
and putting them in the tarball is fine (or maybe into /usr/share/misc
). But this page seems to imply that it’s not a good idea, but most of that page is Latin to me and it also seems it might be a little too targeted at Debian packages.
Anyway, I put --disable-asm
in there for a reason.
Just update the config.guess and config.sub files manually at https://www.gnu.org/software/gettext/manual/html_node/config_002eguess.html says, it's good.
If anyone kept the old MMX intrinsic code for x86 (or any intrinsic code), there might be a way to use them on a couple other architectures with https://github.com/simd-everywhere/simde. The mmx.h implementation has redirections to Neon and Loongson MMI (in some ways a carbon copy of MMX on MIPS).
[oh no i am now curious about whether the C compiler will unroll and make its own simd]
The answer is no. I grabbed the case 17
loop from decorr_mono_pass
and fed into godbolt. Neither GCC nor Clang wants to vectorize with #pragma omp simd
+ -fopenmp-simd
. GCC says dpp->samples_A [0] = apply_weight (weight_A, sam_A) + bptr [0];
line is the problem. Yeah, loop-carried dependency right there...
I guess that would be why the assembly functions don't use the dpp
samples. There's also always a dependency on weight_A
and weight_B
for each iteration, which would explain the use of MMX and all the optimization being stereo-only -- you really could not go beyond one-pair-at-once.
In other news, asm isn't doing much in my very unscientific test -- bash time
on the fast-tests
. The user
CPU time variance is everywhere so don't take it too seriously anyways.
If this is still an issue can something like this be added to your CMake configuration?
In other news, asm isn't doing much in my very unscientific test -- bash
time
on thefast-tests
. Theuser
CPU time variance is everywhere so don't take it too seriously anyways.times
upstream real 4m39.633s user 3m29.828s sys 0m2.749s real 4m39.962s user 3m48.265s sys 0m1.640s simd no-asm real 4m58.920s user 3m26.483s sys 0m2.515s real 4m57.772s user 3m44.765s sys 0m2.250s ups no-asm real 5m1.402s user 3m47.125s sys 0m2.827s real 4m57.656s user 4m14.796s sys 0m1.937s
- "upstream" is the current HEAD.
- "simd" is a private commit that adds #pragma omp simd to the two functions with an assembly version, on each case xx: for loop.
- no-asm is --disable-asm, so no MMX.
Ryzen 5 3600, auto-boosted to 3.99 GHz. MSYS2 (not the most performant environment), gcc 11.3.0, default generic flags besides -fopenmp-simd.
Thanks for this post and your investigation...sorry I didn't respond earlier but I kept forgetting it was here!
The funny thing is that I recently switched from an Intel-based laptop to a Ryzen machine (HP Dev One) and the advantage of the MMX code seems to have completely disappeared, and this is exactly what you are seeing! It usd to provide a decent bump, but I don't have the expertise (or motivation) to figure out what's going on. Perhaps moving the asm to something newer than MMX might help, but like you point out, the only thing that can be SIMD'd is the stereo pairs. Such is the nature of codecs that adapt every sample.
Of course, what does make a huge improvement is the multi-threading I added recently, and that's the direction I'm more inclined to pursue in the future.
Thanks again, and sorry for being so unresponsive! :smile:
If this is still an issue can something like this be added to your CMake configuration?
I don't know. Is this still a problem? I was under the impression that with the latest changes and new config.guess
fixes everything (i.e., disables ASM when it needs to). @fxcoudert ?
We are not updating often submodules and IDK if CMake is working together with Autoconf in your project, we had issues with our checkout and solved that way our side, so I saw this issue still open and asked :)
Ah, well the issue is still open because I'm lazy about closing them and hope the OP will do so. Thanks for posting and unless I here otherwise in the near future that this is still an issue, I'll close it myself. Cheers!
wavpack 5.3.0 fails to compile on Apple Silicon (aarch64-apple-darwin20) because the ARM assembly is not recognised by the system compiler (Apple's clang 12):