ralfbiedert opened this issue 6 years ago
Hey, thanks for taking the initiative. Porting it is definitely going to be an effort, but doing the first port should make subsequent architectures much easier.
There are a few things we need to do to get `faster` running (with SIMD) on aarch64. To get it working without fallbacks, we would need to `#[cfg(target_arch = ...)]`-gate all of the `vektor::x86...` imports, and possibly gate all of the `target_feature` attributes as well (can probably be done with sed).
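Presumably that gating would look something like the following sketch (the `vektor::aarch64` module name and the stub layout are assumptions, not faster's actual code):

```rust
// Standalone sketch of the gating idea; the module contents are empty stubs
// standing in for the generated vektor wrappers, and the aarch64 module name
// is an assumption.
mod vektor {
    #[cfg(target_arch = "x86_64")]
    pub mod x86_64 {}
    #[cfg(target_arch = "aarch64")]
    pub mod aarch64 {}
}

#[cfg(target_arch = "x86_64")]
#[allow(unused_imports)]
use crate::vektor::x86_64 as simd_arch;
#[cfg(target_arch = "aarch64")]
#[allow(unused_imports)]
use crate::vektor::aarch64 as simd_arch;

// target_feature attributes would be gated the same way:
#[cfg(target_arch = "x86_64")]
#[target_feature(enable = "avx2")]
#[allow(dead_code)]
unsafe fn avx2_kernel() { /* x86-only fast path */ }
```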
Thankfully, I think NEON's SIMD API is a little more sane than Intel's, so this shouldn't take as much effort as SSE/AVX did.
After changing a few lines in `vektor-gen` [1], it seems to have generated wrappers for aarch64 and friends [2].
I am now looking at `faster` again, and there are quite a few imports of `crate::vektor::x86_64`. My feeling is that adding more architectures / fallbacks into the current structure could make things messy.
Have you thought about structuring `faster` internally by architecture?
I was considering adding an `arch` folder and subfolders for all platform-dependent things. That will affect things in `intrin/`, but might also change top-level things like `stride` (pretty much everything that imports `crate::vektor::x86_64` today).
I haven't started anything, so I don't know if there are roadblocks, but I wanted to check with you first since it might make the code look a bit different.
[1] https://github.com/ralfbiedert/vektor-gen
[2] https://github.com/ralfbiedert/vektor/tree/more_archs/src
Yeah, I've been meaning to restructure the intrinsic wrappers. We should be able to define the core parts of `faster` (`iters`, `stride`, `zip`, `into_iters`) around wrapped intrinsics in an `arch/` folder, with arch-specific stuff like `intrin/` and `vec_patterns`.
That also makes working with runtime feature detection a little easier, as we can just make each function generic over any type which implements `Packed` (or something similar; I haven't gotten that far with the runtime detection yet).
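As a rough illustration of the "generic over Packed" idea (the trait here is a hypothetical stand-in, not faster's actual `Packed` trait):

```rust
// Hypothetical stand-in for a Packed-style trait; faster's real trait has a
// much richer API. Shown only to illustrate writing a kernel once, generically,
// so the concrete vector type can be chosen per architecture or at runtime.
pub trait PackedLike: Copy {
    const WIDTH: usize;                 // lanes per vector
    fn splat(x: f32) -> Self;           // broadcast a scalar into all lanes
    fn add(self, other: Self) -> Self;  // lane-wise addition
}

// A kernel written once, usable with any vector type implementing the trait.
fn offset<P: PackedLike>(v: P, by: f32) -> P {
    v.add(P::splat(by))
}
```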
Edit: One quick thing about the changes to `vektor`: that library acts as an adapter between the generic types in `std::simd` and the arch-specific types in `std::arch`, so you may encounter issues using it unless you rewrite the types in your aarch64 function stubs. The args/return types should look like `u16x8` rather than `uint16x8_t`.
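A minimal sketch of what that means for an aarch64 stub, assuming a `u16x8` stand-in type and a transmute-based conversion (this is illustrative only, not vektor's generated code):

```rust
// Illustrative only (not vektor's actual codegen): the wrapper exposes the
// generic `u16x8` at its API surface and touches the arch-specific
// `uint16x8_t` only at the intrinsic boundary.
#[cfg(target_arch = "aarch64")]
mod neon_stub {
    use core::arch::aarch64::{self, uint16x8_t};
    use core::mem::transmute;

    // Stand-in for the portable `u16x8` from std::simd.
    #[allow(non_camel_case_types)]
    #[derive(Copy, Clone)]
    #[repr(C)]
    pub struct u16x8(pub [u16; 8]);

    #[target_feature(enable = "neon")]
    pub unsafe fn vaddq_u16(a: u16x8, b: u16x8) -> u16x8 {
        let raw: uint16x8_t = aarch64::vaddq_u16(transmute(a), transmute(b));
        transmute(raw)
    }
}
```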
Great! How do you want to handle this?
I don't mind trying something that gets thrown away if it doesn't fly. However, if you are working on this already (and / or runtime detection) it's probably much better if you do the structure.
Both ways are totally fine with me.
I'm not super far into runtime detection and most of my changes are within the arch-independent code, so it should be pretty adaptable to whatever you come up with. Feel free to rip it up as you see fit; I don't think we'll diverge much.
I made some changes now, up for discussion:
https://github.com/ralfbiedert/faster/tree/more_archs
- The code is now split into `arch` and `intrin`. `arch` contains implementation details for each architecture. Currently it holds `x86` and `shim`; later `aarch64`, ... can be added.
- `shim` should contain "default implementations". This is not done yet and requires splitting up lots of `impl` blocks and macros. The idea would be that `shim/intrin` provides a full implementation of `intrin`, while other architectures can implement / override according to their instruction set.
- `intrin` now contains just traits; all implementations were moved out.
- Some top-level `src/` items changed as well (e.g., `vecs.rs`). I haven't touched `vec_patterns.*` yet.
- `preludes` are assembled to minimize "surprising" imports (at least surprising to me). You can use `faster::*`, `faster::prelude::*`, or manually import `faster::x::y`. However, I haven't fully mapped the latter to the new layout, and some modules suddenly stop being `pub`. I don't have too strong opinions right now, but I have the vague feeling there are "too many roads to Rome". Maybe that's something that should be addressed separately after the whole refactoring is done.

Update: Hang on, I just realized the `shim` idea might not fly as I thought, because it might get tricky to prevent double trait impls. What I am looking for is being able to have "default" implementations for all `intrin` traits and easily add a more optimized one for a given architecture / feature set. Will investigate further ...
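To make the double-impl concern concrete, here is a hedged sketch with a hypothetical trait and type (not faster's real `intrin` traits): a blanket "default" impl and a per-architecture impl for the same type would be rejected as conflicting on stable Rust, so the fallback ends up cfg-gated instead.

```rust
// Hypothetical trait and type (not faster's actual API), sketching the
// "default impl + per-architecture override" tension: without the cfg gates,
// the two impl blocks below would be conflicting implementations.
pub trait AddIntrin {
    fn add_elems(self, other: Self) -> Self;
}

#[derive(Copy, Clone)]
pub struct U16x8(pub [u16; 8]);

// Portable fallback, compiled only where no specialized impl exists.
#[cfg(not(target_arch = "x86_64"))]
impl AddIntrin for U16x8 {
    fn add_elems(self, other: Self) -> Self {
        let mut out = self.0;
        for (o, b) in out.iter_mut().zip(other.0.iter()) {
            *o = o.wrapping_add(*b);
        }
        U16x8(out)
    }
}

// Arch-specific override; only one of the two impls is ever compiled in.
#[cfg(target_arch = "x86_64")]
impl AddIntrin for U16x8 {
    fn add_elems(self, other: Self) -> Self {
        use core::arch::x86_64::{__m128i, _mm_add_epi16};
        use core::mem::transmute;
        unsafe {
            let a: __m128i = transmute(self);
            let b: __m128i = transmute(other);
            transmute::<__m128i, U16x8>(_mm_add_epi16(a, b))
        }
    }
}
```

Gating per type and architecture like this avoids needing specialization, at the cost of some `cfg` noise wherever an override exists.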
Alright, I now have a version that compiles and "mostly works" for `x86` and `unknown` architectures. The latter should compile for any architecture (endianness aside; I haven't thought too much about that) and can be a fallback for anything not supported.
https://github.com/ralfbiedert/faster/tree/more_archs
More changes:
- `arch` should not have any architecture dependencies or macros anymore.
- `vec_patterns_gen.py` is more generalized and now works for both `x86` and `unknown`.
- I removed `shim` again since it was getting a bit ugly. All of that is either in `unknown`, or back in `/intrin`.

What I am planning to do next:

- Move the tests from `/arch` back to `/intrin`, and only keep very specific test cases there.
- Start on `aarch64`.

Awesome, thank you so much for all of the hard work! The way this is laid out should blend nicely with runtime detection and user-defined SIMD types, too.
I've merged in all of the changes, and did a few quick formatting fixes. It looks like the tests are good and the perf is unchanged.
Hi,
I am trying to port a project to `aarch64` and `wasm` using the `rust-2018-migration` branch. As of today I receive lots of:

Ideally, `faster` would have fallbacks for not-yet-supported architectures. That way I could just write my SIMD code once using the provided API, instead of having two separate implementations.

Do you have any short-term plans of making such a fallback available for the 2018 version?
Also, while I am not a Rust expert, I have 1-2 days to look into this myself. If you think it's feasible to outline a solution you prefer, I'd be happy to try to help you out.
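For reference, the "write it once" usage meant above looks roughly like this (patterned after faster's README examples; exact method and constructor names may differ between versions):

```rust
// Patterned after the style of faster's README; iterator and constructor
// names (simd_iter, simd_map, scalar_collect, f32s) may differ by version.
use faster::*;

fn add_scalar(xs: &[f32], y: f32) -> Vec<f32> {
    // The intent: the same code path runs with AVX2, SSE, NEON, or a scalar
    // fallback, depending on what the target supports.
    xs.simd_iter(f32s(0.0))
        .simd_map(|v| v + f32s(y))
        .scalar_collect()
}
```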