ralfbiedert opened this issue 6 years ago
Hey, thanks for taking the initiative. Porting it is definitely going to be an effort, but doing the first port should make subsequent architectures much easier.
There are a few things we need to do to get `faster` running (with SIMD) on aarch64. To get it working without fallbacks, we would need to `#[cfg(target_arch = ...)]`-gate all of the `vektor::x86...` imports, and possibly gate all of the `target_feature` attributes as well (can probably be done with sed).
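Presumably that gating would look something like the following sketch (the `vektor::aarch64` module name and the stub layout are assumptions, not faster's actual code):

```rust
// Standalone sketch of the gating idea; the module contents are empty stubs
// standing in for the generated vektor wrappers, and the aarch64 module name
// is an assumption.
mod vektor {
    #[cfg(target_arch = "x86_64")]
    pub mod x86_64 {}
    #[cfg(target_arch = "aarch64")]
    pub mod aarch64 {}
}

#[cfg(target_arch = "x86_64")]
#[allow(unused_imports)]
use crate::vektor::x86_64 as simd_arch;
#[cfg(target_arch = "aarch64")]
#[allow(unused_imports)]
use crate::vektor::aarch64 as simd_arch;

// target_feature attributes would be gated the same way:
#[cfg(target_arch = "x86_64")]
#[target_feature(enable = "avx2")]
#[allow(dead_code)]
unsafe fn avx2_kernel() { /* x86-only fast path */ }
```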
Thankfully, I think NEON's SIMD API is a little more sane than Intel's, so this shouldn't take as much effort as SSE/AVX did.
After changing a few lines in `vektor-gen` [1], it seems to have generated wrappers for aarch64 and friends [2].
I am now looking at `faster` again, and there are quite a few imports of `crate::vektor::x86_64`. My feeling is that adding more architectures / fallbacks into the current structure could make things messy.
Have you thought about structuring `faster` internally by architecture?
I was considering adding an `arch` folder and subfolders for all platform-dependent things. That will affect things in `intrin/`, but might also change top-level things like `stride` (pretty much everything that imports `crate::vektor::x86_64` today).
I haven't started anything, so I don't know if there are roadblocks, but I wanted to check with you first since it might make the code look a bit different.
[1] https://github.com/ralfbiedert/vektor-gen
[2] https://github.com/ralfbiedert/vektor/tree/more_archs/src
Yeah, I've been meaning to restructure the intrinsic wrappers. We should be able to define the core parts of `faster` (`iters`, `stride`, `zip`, `into_iters`) around wrapped intrinsics in an `arch/` folder, with arch-specific stuff like `intrin/` and `vec_patterns`.
That also makes working with runtime feature detection a little easier, as we can just make each function generic over any type which implements `Packed` (or something similar; I haven't gotten that far with the runtime detection yet).
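As a rough illustration of the "generic over Packed" idea (the trait here is a hypothetical stand-in, not faster's actual `Packed` trait):

```rust
// Hypothetical stand-in for a Packed-style trait; faster's real trait has a
// much richer API. Shown only to illustrate writing a kernel once, generically,
// so the concrete vector type can be chosen per architecture or at runtime.
pub trait PackedLike: Copy {
    const WIDTH: usize;                 // lanes per vector
    fn splat(x: f32) -> Self;           // broadcast a scalar into all lanes
    fn add(self, other: Self) -> Self;  // lane-wise addition
}

// A kernel written once, usable with any vector type implementing the trait.
fn offset<P: PackedLike>(v: P, by: f32) -> P {
    v.add(P::splat(by))
}
```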
Edit: One quick thing about the changes to `vektor`: that library acts as an adapter between the generic types in `std::simd` and the arch-specific types in `std::arch`, so you may encounter issues using it unless you rewrite the types in your aarch64 function stubs. The args/return types should look like `u16x8` rather than `uint16x8_t`.
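A minimal sketch of what that means for an aarch64 stub, assuming a `u16x8` stand-in type and a transmute-based conversion (this is illustrative only, not vektor's generated code):

```rust
// Illustrative only (not vektor's actual codegen): the wrapper exposes the
// generic `u16x8` at its API surface and touches the arch-specific
// `uint16x8_t` only at the intrinsic boundary.
#[cfg(target_arch = "aarch64")]
mod neon_stub {
    use core::arch::aarch64::{self, uint16x8_t};
    use core::mem::transmute;

    // Stand-in for the portable `u16x8` from std::simd.
    #[allow(non_camel_case_types)]
    #[derive(Copy, Clone)]
    #[repr(C)]
    pub struct u16x8(pub [u16; 8]);

    #[target_feature(enable = "neon")]
    pub unsafe fn vaddq_u16(a: u16x8, b: u16x8) -> u16x8 {
        let raw: uint16x8_t = aarch64::vaddq_u16(transmute(a), transmute(b));
        transmute(raw)
    }
}
```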
Great! How do you want to handle this?
I don't mind trying something that gets thrown away if it doesn't fly. However, if you are working on this already (and / or runtime detection) it's probably much better if you do the structure.
Both ways are totally fine with me.
I'm not super far into runtime detection and most of my changes are within the arch-independent code, so it should be pretty adaptable to whatever you come up with. Feel free to rip it up as you see fit; I don't think we'll diverge much.
I made some changes now, up for discussion:
https://github.com/ralfbiedert/faster/tree/more_archs
- The code is now split into `arch` and `intrin`. `arch` contains implementation details for each architecture. Currently it holds `x86` and `shim`; later `aarch64`, ... can be added.
- `shim` should contain "default implementations". This is not done yet and requires splitting up lots of `impl` blocks and macros. The idea would be that `shim/intrin` provides a full implementation of `intrin`, while other architectures can implement / override according to their instruction set.
- `intrin` now contains just traits; all implementations were moved out.
- Some top-level `src/` items changed as well (e.g., `vecs.rs`). I haven't touched `vec_patterns.*` yet.
- `preludes` are assembled to minimize "surprising" imports (at least surprising to me). You can use `faster::*`, `faster::prelude::*`, or manually import `faster::x::y`. However, I haven't fully mapped the latter to the new layout, and some modules suddenly stop being `pub`. I don't have too strong opinions right now, but I have the vague feeling there are "too many roads to Rome". Maybe that's something that should be addressed separately after the whole refactoring is done.

Update: Hang on, I just realized the `shim` idea might not fly as I thought, because it might get tricky to prevent double trait impls. What I am looking for is being able to have "default" implementations for all `intrin` traits and easily add a more optimized one for a given architecture / feature set. Will investigate further ...
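To make the double-impl concern concrete, here is a hedged sketch with a hypothetical trait and type (not faster's real `intrin` traits): a blanket "default" impl and a per-architecture impl for the same type would be rejected as conflicting on stable Rust, so the fallback ends up cfg-gated instead.

```rust
// Hypothetical trait and type (not faster's actual API), sketching the
// "default impl + per-architecture override" tension: without the cfg gates,
// the two impl blocks below would be conflicting implementations.
pub trait AddIntrin {
    fn add_elems(self, other: Self) -> Self;
}

#[derive(Copy, Clone)]
pub struct U16x8(pub [u16; 8]);

// Portable fallback, compiled only where no specialized impl exists.
#[cfg(not(target_arch = "x86_64"))]
impl AddIntrin for U16x8 {
    fn add_elems(self, other: Self) -> Self {
        let mut out = self.0;
        for (o, b) in out.iter_mut().zip(other.0.iter()) {
            *o = o.wrapping_add(*b);
        }
        U16x8(out)
    }
}

// Arch-specific override; only one of the two impls is ever compiled in.
#[cfg(target_arch = "x86_64")]
impl AddIntrin for U16x8 {
    fn add_elems(self, other: Self) -> Self {
        use core::arch::x86_64::{__m128i, _mm_add_epi16};
        use core::mem::transmute;
        unsafe {
            let a: __m128i = transmute(self);
            let b: __m128i = transmute(other);
            transmute::<__m128i, U16x8>(_mm_add_epi16(a, b))
        }
    }
}
```

Gating per type and architecture like this avoids needing specialization, at the cost of some `cfg` noise wherever an override exists.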
Alright, I now have a version that compiles and "mostly works" for `x86` and `unknown` architectures. The latter should compile for any architecture (endianness aside; I haven't thought too much about that) and can be a fallback for anything not supported.
https://github.com/ralfbiedert/faster/tree/more_archs
More changes:
- `arch` should not have any architecture dependencies or macros anymore.
- `vec_patterns_gen.py` is more generalized and now works for both `x86` and `unknown`.
- I removed `shim` again since it was getting a bit ugly. All of that is either in `unknown`, or back in `/intrin`.

What I am planning to do next:

- Move the tests from `/arch` back to `/intrin`, and only keep very specific test cases there.
- Start on `aarch64`.

Awesome, thank you so much for all of the hard work! The way this is laid out should blend nicely with runtime detection and user-defined SIMD types, too.
I've merged in all of the changes, and did a few quick formatting fixes. It looks like the tests are good and the perf is unchanged.
Hi,
I am trying to port a project to `aarch64` and `wasm` using the `rust-2018-migration` branch. As of today I receive lots of:

Ideally, `faster` would have fallbacks for not-yet-supported architectures. That way I could just write my SIMD code once using the provided API, instead of having two separate implementations.

Do you have any short-term plans of making such a fallback available for the 2018 version?
Also, while I am not a Rust expert, I have 1-2 days to look into this myself. If you think it's feasible to outline a solution you prefer, I'd be happy to try to help you out.
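For reference, the "write it once" usage meant above looks roughly like this (patterned after faster's README examples; exact method and constructor names may differ between versions):

```rust
// Patterned after the style of faster's README; iterator and constructor
// names (simd_iter, simd_map, scalar_collect, f32s) may differ by version.
use faster::*;

fn add_scalar(xs: &[f32], y: f32) -> Vec<f32> {
    // The intent: the same code path runs with AVX2, SSE, NEON, or a scalar
    // fallback, depending on what the target supports.
    xs.simd_iter(f32s(0.0))
        .simd_map(|v| v + f32s(y))
        .scalar_collect()
}
```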