AdamNiederer / faster

SIMD for humans
Mozilla Public License 2.0

Compiling `rust-2018-migration` for `aarch64` and `wasm` #47

Open ralfbiedert opened 6 years ago

ralfbiedert commented 6 years ago

Hi,

I am trying to port a project to aarch64 and wasm using the rust-2018-migration branch. As of today I receive lots of:

2 | use crate::vektor::x86_64::*;
  |                    ^^^^^^ Could not find `x86_64` in `vektor`

Ideally, faster would have fallbacks for not-yet supported architectures. That way I could just write my SIMD code once using the provided API, instead of having two separate implementations.
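A minimal sketch of what such a fallback could look like, assuming a hypothetical helper `add_f32` (not part of faster's API). On x86_64 it uses SSE intrinsics from `std::arch`; on every other target the same signature compiles to a plain scalar loop, so calling code is written once.

```rust
// Hypothetical helper, not faster's API: element-wise addition of two slices.
// The x86_64 version uses SSE, which is part of the x86_64 baseline.
#[cfg(target_arch = "x86_64")]
pub fn add_f32(a: &[f32], b: &[f32], out: &mut [f32]) {
    use std::arch::x86_64::{_mm_add_ps, _mm_loadu_ps, _mm_storeu_ps};

    let n = a.len().min(b.len()).min(out.len());
    let chunks = n / 4;
    for i in 0..chunks {
        let off = i * 4;
        // SAFETY: SSE is always available on x86_64, and off + 4 <= n keeps
        // every unaligned load and store inside the three slices.
        unsafe {
            let va = _mm_loadu_ps(a.as_ptr().add(off));
            let vb = _mm_loadu_ps(b.as_ptr().add(off));
            _mm_storeu_ps(out.as_mut_ptr().add(off), _mm_add_ps(va, vb));
        }
    }
    // Scalar tail for the last n % 4 elements.
    for i in chunks * 4..n {
        out[i] = a[i] + b[i];
    }
}

// Fallback with the same signature for every other architecture.
#[cfg(not(target_arch = "x86_64"))]
pub fn add_f32(a: &[f32], b: &[f32], out: &mut [f32]) {
    for ((o, x), y) in out.iter_mut().zip(a).zip(b) {
        *o = x + y;
    }
}
```

Callers only ever see `add_f32`; which body gets compiled is decided by the target.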

Do you have any short-term plans of making such a fallback available for the 2018 version?

Also, while I am not a Rust expert, I have 1 - 2 days to look into this myself. If you think it's feasible to outline a solution you prefer, I'd be happy to try to help you out.

AdamNiederer commented 6 years ago

Hey, thanks for taking the initiative. Porting it is definitely going to be an effort, but doing the first port should make subsequent architectures much easier.

There are a few things we need to do to get faster running (with SIMD) on aarch64:

To get it working without fallbacks, we would need to:

Thankfully, I think NEON's SIMD API is a little more sane than Intel's, so this shouldn't take as much effort as SSE/AVX did.

ralfbiedert commented 6 years ago

After changing a few lines in vektor-gen[1] it seems to have generated wrappers for aarch64 and friends [2].

I am now looking at faster again, and there are quite a few imports of crate::vektor::x86_64. My feeling is adding more architectures / fallbacks into the current structure could make things messy.

Have you thought about structuring faster internally by architectures?

I was considering adding an arch folder with subfolders for all platform-dependent things. That will affect things in intrin/, but might also change top-level things like stride (pretty much everything that imports crate::vektor::x86_64 today).
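One way such a layout could look, as a rough sketch only. The module names (`arch`, `x86`, `unknown`, `current`) and the function `sum_f32` are hypothetical and do not reflect the structure that ended up in the branch. Each backend exposes the same function, and arch-independent code imports only the `current` alias.

```rust
// Hypothetical layout; all names here are illustrative only.
pub mod arch {
    #[cfg(target_arch = "x86_64")]
    pub mod x86 {
        /// SSE-based sum (SSE is part of the x86_64 baseline).
        pub fn sum_f32(xs: &[f32]) -> f32 {
            use std::arch::x86_64::{_mm_add_ps, _mm_loadu_ps, _mm_setzero_ps, _mm_storeu_ps};
            let mut chunks = xs.chunks_exact(4);
            let mut lanes = [0.0f32; 4];
            // SAFETY: SSE is always available on x86_64; every chunk holds
            // exactly 4 floats and `lanes` has room for 4.
            unsafe {
                let mut acc = _mm_setzero_ps();
                for c in &mut chunks {
                    acc = _mm_add_ps(acc, _mm_loadu_ps(c.as_ptr()));
                }
                _mm_storeu_ps(lanes.as_mut_ptr(), acc);
            }
            lanes.iter().sum::<f32>() + chunks.remainder().iter().sum::<f32>()
        }
    }

    /// Portable backend that compiles on any target.
    pub mod unknown {
        pub fn sum_f32(xs: &[f32]) -> f32 {
            xs.iter().sum()
        }
    }

    // Select the backend once; the rest of the crate never names x86 directly.
    #[cfg(target_arch = "x86_64")]
    pub use self::x86 as current;
    #[cfg(not(target_arch = "x86_64"))]
    pub use self::unknown as current;
}

/// Arch-independent code depends only on `arch::current`.
pub fn total(xs: &[f32]) -> f32 {
    arch::current::sum_f32(xs)
}
```

With this shape, adding a new architecture means adding one submodule and one `pub use` line rather than touching every file that imports intrinsics.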

I haven't started anything, so I don't know if there are roadblocks, but I wanted to check with you first since it might make the code look a bit different.

[1] https://github.com/ralfbiedert/vektor-gen
[2] https://github.com/ralfbiedert/vektor/tree/more_archs/src

AdamNiederer commented 6 years ago

Yeah, I've been meaning to restructure the intrinsic wrappers. We should be able to define the core parts of faster (iters, stride, zip, into_iters) around wrapped intrinsics, and move the arch-specific stuff like intrin/ and vec_patterns into an arch/ folder.

That also makes working with runtime feature detection a little easier, as we can just make each function generic over any type which implements Packed (or something similar - haven't gotten that far with the runtime detection yet).
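For readers unfamiliar with runtime feature detection, here is a small sketch using the standard library's `is_x86_feature_detected!` macro. It is not how faster implements (or planned to implement) detection; the function names `sum_i32`, `sum_i32_avx2`, and `sum_i32_scalar` are made up for illustration.

```rust
/// Hypothetical entry point: picks an implementation at run time.
pub fn sum_i32(xs: &[i32]) -> i32 {
    #[cfg(target_arch = "x86_64")]
    {
        if is_x86_feature_detected!("avx2") {
            // SAFETY: the check above guarantees AVX2 is available.
            return unsafe { sum_i32_avx2(xs) };
        }
    }
    sum_i32_scalar(xs)
}

/// Portable version, also used when AVX2 is missing.
fn sum_i32_scalar(xs: &[i32]) -> i32 {
    xs.iter().fold(0i32, |s, &x| s.wrapping_add(x))
}

#[cfg(target_arch = "x86_64")]
#[target_feature(enable = "avx2")]
unsafe fn sum_i32_avx2(xs: &[i32]) -> i32 {
    use std::arch::x86_64::{
        __m256i, _mm256_add_epi32, _mm256_loadu_si256, _mm256_setzero_si256,
        _mm256_storeu_si256,
    };
    let mut chunks = xs.chunks_exact(8);
    let mut lanes = [0i32; 8];
    // SAFETY: the caller guarantees AVX2; each chunk holds exactly 8 i32
    // values and `lanes` has room for 8, so the unaligned accesses are in bounds.
    unsafe {
        let mut acc = _mm256_setzero_si256();
        for c in &mut chunks {
            let v = _mm256_loadu_si256(c.as_ptr() as *const __m256i);
            acc = _mm256_add_epi32(acc, v); // wrapping lane-wise add
        }
        _mm256_storeu_si256(lanes.as_mut_ptr() as *mut __m256i, acc);
    }
    // Combine the 8 partial sums and the leftover tail elements.
    lanes
        .iter()
        .chain(chunks.remainder())
        .fold(0i32, |s, &x| s.wrapping_add(x))
}
```

Making functions generic over a vector type, as suggested above, would let the same body be instantiated once per feature level instead of being written by hand as it is here.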

Edit: One quick thing about the changes to vektor: That library acts as an adapter between the generic types in std::simd and the arch-specific types in std::arch, so you may encounter issues using it unless you rewrite the types in your aarch64 function stubs. The args/return types should look like u16x8 rather than uint16x8_t.
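To illustrate the adapter idea on stable Rust, the sketch below uses a hypothetical portable type `U16x8` in place of the generic `u16x8`, and a hypothetical wrapper `add_u16x8`. The public signature mentions only the portable type; the arch-specific `uint16x8_t` appears solely inside the aarch64 body. This is not vektor's generated code.

```rust
/// Hypothetical portable vector type standing in for a generic `u16x8`.
#[derive(Clone, Copy, Debug, PartialEq)]
pub struct U16x8(pub [u16; 8]);

/// aarch64 wrapper: converts to and from NEON's `uint16x8_t` internally.
#[cfg(target_arch = "aarch64")]
pub fn add_u16x8(a: U16x8, b: U16x8) -> U16x8 {
    use std::arch::aarch64::{vaddq_u16, vld1q_u16, vst1q_u16};
    let mut out = [0u16; 8];
    // SAFETY: NEON is part of the baseline for Rust's aarch64 targets, and
    // each array holds exactly the 8 lanes that are loaded or stored.
    unsafe {
        let r = vaddq_u16(vld1q_u16(a.0.as_ptr()), vld1q_u16(b.0.as_ptr()));
        vst1q_u16(out.as_mut_ptr(), r);
    }
    U16x8(out)
}

/// Same signature elsewhere, implemented with a scalar loop.
#[cfg(not(target_arch = "aarch64"))]
pub fn add_u16x8(a: U16x8, b: U16x8) -> U16x8 {
    let mut out = [0u16; 8];
    for i in 0..8 {
        // Wrapping add matches the lane-wise behavior of the NEON instruction.
        out[i] = a.0[i].wrapping_add(b.0[i]);
    }
    U16x8(out)
}
```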

ralfbiedert commented 6 years ago

Great! How do you want to handle this?

I don't mind trying something that gets thrown away if it doesn't fly. However, if you are working on this already (and / or runtime detection) it's probably much better if you do the structure.

Both ways are totally fine with me.

AdamNiederer commented 6 years ago

I'm not super far into runtime detection and most of my changes are within the arch-independent code, so it should be pretty adaptable to whatever you come up with. Feel free to rip it up as you see fit; I don't think we'll diverge much.

ralfbiedert commented 6 years ago

I made some changes now, up for discussion:

https://github.com/ralfbiedert/faster/tree/more_archs

Update: Hang on, I just realized the shim idea might not fly as I thought, because it might get tricky preventing double trait impl. What I am looking for is being able to have "default" implementations for all intrin and easily add a more optimized one for a given architecture / feature set. Will investigate further ...
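One pattern that sidesteps the double-impl problem on stable Rust, sketched under assumptions: put the portable version in a default trait method and override it with a `#[cfg]`-gated method inside the single impl for each type. The names `F32x4`, `PackedSqrt`, and `sqrt_all` are hypothetical, and this is not necessarily the approach the branch settled on.

```rust
/// Hypothetical 4-lane vector type.
#[derive(Clone, Copy, Debug, PartialEq)]
pub struct F32x4(pub [f32; 4]);

pub trait PackedSqrt: Sized {
    fn lanes(self) -> [f32; 4];
    fn from_lanes(lanes: [f32; 4]) -> Self;

    /// Default implementation: portable, works on any architecture.
    fn sqrt_all(self) -> Self {
        let mut l = self.lanes();
        for x in l.iter_mut() {
            *x = x.sqrt();
        }
        Self::from_lanes(l)
    }
}

// Exactly one impl per type, so there is no conflicting-impl error.
impl PackedSqrt for F32x4 {
    fn lanes(self) -> [f32; 4] {
        self.0
    }

    fn from_lanes(lanes: [f32; 4]) -> Self {
        F32x4(lanes)
    }

    /// Optimized override, compiled only on x86_64. On other targets this
    /// method does not exist and the trait default is used instead.
    #[cfg(target_arch = "x86_64")]
    fn sqrt_all(self) -> Self {
        use std::arch::x86_64::{_mm_loadu_ps, _mm_sqrt_ps, _mm_storeu_ps};
        let mut out = [0.0f32; 4];
        // SAFETY: SSE is always available on x86_64; both arrays hold 4 floats.
        unsafe {
            _mm_storeu_ps(out.as_mut_ptr(), _mm_sqrt_ps(_mm_loadu_ps(self.0.as_ptr())));
        }
        F32x4(out)
    }
}
```

The trade-off is that every override lives inside the type's one impl block, so finer-grained selection (per feature set rather than per architecture) needs additional `cfg` conditions or runtime dispatch.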

ralfbiedert commented 6 years ago

Alright, I now have a version that compiles and "mostly works" for x86 and unknown architectures. The latter should compile for any architecture (endianness aside, I haven't thought too much about that) and can serve as a fallback for anything not supported.

https://github.com/ralfbiedert/faster/tree/more_archs

More changes:

What I am planning to do next:

AdamNiederer commented 6 years ago

Awesome, thank you so much for all of the hard work! The way this is laid out should blend nicely with runtime detection and user-defined SIMD types, too.

AdamNiederer commented 6 years ago

I've merged in all of the changes and made a few quick formatting fixes. It looks like the tests are good and the perf is unchanged.