AdamNiederer / faster

SIMD for humans
Mozilla Public License 2.0

Faster and std::simd #53

Open ralfbiedert opened 6 years ago

ralfbiedert commented 6 years ago

Opening another ticket since this is a separate discussion from #47 and might be more controversial:

The more I look into the upcoming std::simd, the more I wonder if faster should not become a thinner "SIMD-friendly iteration" library that neatly plugs into std::simd and is really good at handling variable slices, zipping, ... instead of providing a blanket implementation over std::arch.

Right now it seems that many common intrinsics and operations faster provides on packed types are or might be implemented in std::simd (compare coresimd/ppsv).

At the same time, for things that won't be in std::simd (and will be more platform specific), faster will have a hard time providing a consistent performance story anyway.

By that reasoning I see a certain appeal primarily focusing on a more consistent cross-platform experience with a much lighter code base (e.g., imagine faster without arch/ and intrin/ and using mostly std::simd instead of vektor).

Faster could also integrate std::arch specific functions and types, but rather as extensions and helpers (e.g., for striding) for special use cases, instead of using them as internal fundamentals.
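To make the "SIMD-friendly iteration" idea concrete, here is a minimal sketch in plain Rust of the bookkeeping such a library would own: walking two zipped slices of arbitrary length in fixed-width chunks and finishing the leftover tail one scalar at a time. The names zip_add and LANES are hypothetical and not part of faster or std::simd. The inner loop stands in for a single packed-vector operation.

```rust
/// Lanes per chunk; 4 is an arbitrary choice for this sketch.
const LANES: usize = 4;

/// Hypothetical helper: element-wise sum of two equally long slices.
/// Full chunks are the part a packed vector type would handle;
/// the remainder is processed one scalar at a time.
pub fn zip_add(a: &[f32], b: &[f32], out: &mut [f32]) {
    assert_eq!(a.len(), b.len());
    assert_eq!(a.len(), out.len());

    let mut a_chunks = a.chunks_exact(LANES);
    let mut b_chunks = b.chunks_exact(LANES);
    let mut out_chunks = out.chunks_exact_mut(LANES);

    // Full-width chunks: each iteration corresponds to one vector operation.
    for ((ca, cb), co) in (&mut a_chunks).zip(&mut b_chunks).zip(&mut out_chunks) {
        for i in 0..LANES {
            co[i] = ca[i] + cb[i];
        }
    }

    // Tail: fewer than LANES elements are left in each slice.
    let ra = a_chunks.remainder();
    let rb = b_chunks.remainder();
    let ro = out_chunks.into_remainder();
    for ((x, y), o) in ra.iter().zip(rb).zip(ro) {
        *o = x + y;
    }
}
```

Calling zip_add on slices of length 6 would process one full chunk of 4 and a scalar tail of 2. Under the proposal, only the inner loop would need to know about packed types.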

AdamNiederer commented 6 years ago

I've always intended to remove intrinsics which are implemented in std::simd, but only once they've been rfc'd in explicitly or stabilized. I do think it's a good idea for faster to add some basic SIMD algorithms which can be done on most architectures (at least x86_64 and aarch64). Stuff like the vector popcnts.

The iterator system is definitely going to be faster's main value-add after std::simd is stabilized, however. I don't think they're trying to break into that space, and I don't want to duplicate the work they're doing.

I think the degree to which we can eschew std::arch and my wrapper is pretty reliant on the surface area of std::simd. I need vector masks, gathers, scatters, and certain types of shuffles to make many of the iterators performant.

ralfbiedert commented 6 years ago

Just an update that I'm a bit stuck.

The good news is that, with the latest changes in packed_simd, I was able to compile a faster core that relies on nothing but std::simd. In contrast to the current faster it is very thin, but most intrinsics are missing right now:

https://github.com/ralfbiedert/faster/tree/budget_cuts

In parallel, I was trying to update vektor to the latest stdsimd changes.

It's frustrating, since apparently vektor now also needs to rely on std::simd, and with the introduction of #[rustc_deprecated] and #[stable] in the crate, scraping isn't straightforward anymore. Both attributes produce error: stability attributes may not be used outside of the standard library, which ultimately means a more fragile scrape.py that needs to handle these attributes and the respective deprecations.

I could push forward either way, but neither one seems to be easy:

A) Fixing vektor and the scraper could work, but the more I look at it, the less I like it. It feels hacky (i.e., easy to break with new Rust versions), and essentially just creates another abstraction next to std::simd.

B) Ditching vektor for std::simd on the other hand will require bigger changes in the code. You mentioned you also wanted most existing intrinsics, so that means arch/ would probably end up looking more like packed_simd internally (i.e., manually calling std::arch intrinsics and transmuting parameters).

It will be quite some work to get them back in place; work there might interfere with your plans of adding dynamic feature selection.
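For readers unfamiliar with the pattern mentioned in option B), the following is a rough sketch of "manually calling std::arch intrinsics and transmuting parameters". F32x4 and add_lanes are hypothetical names made up for illustration; this is not code from faster, vektor, or packed_simd. It assumes SSE, which is part of the x86_64 baseline, and falls back to scalar code on other targets.

```rust
/// Hypothetical portable 4-lane f32 vector (illustration only).
/// The alignment matches __m128 so the two can be transmuted into each other.
#[derive(Clone, Copy, Debug, PartialEq)]
#[repr(C, align(16))]
pub struct F32x4(pub [f32; 4]);

impl F32x4 {
    /// Lane-wise addition backed by an SSE intrinsic on x86_64.
    #[cfg(target_arch = "x86_64")]
    pub fn add_lanes(self, other: F32x4) -> F32x4 {
        use std::arch::x86_64::{__m128, _mm_add_ps};
        use std::mem::transmute;
        // F32x4 and __m128 are both 16 bytes, so the transmutes are size-correct.
        // SSE is always available on x86_64, so no runtime detection is needed.
        unsafe {
            let a: __m128 = transmute(self);
            let b: __m128 = transmute(other);
            transmute(_mm_add_ps(a, b))
        }
    }

    /// Scalar fallback for every other architecture.
    #[cfg(not(target_arch = "x86_64"))]
    pub fn add_lanes(self, other: F32x4) -> F32x4 {
        let mut out = self.0;
        for i in 0..4 {
            out[i] += other.0[i];
        }
        F32x4(out)
    }
}
```

Every restored intrinsic would need a wrapper of roughly this shape per architecture, which gives a sense of the amount of work option B) implies.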

Option B) is still my favorite due to the cleaner code it promises. However, I feel I can't really push this forward myself, as it involves making some major architectural judgement calls that might interfere with dynamic feature selection, and it would cut down the available intrinsics until they have been restored bit by bit.

Option A) I wouldn't really want to touch after my latest stdsimd experiments, unless you / someone looks at it and affirms it really still is the way to go (and maybe fixes scraping and the amalgamation of vektor with std::simd).

gnzlbg commented 6 years ago

> I don't think they're trying to break into that space,

I can confirm that this is not the intent. std::simd should just provide a way to portably work with packed SIMD vectors, a minimum common denominator of sorts. Iterators and other higher level constructs probably belong somewhere else.

> I need vector masks, gathers, scatters, and certain types of shuffles to make many of the iterators performant.

Portable vector masks and shuffles are already available. Portable masked vector gather, scatters, as well as compressed stores and uncompressed loads are partially implemented. A PR should land on packed_simd soon with them. I aim to do a 0.2 release with these features available.

How well these will work in practice, and whether std::arch will be needed to work around imperfect codegen, remains to be seen, but I consider these to be bugs in std::simd, so that workarounds could be added there (llvm's x86 gather and scatters are implemented on top of the portable ones IIRC, so at least for the cases in which there exists a corresponding x86 instruction, the portable gather and scatter should already work ok).
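As a reference for what a masked gather and scatter compute, here is a scalar sketch in plain Rust. It is only a model of the semantics, not the packed_simd API: the function names are hypothetical, and it uses slice indices where a real implementation might use a vector of pointers.

```rust
/// Hypothetical scalar model of a 4-lane masked gather:
/// active lanes load src[idx[lane]], inactive lanes keep the fallback value.
pub fn masked_gather(
    src: &[f32],
    idx: [usize; 4],
    mask: [bool; 4],
    fallback: [f32; 4],
) -> [f32; 4] {
    let mut out = fallback;
    for lane in 0..4 {
        if mask[lane] {
            out[lane] = src[idx[lane]];
        }
    }
    out
}

/// Hypothetical scalar model of a 4-lane masked scatter:
/// active lanes store values[lane] to dst[idx[lane]], inactive lanes write nothing.
pub fn masked_scatter(dst: &mut [f32], idx: [usize; 4], mask: [bool; 4], values: [f32; 4]) {
    for lane in 0..4 {
        if mask[lane] {
            dst[idx[lane]] = values[lane];
        }
    }
}
```

An iterator that strides through memory could be expressed as a gather with a regular index pattern, with the mask switching off lanes that would run past the end of the slice.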

AdamNiederer commented 6 years ago

> Portable vector masks and shuffles are already available. Portable masked vector gather, scatters, as well as compressed stores and uncompressed loads are partially implemented. A PR should land on packed_simd soon with them. I aim to do a 0.2 release with these features available.

That's good to hear. Apologies if I'm a bit out of the loop, but is the current iteration of the std::simd RFC a good approximation of what we'll be looking at once it's merged? I know there was a bit of churn on it previously.

gnzlbg commented 6 years ago

> Apologies if I'm a bit out of the loop, but is the current iteration of the std::simd RFC a good approximation of what we'll be looking at once it's merged?

I'd say 95% of it is a good approximation. Some method names have changed in packed_simd, but all other packed_simd changes are backwards compatible with the RFC.

The main change is that all types in the RFC like f32x4 are now type aliases to a single Simd<[T; N]> type. This was required by the gather / scatters which use vectors of pointers, so that users can write Simd<[*const *const *const f32; 4]> which is just a vector of 4 pointers. This should also make the library easier to use once const generics land.
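The shape of that design can be sketched in a few lines of plain Rust. This is a simplified model under stated assumptions, not packed_simd's actual definition: Simd here is a bare wrapper around an array, and cptrx4 and read_all are hypothetical names chosen for the example.

```rust
/// Simplified stand-in for a single generic vector type (illustration only).
#[derive(Clone, Copy, Debug, PartialEq)]
pub struct Simd<A>(pub A);

/// The familiar names become plain type aliases of the generic type.
#[allow(non_camel_case_types)]
pub type f32x4 = Simd<[f32; 4]>;

/// Hypothetical alias: a vector of four const pointers to T.
#[allow(non_camel_case_types)]
pub type cptrx4<T> = Simd<[*const T; 4]>;

impl<T: Copy> Simd<[*const T; 4]> {
    /// Reads through every lane's pointer, i.e. an unmasked gather.
    ///
    /// # Safety
    /// Every pointer must be valid for reads and properly aligned.
    pub unsafe fn read_all(self) -> Simd<[T; 4]> {
        let p = self.0;
        unsafe { Simd([*p[0], *p[1], *p[2], *p[3]]) }
    }
}
```

Because the element type is an ordinary type parameter, nested pointer vectors such as Simd<[*const *const *const f32; 4]> need no extra definitions in this model.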

The most controversial thing in the RFC is the approximate floating-point methods, so as long as you don't use those you should be fine. I am hopeful that we can include them in some form, but there will be bikeshedding about the approximation error, how to control it, etc.

The largest missing feature in packed_simd with respect to the RFC is making all arithmetic checked by default. Right now it is all wrapping.
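The difference between the two behaviours can be shown lane by lane with scalar integer methods from the standard library. The thread does not spell out how checked vector arithmetic would report overflow, so returning an Option below is just one possible reading; both function names are hypothetical.

```rust
/// Wrapping lane-wise addition: each lane overflows modulo 256.
pub fn wrapping_add_lanes(a: [u8; 4], b: [u8; 4]) -> [u8; 4] {
    let mut out = [0u8; 4];
    for i in 0..4 {
        out[i] = a[i].wrapping_add(b[i]);
    }
    out
}

/// Checked lane-wise addition: returns None if any lane overflows.
/// (Assumption: a real checked vector add might panic instead.)
pub fn checked_add_lanes(a: [u8; 4], b: [u8; 4]) -> Option<[u8; 4]> {
    let mut out = [0u8; 4];
    for i in 0..4 {
        out[i] = a[i].checked_add(b[i])?;
    }
    Some(out)
}
```

With a first lane of 250 + 10, the wrapping version yields 4 in that lane, while the checked version reports the overflow for the whole vector.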