Open ralfbiedert opened 6 years ago
I've always intended to remove intrinsics which are implemented in std::simd, but only once they've been rfc'd in explicitly or stabilized. I do think it's a good idea for faster to add some basic SIMD algorithms which can be done on most architectures (at least x86_64 and aarch64). Stuff like the vector popcnts.
The iterator system is definitely going to be faster's main value-add after std::simd is stabilized, however. I don't think they're trying to break into that space, and I don't want to duplicate the work they're doing.
I think the degree to which we can eschew std::arch and my wrapper is pretty reliant on the surface area of std::simd. I need vector masks, gathers, scatters, and certain types of shuffles to make many of the iterators performant.
Just an update that I'm a bit stuck.
The good news is that, with the latest changes in packed_simd, I was able to compile a faster core that doesn't rely on anything other than std::simd. In contrast to the current faster it's very thin, and most intrinsics are missing right now:
https://github.com/ralfbiedert/faster/tree/budget_cuts
In parallel, I was trying to update vektor to the latest stdsimd changes.
It's frustrating, since apparently vektor now also needs to rely on std::simd, and with the introduction of #[rustc_deprecated] and #[stable] in the crate, scraping isn't straightforward anymore. Both attributes produce "error: stability attributes may not be used outside of the standard library", which ultimately means a more fragile scrape.py that needs to handle these attributes and the respective deprecations.
I could push forward either way, but neither one seems to be easy:
A) Fixing vektor and the scraper could work, but the more I look at it, the less I like it. It feels hacky (i.e., easy to break with new Rust versions), and essentially just creates another abstraction next to std::simd.
B) Ditching vektor for std::simd, on the other hand, will require bigger changes in the code. You mentioned you also wanted most existing intrinsics, so that means arch/ would probably end up looking more like packed_simd internally (i.e., manually calling std::arch intrinsics and transmuting parameters).
It will be quite some work to get them back in place; work there might interfere with your plans of adding dynamic feature selection.
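To make "manually calling std::arch intrinsics and transmuting parameters" concrete, here is a minimal sketch of what one such wrapper might look like. The `F32x4` type and the `add` function are hypothetical names chosen for illustration; they are not taken from faster, vektor, or packed_simd. Only `__m128`, `_mm_add_ps`, and `transmute` are real standard-library items.

```rust
/// Hypothetical 4-lane wrapper type, used only for this illustration.
/// It is 16 bytes wide, the same size as `__m128`.
#[derive(Clone, Copy, Debug, PartialEq)]
#[repr(C, align(16))]
pub struct F32x4(pub [f32; 4]);

/// x86_64 path: transmute the wrapper into the platform vector type,
/// call the intrinsic, and transmute the result back.
#[cfg(target_arch = "x86_64")]
pub fn add(a: F32x4, b: F32x4) -> F32x4 {
    use std::arch::x86_64::{__m128, _mm_add_ps};
    use std::mem::transmute;
    // SAFETY: `F32x4` and `__m128` have the same size, every bit pattern
    // is valid for both, and SSE is part of the x86_64 baseline.
    unsafe {
        let va: __m128 = transmute(a);
        let vb: __m128 = transmute(b);
        transmute(_mm_add_ps(va, vb))
    }
}

/// Scalar fallback for every other architecture.
#[cfg(not(target_arch = "x86_64"))]
pub fn add(a: F32x4, b: F32x4) -> F32x4 {
    let mut out = [0.0f32; 4];
    for lane in 0..4 {
        out[lane] = a.0[lane] + b.0[lane];
    }
    F32x4(out)
}
```

Every restored intrinsic would need a wrapper of roughly this shape per architecture, which gives a sense of the amount of work involved.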
Option B) is still my favorite due to the cleaner code it promises. However, I feel I can't really push this forward myself, as it involves making some major architectural judgement calls that might interfere with dynamic feature selection, and it would cut down the available intrinsics until they have been restored bit by bit.
Option A) I wouldn't really want to touch after my latest stdsimd experiments, unless you / someone looks at it and affirms it really still is the way to go (and maybe fixes scraping and the amalgamation of vektor with std::simd).
I don't think they're trying to break into that space,
I can confirm that this is not the intent. std::simd should just provide a way to portably work with packed SIMD vectors, a minimum common denominator of sorts. Iterators and other higher level constructs probably belong somewhere else.
I need vector masks, gathers, scatters, and certain types of shuffles to make many of the iterators performant.
Portable vector masks and shuffles are already available. Portable masked vector gather, scatters, as well as compressed stores and uncompressed loads are partially implemented. A PR should land on packed_simd soon with them. I aim to do a 0.2 release with these features available.
How well these will work in practice, and whether std::arch will be needed to work around imperfect codegen, remains to be seen, but I consider these to be bugs in std::simd, so that workarounds could be added there (LLVM's x86 gathers and scatters are implemented on top of the portable ones IIRC, so at least for the cases in which there exists a corresponding x86 instruction, the portable gather and scatter should already work ok).
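For readers unfamiliar with the operations being discussed, the following is a scalar reference sketch of what a 4-lane masked gather and masked scatter compute. The function names and signatures are made up for this illustration and are not the packed_simd API; a real implementation would lower to vector instructions where the hardware has them.

```rust
/// Scalar reference for a 4-lane masked gather (hypothetical helper).
/// Lane `i` reads `src[idx[i]]` when `mask[i]` is set; otherwise it keeps
/// `fallback[i]`. Indices of inactive lanes are never dereferenced.
pub fn masked_gather4(
    src: &[f32],
    idx: [usize; 4],
    mask: [bool; 4],
    fallback: [f32; 4],
) -> [f32; 4] {
    let mut out = fallback;
    for lane in 0..4 {
        if mask[lane] {
            out[lane] = src[idx[lane]];
        }
    }
    out
}

/// Scalar reference for a 4-lane masked scatter (hypothetical helper).
/// Lane `i` writes `values[i]` to `dst[idx[i]]` only when `mask[i]` is set.
pub fn masked_scatter4(
    dst: &mut [f32],
    idx: [usize; 4],
    mask: [bool; 4],
    values: [f32; 4],
) {
    for lane in 0..4 {
        if mask[lane] {
            dst[idx[lane]] = values[lane];
        }
    }
}
```

In both helpers an out-of-range index is harmless as long as its lane is masked off, which is what lets masked operations handle slice tails and sparse access patterns.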
Portable vector masks and shuffles are already available. Portable masked vector gather, scatters, as well as compressed stores and uncompressed loads are partially implemented. A PR should land on packed_simd soon with them. I aim to do a 0.2 release with these features available.
That's good to hear. Apologies if I'm a bit out of the loop, but is the current iteration of the std::simd RFC a good approximation of what we'll be looking at once it's merged? I know there was a bit of churn on it previously.
Apologies if I'm a bit out of the loop, but is the current iteration of the std::simd RFC a good approximation of what we'll be looking at once it's merged?
I'd say 95% of it is a good approximation. Some method names have changed in packed_simd, but otherwise all packed_simd changes are backwards compatible with the RFC.
The main change is that all types in the RFC like f32x4 are now type aliases to a single Simd<[T; N]> type. This was required by the gather / scatters which use vectors of pointers, so that users can write Simd<[*const *const *const f32; 4]> which is just a vector of 4 pointers. This should also make the library easier to use once const generics land.
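The alias-based design can be sketched with a small stand-in type. This is a mock written for illustration, not packed_simd's actual implementation: the `Simd` struct below is a plain array wrapper with no SIMD code generation, the `cptrx4` alias and the `splat`, `lanes`, and `extract` methods are assumptions of this sketch, and the `impl` relies on const generics, which the comment above describes as not yet landed at the time.

```rust
/// Mock stand-in for a single generic vector type, parameterised by an
/// array type in the same `Simd<[T; N]>` shape the comment describes.
#[derive(Clone, Copy, Debug, PartialEq)]
pub struct Simd<A>(pub A);

/// Concrete vector names become plain type aliases.
#[allow(non_camel_case_types)]
pub type f32x4 = Simd<[f32; 4]>;

/// A vector of four pointers needs no new type, only another alias
/// (hypothetical alias name).
#[allow(non_camel_case_types)]
pub type cptrx4 = Simd<[*const f32; 4]>;

impl<T: Copy, const N: usize> Simd<[T; N]> {
    /// Broadcast one value to every lane.
    pub fn splat(value: T) -> Self {
        Simd([value; N])
    }

    /// Number of lanes in this vector type.
    pub fn lanes() -> usize {
        N
    }

    /// Read a single lane; panics if `lane >= N`.
    pub fn extract(&self, lane: usize) -> T {
        self.0[lane]
    }
}
```

Because every alias resolves to the same generic type, methods written once in the `impl` block are available on `f32x4`, on `cptrx4`, and on any other element type a user picks.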
The most controversial thing in the RFC is the approximate floating-point methods, so as long as you don't use those you should be fine. I am hopeful that we can include them in some form, but there will be bikeshedding about the approximation error, how to control it, etc.
The largest missing feature in packed_simd with respect to the RFC is making all arithmetic checked by default. Right now it is all wrapping.
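The difference between wrapping and checked lane arithmetic can be shown on plain arrays with the standard integer methods. This is only a scalar sketch: the two function names are hypothetical, and returning `Option` is one possible way to surface an overflow, chosen here so the result is easy to observe. How the RFC reports overflow is not specified in this thread.

```rust
/// Lane-wise addition that wraps around on overflow, which is the
/// behaviour the comment attributes to packed_simd at the time.
pub fn add_wrapping(a: [u8; 4], b: [u8; 4]) -> [u8; 4] {
    let mut out = [0u8; 4];
    for lane in 0..4 {
        out[lane] = a[lane].wrapping_add(b[lane]);
    }
    out
}

/// Lane-wise addition that reports overflow instead of wrapping.
/// Returns `None` as soon as any lane overflows.
pub fn add_checked(a: [u8; 4], b: [u8; 4]) -> Option<[u8; 4]> {
    let mut out = [0u8; 4];
    for lane in 0..4 {
        out[lane] = a[lane].checked_add(b[lane])?;
    }
    Some(out)
}
```

With inputs `[250, 1, 2, 3]` and `[10, 1, 1, 1]`, the wrapping version silently yields 4 in the first lane, while the checked version returns `None`.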
Opening another ticket since this is a separate discussion from #47 and might be more controversial:
The more I look into the upcoming std::simd, the more I wonder if faster should not become a thinner "SIMD-friendly iteration" library that neatly plugs into std::simd and is really good at handling variable slices, zipping, ... instead of providing a blanket implementation over std::arch.
Right now it seems that many common intrinsics and operations faster provides on packed types are or might be implemented in std::simd (compare coresimd/ppsv).
At the same time, for things that won't be in std::simd (and will be more platform specific), faster will have a hard time providing a consistent performance story anyway.
By that reasoning I see a certain appeal in primarily focusing on a more consistent cross-platform experience with a much lighter code base (e.g., imagine faster without arch/ and intrin/ and using mostly std::simd instead of vektor).
Faster could also integrate std::arch specific functions and types, but rather as extensions and helpers (e.g., for striding) for special use cases, instead of using them as internal fundamentals.
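As a rough illustration of what "SIMD-friendly iteration" over variable-length slices involves, the sketch below processes a slice in fixed-width chunks and handles the leftover tail with a scalar closure. It uses only the standard library; `map_in_chunks` is a hypothetical helper and does not reflect faster's actual iterator API, and plain `[f32; 4]` arrays stand in for real vector types.

```rust
/// Apply `vector_op` to each full 4-element chunk of `input` and
/// `scalar_op` to the elements left over at the end (hypothetical helper).
pub fn map_in_chunks<F, G>(input: &[f32], mut vector_op: F, mut scalar_op: G) -> Vec<f32>
where
    F: FnMut([f32; 4]) -> [f32; 4],
    G: FnMut(f32) -> f32,
{
    let mut out = Vec::with_capacity(input.len());
    let mut chunks = input.chunks_exact(4);

    // Full-width chunks: this is where a packed vector type would be used.
    for chunk in &mut chunks {
        let lanes = [chunk[0], chunk[1], chunk[2], chunk[3]];
        out.extend_from_slice(&vector_op(lanes));
    }

    // Tail: fewer than 4 elements remain, so fall back to scalar code.
    for &x in chunks.remainder() {
        out.push(scalar_op(x));
    }
    out
}
```

An iteration library in this spirit would own the chunking and tail handling, leaving the per-vector arithmetic to whatever portable vector type sits underneath.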