ejmahler / RustFFT

RustFFT is a high-performance FFT library written in pure Rust.
Apache License 2.0

SSE support #60

Closed HEnquist closed 3 years ago

HEnquist commented 3 years ago

This is still very much a work in progress, but I thought I would show what I'm up to. Looking at the AVX code, it's quite big, and many of the things there don't look like they would work well with the much smaller SSE instruction sets. Instead I started out from the scalar code and made a hybrid solution, where I use SSE code wherever it's ready and fall back to the scalar code otherwise. For now I have butterflies of length 2, 3, 4, 5, 8, 16, 32, and Radix4 working. The other algorithms will use these as inners, to also get some speedup.

Surprisingly, I get about the same speedup for f32 as for f64. For Radix4 it varies between +30% and +90%. There may be more to gain by tweaking a bit here and there. For the individual butterflies the gain increases with the length, and for 16 and 32 they are about 3x as fast as the scalar versions (both for f32 and f64). I will continue now with implementing the rest of the butterflies. (I haven't abandoned the estimating planner; this work took over when it started going well.)

HEnquist commented 3 years ago

Time for an update. I have now implemented all the butterflies, plus a few more that should perhaps also be added to the scalar version. I also polished the Radix4 a bit. Here is an updated comparison: power2new

I also ran a check from length 2 to 300: 2to300

Next step is to work on the mixed radix algos, since I'm still using the scalar ones (but with the sse butterflies and radix4 as inners).

ejmahler commented 3 years ago

It looks like there are a bunch of sizes on AVX that are slower than scalar, which surprises me. Can you tell what some of those sizes are?

HEnquist commented 3 years ago

I'll make a list of the sizes where the AVX code is slow. It's only a handful out of the 300+300 I checked.

Edit: I checked, and it's the same lengths for f32 and f64: 169, 221, 247, 289, 299

HEnquist commented 3 years ago

I have looked into the mixed radixes, Bluestein's, and Rader's, and I think the benefit from SSE:ing those would be quite small. Something like 80-90% of the time is spent on the inner FFTs, so there isn't much to work with. If you are ok with that, I will continue using the scalar ones for now.

HEnquist commented 3 years ago

Here are the latest graphs: Lengths 2 to 300: 2to300new

Same data as histograms: 2to300hist

I think I'm close to something that could be merged. Could you take a look?

ejmahler commented 3 years ago

Sure. I'll pull it and go through it in more detail.

ejmahler commented 3 years ago

One last thought: I'd like to think more about the SSE prime butterflies. Maybe find a way to abstract out common operations, or something. I think one of my biggest problems with FFTW is that its autogenerated code is completely unapproachable, and I think these fall victim to the same problem. I don't think they'll ever be perfectly readable, because these big prime butterflies are inherently repetitive and verbose, but any improvement would be welcome. Can any data be stored in arrays and processed in loops, for example?

Overall, this PR is clearly the result of a lot of hard work, and I really appreciate it. I don't think this PR is quite ready yet, but it won't take much to get it there. Thank you!

ejmahler commented 3 years ago

One last last thought: I think we should create a crate feature for SSE, the same way we did for AVX, so that users can opt out of it if they care more about compile time/binary size than they care about FFT speed.

HEnquist commented 3 years ago

I'll take a look at the autogenerated butterflies. They are pretty horrible, yes; that's why I have hidden them in a separate file. It might well be possible to use loops or macros or something to get rid of parts of it.

I'm thinking that the short butterflies (up to 7 or so) are quite readable, and there I made an effort to document the procedure. The longer ones follow the same logic but get increasingly difficult to read. They keep the same structure; there is just more of everything.

I have tried reading the FFTW sources to see how it does things, but that turned out to be pretty much impossible.

HEnquist commented 3 years ago

I have looked at how to improve the autogenerated butterflies. Loops are unfortunately difficult to use, but I could shorten things a bit by using length-2 FFTs for some steps, and by using a macro for the long "value = temp1 + temp2 - temp3 + temp4 + ..." chains. I tried it on the length-11 one. Can you take a look here: https://github.com/HEnquist/RustFFT/blob/calcmacros/src/sse/sse_prime_butterflies.rs#L471

It runs at the same speed as before. If this looks good, I will update all the butterflies in the same way.

ejmahler commented 3 years ago

Wow, that will be a dramatic improvement. I especially like the macro - did you write that yourself?

Definitely get that in. Thanks for researching that!

That will reduce 20-30 long call pyramids to single lines, which I hope will bring the length of these functions down to something approachable.

HEnquist commented 3 years ago

Great, thanks! Then I'll apply this to all the butterflies. I wrote the macro myself yes. For some reason I really enjoy solving things with macros.

HEnquist commented 3 years ago

I tried running rustfmt on the autogenerated butterflies, but it made a mess of the calc! macros. Those lines tend to become a bit long, but since it becomes a table I think it's quite ok. Like here: https://github.com/HEnquist/RustFFT/blob/ssesimple/src/sse/sse_prime_butterflies.rs#L3317 The formatter sometimes put each term on its own row, and sometimes broke the row into a few shorter ones. The result was much harder to read, so I excluded that file from formatting. Ok?

ejmahler commented 3 years ago

Yeah, excluding any autogenerated file from rustfmt seems reasonable to me.

ejmahler commented 3 years ago

Where do you think this is at? If you think it's ready, I can do another review pass.

HEnquist commented 3 years ago

There is one naming question left: https://github.com/ejmahler/RustFFT/pull/60#discussion_r599948119 Apart from that I think it's in good shape.

HEnquist commented 3 years ago

I realized the Radix4 bit-reverse shuffle needs some more work. I started looking into using it for the scalar Radix4 as well, and made some proper benches for just the shuffle. Right now it's faster at medium lengths (about 10k - 100k), but it actually gets slower for large arrays. I have some ideas for how to solve that. It would also be much nicer to use the same shuffle function for the SSE and scalar Radix4, instead of having two copies like here. I would propose that we pause this PR for the moment, and I sort out the scalar shuffler first in a separate PR. After that is ready and merged, I'll update this PR to use the same shuffler. What do you think?

ejmahler commented 3 years ago

Instead of putting this on hold, it would make sense to me to just have the SSE code use the radix4 reordering from master. That way, we can check this in sooner, and we can consider the reordering improvement to be an orthogonal change that this doesn't depend on.

Unrelated: I found that doing direct bit reversal was too slow compared to the recursive approach with scalar code, but I wonder if it would be faster using SSE?

HEnquist commented 3 years ago

Ready for review! I went back to the scalar shuffler for the radix4, and changed the function names, all the ones with 1st and 2nd in them.

I have played a little with the bit reverse. This version runs at the same speed as the current one:

pub fn reverse_bits(value: usize, bits: usize) -> usize {
    // Reverse `bits` two-bit groups (base-4 digits) of `value`.
    let mut result: usize = 0;
    let mut value = value;
    for _ in 0..bits {
        result = (result << 2) + (value & 0x03);
        value >>= 2;
    }
    result
}

// Safety: caller must ensure input and output are at least width * height long.
pub unsafe fn bitrev_transpose<T: Copy>(width: usize, height: usize, input: &[T], output: &mut [T], bits: usize) {
    for x in 0..width {
        let x_rev = reverse_bits(x, bits);
        for y in 0..height {
            let input_index = x_rev + y * width;
            let output_index = y + x * height;

            *output.get_unchecked_mut(output_index) = *input.get_unchecked(input_index);
        }
    }
}

It walks through input and output arrays in the same order as the recursive one. Now I'll start working on getting this to split the work into cache friendly chunks.

HEnquist commented 3 years ago

I believe I have addressed all the comments now, except the question about target_feature vs inline(always) where I need a little input.

ejmahler commented 3 years ago

Thanks for the update, and thanks for doing some thorough research into the inlining issue. I finally have a day off, so I'll be looking into this today. I anticipate being able to merge this today.

ejmahler commented 3 years ago

Ok, after digging into it, I understand the problem space a little more. Because of the distinction between single and parallel f32 (basically, parallel vs remainder), we can't possibly avoid having multiple copies of the core FFT function. Moving the inline vs SSE 4.1 declarations around just slightly changes what gets duplicated and what doesn't.

But right now, because of the structure of

let alldone = array_utils::iter_chunks(buffer, 2 * self.len(), |chunk| {
    self.perform_parallel_fft_butterfly(chunk)
});
if alldone.is_err() && buffer.len() >= self.len() {
    let len = buffer.len();
    self.perform_fft_butterfly(&mut buffer[len - self.len()..]);
}

being inside a target feature, both of these functions get inlined together along with the loop code. The end result is fewer function calls and probably better instruction locality.

I noticed that f64 doesn't need the distinction between main-loop and remainder FFTs, because it can only store one at a time, so I took a stab at applying this inline change there to see if it suffers from the same 5-10% drawback - and it turns out it does. So it definitely isn't instruction locality, because inspecting the assembly confirms that the actual FFT function code doesn't get duplicated.

My last hunch is that maybe the un-inlineable function call in the loop is the problem: When the target_feature is on perform_fft_contiguous, we run a loop that does nothing but call perform_fft_contiguous n times - and without some refactoring there's no way to even test whether that's the case. So for now I'm going to shelve this line of thought. I would like to revisit it at some point after this merges though, because if we can get it to work without the performance hit, there will be a huge binary size reduction.

ejmahler commented 3 years ago

And it's in! Thanks again for making this happen.

My plan is to look into the precision issues today, then publish a v5.1, then update num-complex, then push v6.0. Hopefully all today.

HEnquist commented 3 years ago

Excellent! Thank you so much :) This means I should hurry up with the new radix 4 shuffler. It's ready, just needs some cleanup. I will do it today!