QuState / PhastFT

A high-performance, "quantum-inspired" Fast Fourier Transform (FFT) library written in pure and safe Rust.
Apache License 2.0

Add functions to support interleaved complex numbers #27

Closed. smu160 closed this 1 week ago.

smu160 commented 2 months ago

@calebzulawski @Shnatsel

A draft PR to address issue #22.

codecov-commenter commented 2 months ago

Please install the Codecov GitHub app to ensure uploads and comments are reliably processed by Codecov.

Codecov Report

All modified and coverable lines are covered by tests.

Project coverage is 99.16%. Comparing base (20fafba) to head (713b3a6).

Additional details and impacted files

```diff
@@           Coverage Diff           @@
##             main      #27   +/-   ##
=======================================
  Coverage   99.16%   99.16%
=======================================
  Files           8        8
  Lines         841      841
=======================================
  Hits          834      834
  Misses          7        7
```

View full report in Codecov by Sentry.

smu160 commented 2 months ago

Hi @calebzulawski,

We now have the basics of what we'd need to support signals that are stored as an array of complex numbers. Looking forward to hearing your thoughts!

Best, Saveliy

calebzulawski commented 2 months ago

Something like this looks good to me! Are you open to possibly using a more trait-oriented design in the future? I wouldn't want to hold up this functionality, but I think various input and output types could be combined into one generic function.
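Purely as an illustration of that idea, here is one possible shape for a trait-oriented front end. This is a speculative sketch; none of these names exist in PhastFT.

```rust
use num_complex::Complex;

/// Hypothetical trait: any signal layout that can be split into planar
/// real/imaginary buffers for the FFT.
trait AsPlanar {
    fn as_planar(&self) -> (Vec<f64>, Vec<f64>);
}

impl AsPlanar for [Complex<f64>] {
    fn as_planar(&self) -> (Vec<f64>, Vec<f64>) {
        (
            self.iter().map(|c| c.re).collect(),
            self.iter().map(|c| c.im).collect(),
        )
    }
}

/// Hypothetical single generic entry point over any supported input layout.
fn fft_planar<S: AsPlanar + ?Sized>(signal: &S) -> (Vec<f64>, Vec<f64>) {
    let (reals, imags) = signal.as_planar();
    // ... run the actual FFT on the planar buffers here ...
    (reals, imags)
}
```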

smu160 commented 2 months ago

Hi @calebzulawski,

Thank you for the review! I think your trait-oriented solution would be great. Just to clarify, do we want to wait on merging this PR? Or do we want this PR as an interim solution?

Best, Saveliy

calebzulawski commented 2 months ago

I would probably merge it to have something to start from; the trait-based design is just something to consider before a 1.0 release.

Shnatsel commented 2 months ago

I've taken the liberty of rewriting separate_re_im and pushing to your branch.

I couldn't get it to vectorize on godbolt, but at least it should have the right public API, and we can always optimize it later.
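For reference, a minimal sketch of what such a public API might look like, assuming the goal is to split a slice of `num_complex::Complex<T>` into separate real and imaginary vectors (the actual signature on the branch may differ):

```rust
use num_complex::Complex;

/// Illustrative scalar version: split an interleaved complex signal into
/// separate real and imaginary vectors.
pub fn separate_re_im<T: Copy>(signal: &[Complex<T>]) -> (Vec<T>, Vec<T>) {
    let reals = signal.iter().map(|c| c.re).collect();
    let imags = signal.iter().map(|c| c.im).collect();
    (reals, imags)
}
```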

Shnatsel commented 2 months ago

I tried a bunch more things and couldn't get it to autovectorize: https://godbolt.org/z/7Kd8v4vPG

We might need something like bytemuck here to view a Complex<T> as a &[T; 2], or, even better, &[Complex<T>] as &[T] with double the length.
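For illustration, a hedged sketch of that `&[Complex<T>]` to `&[T]` view using `bytemuck::cast_slice`. It assumes `Complex<T>: Pod`, which num-complex provides behind its optional `bytemuck` feature; the helper name is illustrative.

```rust
use bytemuck::Pod;
use num_complex::Complex;

/// View a slice of complex numbers as a flat slice of twice the length:
/// [re0, im0, re1, im1, ...]. Complex<T> is #[repr(C)], so the cast is
/// sound when Complex<T> implements Pod.
fn as_interleaved<T: Pod>(signal: &[Complex<T>]) -> &[T]
where
    Complex<T>: Pod,
{
    bytemuck::cast_slice(signal)
}
```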

calebzulawski commented 2 months ago

I think the approach I would take is to iterate by chunks, bytemuck to [T; N * 2], Simd::deinterleave, wrapped in #[multiversion].
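A rough sketch of that approach, assuming nightly Rust for `std::simd` plus the `bytemuck` and `multiversion` crates, with num-complex's `bytemuck` feature enabled so `Complex<f64>` is `Pod`. The function name and lane count are illustrative, not the PR's final code.

```rust
#![feature(portable_simd)]

use std::simd::Simd;

use multiversion::multiversion;
use num_complex::Complex;

const LANES: usize = 8;

/// Illustrative only: split an interleaved complex signal into planar
/// real/imaginary vectors using chunked SIMD deinterleaving.
#[multiversion(targets = "simd")]
fn separate_re_im_simd(signal: &[Complex<f64>]) -> (Vec<f64>, Vec<f64>) {
    // View &[Complex<f64>] as a flat [re0, im0, re1, im1, ...] slice.
    let flat: &[f64] = bytemuck::cast_slice(signal);

    let mut reals = Vec::with_capacity(signal.len());
    let mut imags = Vec::with_capacity(signal.len());

    let mut chunks = flat.chunks_exact(2 * LANES);
    for chunk in &mut chunks {
        // Two consecutive registers cover LANES complex numbers.
        let a = Simd::<f64, LANES>::from_slice(&chunk[..LANES]);
        let b = Simd::<f64, LANES>::from_slice(&chunk[LANES..]);
        // deinterleave separates even-indexed lanes (reals) from
        // odd-indexed lanes (imaginaries).
        let (re, im) = a.deinterleave(b);
        reals.extend_from_slice(re.as_array());
        imags.extend_from_slice(im.as_array());
    }

    // Scalar tail for lengths that are not a multiple of LANES.
    for pair in chunks.remainder().chunks_exact(2) {
        reals.push(pair[0]);
        imags.push(pair[1]);
    }

    (reals, imags)
}
```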

Shnatsel commented 2 months ago

Yeah that sounds good to me

calebzulawski commented 2 months ago

I am working on a simd-complex crate; this seems like an important enough algorithm to include there as an AoSoA <-> AoS conversion.

smu160 commented 2 months ago

@Shnatsel

I have a rough version of bytemuck + Simd deinterleave on godbolt.

Let me know your thoughts. Thank you!

Shnatsel commented 2 months ago

I've taken a look at the Godbolt version. Wow, that is some aggressive loop unrolling! It looks good to me though.

One more thing to try: we need to benchmark if iterating over vec![0; n] is faster or slower than Vec::with_capacity(n) plus Vec::extend_from_slice(deinterleaved.to_array()) in the main loop. The latter incurs slightly more instructions but avoids a memset to zero out both vectors.
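To make the comparison concrete, here are the two variants in scalar form. This is illustrative only: the real loop uses the SIMD deinterleave sketched above, and the function names simply mirror the benchmark labels below.

```rust
/// Variant A: pre-zeroed output vectors, written by index.
/// The allocation incurs a memset of both vectors up front.
fn with_zeroed_vecs(flat: &[f64]) -> (Vec<f64>, Vec<f64>) {
    let n = flat.len() / 2;
    let mut reals = vec![0.0; n];
    let mut imags = vec![0.0; n];
    for (i, pair) in flat.chunks_exact(2).enumerate() {
        reals[i] = pair[0];
        imags[i] = pair[1];
    }
    (reals, imags)
}

/// Variant B: reserve capacity and append, avoiding the up-front memset
/// at the cost of a few extra instructions per iteration.
fn with_capacity(flat: &[f64]) -> (Vec<f64>, Vec<f64>) {
    let n = flat.len() / 2;
    let mut reals = Vec::with_capacity(n);
    let mut imags = Vec::with_capacity(n);
    for pair in flat.chunks_exact(2) {
        reals.push(pair[0]);
        imags.push(pair[1]);
    }
    (reals, imags)
}
```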

smu160 commented 2 months ago

@Shnatsel I added the implementation using bytemuck + portable SIMD deinterleave. The solution in godbolt had a bounds check, so that was removed as well. You can see the difference in godbolt.

smu160 commented 2 months ago

@Shnatsel

> One more thing to try: we need to benchmark if iterating over vec![0; n] is faster or slower than Vec::with_capacity(n) plus Vec::extend_from_slice(deinterleaved.to_array()) in the main loop. The latter incurs slightly more instructions but avoids a memset to zero out both vectors.

I added a benchmark using criterion to test this idea. The benchmark is probably not leveraging criterion to its full potential, so let me know what you think.
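A hedged sketch of what such a criterion benchmark could look like, reusing the two illustrative `with_zeroed_vecs`/`with_capacity` variants sketched above (the actual `benches/bench.rs` may differ):

```rust
use criterion::{criterion_group, criterion_main, BenchmarkId, Criterion, Throughput};

fn bench_separate_re_im(c: &mut Criterion) {
    let mut group = c.benchmark_group("separate_re_im");
    for n in [1024, 4096, 16384, 65536, 262144, 1048576] {
        // Interleaved [re, im, re, im, ...] input holding n complex numbers.
        let signal: Vec<f64> = (0..2 * n).map(|i| i as f64).collect();
        group.throughput(Throughput::Elements(n as u64));
        group.bench_with_input(BenchmarkId::new("with_zeroed_vecs", n), &signal, |b, s| {
            b.iter(|| with_zeroed_vecs(s))
        });
        group.bench_with_input(BenchmarkId::new("with_capacity", n), &signal, |b, s| {
            b.iter(|| with_capacity(s))
        });
    }
    group.finish();
}

criterion_group!(benches, bench_separate_re_im);
criterion_main!(benches);
```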

Running benches/bench.rs (target/release/deps/bench-3ef076ccef6e83be)
Gnuplot not found, using plotters backend
separate_re_im/with_zeroed_vecs/1024
                        time:   [405.64 ns 408.02 ns 411.05 ns]
                        thrpt:  [2.4912 Gelem/s 2.5097 Gelem/s 2.5244 Gelem/s]
Found 16 outliers among 100 measurements (16.00%)
  3 (3.00%) high mild
  13 (13.00%) high severe
separate_re_im/with_capacity/1024
                        time:   [422.60 ns 425.43 ns 428.32 ns]
                        thrpt:  [2.3907 Gelem/s 2.4070 Gelem/s 2.4231 Gelem/s]
Found 18 outliers among 100 measurements (18.00%)
  1 (1.00%) low mild
  2 (2.00%) high mild
  15 (15.00%) high severe
separate_re_im/with_zeroed_vecs/4096
                        time:   [1.9314 µs 1.9407 µs 1.9508 µs]
                        thrpt:  [2.0996 Gelem/s 2.1106 Gelem/s 2.1207 Gelem/s]
Found 10 outliers among 100 measurements (10.00%)
  4 (4.00%) low mild
  5 (5.00%) high mild
  1 (1.00%) high severe
separate_re_im/with_capacity/4096
                        time:   [1.7073 µs 1.7213 µs 1.7380 µs]
                        thrpt:  [2.3567 Gelem/s 2.3795 Gelem/s 2.3991 Gelem/s]
Found 19 outliers among 100 measurements (19.00%)
  17 (17.00%) low mild
  2 (2.00%) high severe
separate_re_im/with_zeroed_vecs/16384
                        time:   [8.1350 µs 8.4114 µs 8.7453 µs]
                        thrpt:  [1.8735 Gelem/s 1.9478 Gelem/s 2.0140 Gelem/s]
separate_re_im/with_capacity/16384
                        time:   [8.5694 µs 8.6258 µs 8.7400 µs]
                        thrpt:  [1.8746 Gelem/s 1.8994 Gelem/s 1.9119 Gelem/s]
Found 9 outliers among 100 measurements (9.00%)
  3 (3.00%) high mild
  6 (6.00%) high severe
separate_re_im/with_zeroed_vecs/65536
                        time:   [32.902 µs 33.322 µs 33.821 µs]
                        thrpt:  [1.9377 Gelem/s 1.9668 Gelem/s 1.9918 Gelem/s]
Found 2 outliers among 100 measurements (2.00%)
  1 (1.00%) high mild
  1 (1.00%) high severe
separate_re_im/with_capacity/65536
                        time:   [39.719 µs 39.987 µs 40.303 µs]
                        thrpt:  [1.6261 Gelem/s 1.6389 Gelem/s 1.6500 Gelem/s]
Found 27 outliers among 100 measurements (27.00%)
  3 (3.00%) low severe
  11 (11.00%) low mild
  3 (3.00%) high mild
  10 (10.00%) high severe
separate_re_im/with_zeroed_vecs/262144
                        time:   [243.93 µs 247.17 µs 251.19 µs]
                        thrpt:  [1.0436 Gelem/s 1.0606 Gelem/s 1.0747 Gelem/s]
Found 3 outliers among 100 measurements (3.00%)
  2 (2.00%) high mild
  1 (1.00%) high severe
separate_re_im/with_capacity/262144
                        time:   [276.91 µs 279.97 µs 283.36 µs]
                        thrpt:  [925.12 Melem/s 936.31 Melem/s 946.67 Melem/s]
Found 3 outliers among 100 measurements (3.00%)
  2 (2.00%) high mild
  1 (1.00%) high severe
Benchmarking separate_re_im/with_zeroed_vecs/1048576: Warming up for 3.0000 s
Warning: Unable to complete 100 samples in 5.0s. You may wish to increase target time to 6.0s, enable flat sampling, or reduce sample count to 60.
separate_re_im/with_zeroed_vecs/1048576
                        time:   [1.2126 ms 1.2465 ms 1.2828 ms]
                        thrpt:  [817.42 Melem/s 841.24 Melem/s 864.72 Melem/s]
Found 16 outliers among 100 measurements (16.00%)
  2 (2.00%) low mild
  1 (1.00%) high mild
  13 (13.00%) high severe
Benchmarking separate_re_im/with_capacity/1048576: Warming up for 3.0000 s
Warning: Unable to complete 100 samples in 5.0s. You may wish to increase target time to 7.4s, enable flat sampling, or reduce sample count to 50.
separate_re_im/with_capacity/1048576
                        time:   [1.4784 ms 1.4848 ms 1.4942 ms]
                        thrpt:  [701.76 Melem/s 706.22 Melem/s 709.26 Melem/s]
Found 7 outliers among 100 measurements (7.00%)
  2 (2.00%) high mild
  5 (5.00%) high severe

Edit: added the benchmark results above.