Closed · daniellga closed this issue 2 years ago
What are the exact sizes you are benchmarking? Could you make a series with just powers of two? What CPU is this?
Torch seems to be using Intel mkl when running on the CPU. For powers of two that should be roughly the same speed as rustfft, but for other lengths I don't know.
For the benchmark above, n starts at 25 and increases in steps of 200 up to 50000 (25, 225, 425, ...). Every value in the vector is the same: 1 + 1i.
For powers of 2, I am showing two graphs of the same data; the first has a log10-scaled x axis.
code:
library(bench)
library(torch)
library(reticulate)
library(ggplot2)
# powers of 2
results <- bench::press(
  n = 2^seq(1, 20, 1),
  {
    x = rep(1+1i, n)
    py$x = x
    y = torch_tensor(x)
    py_run_string("import numpy as np", convert = FALSE)
    mark(
      audior::fft(x),
      stats::fft(x),
      torch::torch_fft_fft(y),
      py_eval("np.fft.fft(x)", convert = FALSE),
      iterations = 100,
      check = FALSE
    )
  }
)

ggplot(results) +
  geom_point(aes(x = n, y = median, color = as.character(expression))) +
  scale_colour_discrete(
    name = "implementation",
    labels = c("rustfft", "numpy.fft.fft (via reticulate)", "stats::fft", "torch::torch_fft_fft")
  ) +
  scale_x_continuous(trans = 'log10')

ggplot(results) +
  geom_point(aes(x = n, y = median, color = as.character(expression))) +
  scale_colour_discrete(
    name = "implementation",
    labels = c("rustfft", "numpy.fft.fft (via reticulate)", "stats::fft", "torch::torch_fft_fft")
  )
Infos:
R version 4.2.0 (2022-04-22)
Platform: x86_64-suse-linux-gnu (64-bit)
Running under: openSUSE Tumbleweed
Matrix products: default
BLAS: /usr/lib64/R/lib/libRblas.so
LAPACK: /usr/lib64/R/lib/libRlapack.so
locale:
[1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C LC_TIME=pt_BR.UTF-8 LC_COLLATE=en_US.UTF-8
[5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8 LC_PAPER=en_US.UTF-8 LC_NAME=C
[9] LC_ADDRESS=C LC_TELEPHONE=C LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] ggplot2_3.3.6 reticulate_1.25 torch_0.8.0 bench_1.1.2
loaded via a namespace (and not attached):
[1] Rcpp_1.0.8.3 here_1.0.1 lattice_0.20-45 prettyunits_1.1.1 png_0.1-7
[6] ps_1.7.0 rprojroot_2.0.3 digest_0.6.29 utf8_1.2.2 R6_2.5.1
[11] pillar_1.7.0 rlang_1.0.2 rstudioapi_0.13 rextendr_0.2.0.9000 callr_3.7.0
[16] Matrix_1.4-1 labeling_0.4.2 desc_1.4.1 devtools_2.4.3 stringr_1.4.0
[21] bit_4.0.4 munsell_0.5.0 compiler_4.2.0 xfun_0.31 pkgconfig_2.0.3
[26] pkgbuild_1.3.1 tidyselect_1.1.2 tibble_3.1.7 roxygen2_7.2.0 fansi_1.0.3
[31] crayon_1.5.1 dplyr_1.0.9 withr_2.5.0 brio_1.1.3 commonmark_1.8.0
[36] grid_4.2.0 jsonlite_1.8.0 gtable_0.3.0 lifecycle_1.0.1 magrittr_2.0.3
[41] coro_1.0.2 scales_1.2.0 cli_3.3.0 stringi_1.7.6 cachem_1.0.6
[46] farver_2.1.0 fs_1.5.2 remotes_2.4.2 testthat_3.1.4 xml2_1.3.3
[51] ellipsis_0.3.2 generics_0.1.2 vctrs_0.4.1 tools_4.2.0 bit64_4.0.5
[56] glue_1.6.2 purrr_0.3.4 processx_3.5.3 pkgload_1.2.4 fastmap_1.1.0
[61] colorspace_2.0-3 sessioninfo_1.2.2 memoise_2.0.1 knitr_1.39 usethis_2.1.5
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Address sizes: 39 bits physical, 48 bits virtual
Byte Order: Little Endian
CPU(s): 4
On-line CPU(s) list: 0-3
Vendor ID: GenuineIntel
Model name: Intel(R) Core(TM) i5-4440 CPU @ 3.10GHz
I compared a little with some old measurements, and your numbers look quite high. Look here: https://github.com/ejmahler/RustFFT/pull/65#issuecomment-820276070
For example, at length 1024 I measured about 7 us per iteration, while you get something like 200 us. And at length 131072 I get about 2 ms, while you have something like 5 ms. My old numbers are from a Ryzen laptop CPU, which shouldn't be too different from your i5.
Can you show the code you use to perform the fft (with rustfft) for every iteration?
Here's what I get for 1024 and 131072 respectively:
The Rust code is below. It uses the extendr crate to convert to and from R. Please let me know if you have any more questions or if you want me to share the repo with you.
lib.rs
use extendr_api::prelude::*;
use std::path::Path;

mod misc;
mod stft;

#[extendr]
pub fn fft(complexes: Complexes) -> Complexes {
    stft::fft(complexes)
}

// needed to register the function here to call it from R
extendr_module! {
    mod audior;
    fn load;
    //fn stream;
    fn to_mono;
    fn get_duration;
    fn get_samplerate;
    fn resample;
    fn realfft;
    fn fft;
}
stft.rs
use rustfft::FftPlanner;
// note: merged the two extendr_api use statements, which previously
// imported Complexes twice and would not compile
use extendr_api::{Complexes, Doubles, RMatrix, scalar::Rcplx};
use ndarray::{Array2, ArrayView2, Axis};
use num::Complex;
use std::convert::From;

// just a wrapper to implement the From trait
pub struct VecNumComplex<T> {
    pub data: Vec<Complex<T>>,
}

impl From<Complexes> for VecNumComplex<f64> {
    fn from(complexes: Complexes) -> Self {
        VecNumComplex {
            data: complexes.iter().map(|x| Complex { re: x.re().0, im: x.im().0 }).collect(),
        }
    }
}

impl From<VecNumComplex<f64>> for Complexes {
    fn from(vecnumcomplex: VecNumComplex<f64>) -> Self {
        vecnumcomplex.data.iter().map(|x| Rcplx::from((x.re, x.im))).collect::<Complexes>()
    }
}

pub fn fft(complexes: Complexes) -> Complexes {
    let length = complexes.len();
    let mut planner = FftPlanner::<f64>::new();
    let fft = planner.plan_fft_forward(length);
    let mut indata: Vec<Complex<f64>> = VecNumComplex::from(complexes).data;
    fft.process(&mut indata);
    Complexes::from(VecNumComplex { data: indata })
    // check how multichannel fft works for librosa
}
Cargo.toml
[package]
name = 'audior'
version = '0.1.0'
edition = '2018'
[lib]
crate-type = [ 'staticlib' ]
[dependencies]
extendr-api = { version = '0.3.1', features = ["ndarray"] }
claxon = '0.4.3'
hound = '3.4.0'
ndarray = '0.15.4'
rubato = '0.12.0'
symphonia = '0.5.0'
realfft = '3.0.0'
num = '0.4.0'
rustfft = '6.0.1'
Ok, you do all of this in every iteration then:
let mut planner = FftPlanner::<f64>::new();
let fft = planner.plan_fft_forward(length);
...
fft.process(&mut indata);
The first two steps are expensive and should only be done once. Then you reuse that fft instance for every forward fft of the same length. Doing all of this in every iteration really slows it down. Torch may be doing some clever caching behind the scenes that gets around this.
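The plan-once-and-reuse idea can be sketched with plain std types. `Plan` and `PlannerCache` below are illustrative stand-ins for rustfft's `Arc<dyn Fft<f64>>` and a per-length plan cache, not real rustfft internals:

```rust
use std::collections::HashMap;
use std::rc::Rc;

// Stand-in for a planned FFT instance (rustfft would return Arc<dyn Fft<f64>>).
struct Plan {
    len: usize,
}

struct PlannerCache {
    plans: HashMap<usize, Rc<Plan>>,
    setups: usize, // counts how often the expensive setup actually ran
}

impl PlannerCache {
    fn new() -> Self {
        PlannerCache { plans: HashMap::new(), setups: 0 }
    }

    // Expensive setup runs only on a cache miss; repeat calls for the
    // same length return the already-built plan.
    fn plan(&mut self, len: usize) -> Rc<Plan> {
        if let Some(p) = self.plans.get(&len) {
            return Rc::clone(p);
        }
        self.setups += 1; // the costly part: building twiddles, choosing radices, ...
        let p = Rc::new(Plan { len });
        self.plans.insert(len, Rc::clone(&p));
        p
    }
}

fn main() {
    let mut cache = PlannerCache::new();
    for _ in 0..100 {
        let plan = cache.plan(1024); // setup happens once, not 100 times
        assert_eq!(plan.len, 1024);
    }
    assert_eq!(cache.setups, 1);
}
```

With this pattern, a 100-iteration benchmark pays the planning cost once instead of 100 times, which is what the numbers in the linked comment assume.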
Here's what I get when doing only 1 iteration per input size; in that case I think it's fair to redo those first two lines at every step. Torch is now a little closer!
It seems that there really is some caching going on here. I will check what I can do to improve on this aspect. I also don't know whether repeating the same value for the entire input enables some other caching inside torch.
code used:
results <- bench::press(
  n = seq(17, 30000, 199),
  {
    x = rep(1+1i, n)
    py$x = x
    y = torch_tensor(x)
    py_run_string("import numpy as np", convert = FALSE)
    mark(
      audior::fft(x),
      stats::fft(x),
      torch::torch_fft_fft(y),
      py_eval("np.fft.fft(x)", convert = FALSE),
      iterations = 1,
      check = FALSE
    )
  }
)

ggplot(results) +
  geom_point(aes(x = n, y = median, color = as.character(expression))) +
  scale_colour_discrete(
    name = "implementation",
    labels = c("rustfft", "numpy.fft.fft (via reticulate)", "stats::fft", "torch::torch_fft_fft")
  )
> Here's what I get when doing only 1 iteration per input size; in that case I think it's fair to redo those first two lines at every step. Torch is now a little closer!
True, but you could still reuse the planner between lengths, since it can reuse some internal data even when constructing an FFT of a new length. Note that with only a single iteration, the time you measure (for most lengths, at least) is dominated by the setup time for the FFT. Is that how you will be using the FFT? Or will you be doing many FFTs of the same length?
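Why a shared planner helps even across lengths can be sketched as follows: mixed-radix FFTs are built out of smaller FFTs, so a planner that caches per-length plans can reuse them as building blocks for new lengths. The toy `Planner` below uses a deliberately simplified halving factorization (std-only, not rustfft's real internals) to illustrate the effect:

```rust
use std::collections::HashMap;
use std::rc::Rc;

#[allow(dead_code)]
struct Plan {
    len: usize,
    inner: Vec<Rc<Plan>>, // sub-FFTs this plan is composed of
}

struct Planner {
    cache: HashMap<usize, Rc<Plan>>,
    built: usize, // how many plans were constructed from scratch
}

impl Planner {
    fn new() -> Self {
        Planner { cache: HashMap::new(), built: 0 }
    }

    fn plan(&mut self, len: usize) -> Rc<Plan> {
        if let Some(p) = self.cache.get(&len) {
            return Rc::clone(p);
        }
        self.built += 1;
        // Toy factorization: an even length recurses into a half-size FFT,
        // which may already be cached from an earlier call.
        let inner = if len > 1 && len % 2 == 0 {
            vec![self.plan(len / 2)]
        } else {
            vec![]
        };
        let p = Rc::new(Plan { len, inner });
        self.cache.insert(len, Rc::clone(&p));
        p
    }
}

fn main() {
    let mut planner = Planner::new();
    planner.plan(1024);         // builds plans for 1024, 512, ..., 1 (11 plans)
    let before = planner.built;
    planner.plan(2048);         // reuses the cached 1024 chain; builds only 1 new plan
    assert_eq!(planner.built, before + 1);
}
```

A fresh planner per call would rebuild the entire chain every time; the shared planner only pays for the genuinely new length.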
Definitely the second case. I will try to work on the cache to see if I improve both rustfft and realfft and come back here with more comparisons.
Now I did some caching using the cached crate. It looks like I am now very close to what you got.
For n = 1024:
For n = 131072:
For n from 17 to 30000 in steps of 299 (17, 316, 615...):
For n powers of 2 (2, 4, 8, 16..., 1048576):
I actually don't know if this torch comparison is honest, since I am not converting back to R after it returns its transformed tensor. I don't know which setup would be a fair comparison. If I do the input conversion to a tensor and the output conversion to an R array inside the benchmark's measured expression, torch's performance degrades a lot (the difference between purple and red in the curve below).
code used for caching:
use std::sync::Arc;
use cached::proc_macro::cached;
use rustfft::{Fft, FftPlanner};

pub fn fft(complexes: Complexes) -> Complexes {
    let length = complexes.len();
    let fft = create_planner(length);
    let mut indata: Vec<Complex<f64>> = VecNumComplex::from(complexes).data;
    fft.process(&mut indata);
    Complexes::from(VecNumComplex { data: indata })
}

// memoized by length: the planner setup runs once per distinct input size
#[cached]
pub fn create_planner(length: usize) -> Arc<dyn Fft<f64>> {
    let mut planner = FftPlanner::<f64>::new();
    planner.plan_fft_forward(length)
}
It's a bit tricky to compare. I guess what you need in your R application is to transform an R array and get the result as another R array? And you do the benchmarks to figure out which FFT library is the best choice for you? If both of those are correct, then the conversions to and from R should be included in all the benchmarks. It doesn't help you if, for example, Torch transforms super fast but then takes a very long time to turn the result into something you can use.
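The fair-timing point can be sketched with `std::time::Instant`. Here `to_native`, `compute`, and `to_r` are hypothetical stand-ins for the R-to-tensor conversion, the FFT itself, and the tensor-to-R conversion; the structure, not the arithmetic, is what matters:

```rust
use std::time::Instant;

// Stand-in conversions and kernel; real code would convert R vectors
// to/from a library's native container and run the FFT in between.
fn to_native(xs: &[f64]) -> Vec<f64> { xs.to_vec() }
fn compute(xs: &[f64]) -> Vec<f64> { xs.iter().map(|x| x * 2.0).collect() }
fn to_r(xs: Vec<f64>) -> Vec<f64> { xs }

fn main() {
    let input = vec![1.0; 1 << 16];

    // Kernel-only timing: conversion happens outside the measured region,
    // which flatters any library whose native container is cheap to compute on.
    let native = to_native(&input);
    let t = Instant::now();
    let out = compute(&native);
    let kernel_only = t.elapsed();

    // End-to-end timing: convert in, compute, convert out, all measured.
    // This is the cost an R caller actually pays.
    let t = Instant::now();
    let out2 = to_r(compute(&to_native(&input)));
    let end_to_end = t.elapsed();

    assert_eq!(out.len(), out2.len());
    println!("kernel: {:?}, end-to-end: {:?}", kernel_only, end_to_end);
}
```

Putting all four implementations behind the same end-to-end shape (R array in, R array out) is what makes the bench::press numbers comparable.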
Thanks for the tips! I will let you know if I find other strange things in these benchmarks I am doing.
Hi! I am porting some audio code to R and decided to benchmark rustfft against some other well-known implementations.
I used the extendr crate to port rustfft to R and the reticulate package to run numpy from R. To my surprise, the torch (CPU, no GPU) implementation seems to do better than all the others for n > 5000 points. Do you have any idea why Torch's implementation would be faster than the others?
code: