Performance issue when calling rust function in python

richecr commented 9 months ago

I have created this simple function:

use pyo3::prelude::*;
use std::time::{Instant};

#[pyfunction]
fn rust_sleep() -> i32 {
    let start = Instant::now();
    let num = 1 + 1;
    let duration = start.elapsed();
    println!("{:?}", duration);
    num
}

#[pymodule]
fn pythonicsqlrust(_py: Python, m: &PyModule) -> PyResult<()> {
    m.add_function(wrap_pyfunction!(rust_sleep, m)?)?;
    Ok(())
}

import time
import pythonicsqlrust

def main():
    s = time.time_ns()
    print(pythonicsqlrust.rust_sleep())
    e = time.time_ns()
    print(e - s)

main()

In pure rust it takes ~60ns and when called in python ~22350ns. Can you help understand why this happens? (is slower than running in pure python)

davidhewitt commented 9 months ago

Thanks @richecr for the report.

TLDR:

Yes, at the moment PyO3 does introduce some framework-level overheads which make a PyO3 function call just slightly slower than a Python function call. We intend to remove this. Nevertheless, this overhead is tiny, on the nanosecond scale.
Despite this, you should not be surprised if for more complex functions the cost of converting Python types into Rust types continues to show a penalty.
For real-world workloads, you should find that the execution speed of the Rust code outstrips the Python equivalent. There are many blog posts showcasing examples of this, some which can be found on the PyO3 README.

First, I'll assume that when you say "is slower than running in pure Python" that you're testing with against this Python implementation:

def py_sleep():
    start = time.time_ns()
    num = 1 + 1
    duration = time.time_ns() - start
    print(duration)
    return num

There's several different factors coming into play. Let's try to break these down:

In both cases you are making a Python function call. In one case, that runs native Rust code, in the other case, it runs interpreted Python. There is a fundamental baseline cost to running a Python function call.
Measuring these kind of microbenchmarks is really hard. There will be a lot of volatility in these measurements from lots of hardware-level effects, such as CPU branch prediction, frequency boosting, cache locality etc. Let's take a large sample and average that to get a truer measurement.
As you're printing to stdout and also reading system time, it's quite possible there will be some one-time initialization, so let's run a warmup to eliminate those.
Let's check your Rust compile settings. You should be using --release build at a minimum.
The Rust is doing a tiny bit of extra formatting for the output to compute the scale of the duration and add ns. Let's make it a bit fairer by taking that out for now.

In the end, let's end up with this code:

use pyo3::prelude::*;
use std::time::Instant;

#[pyfunction]
fn rust_sleep() -> i32 {
    let start = Instant::now();
    let num = 1 + 1;
    let duration = start.elapsed();
    println!("{}", duration.as_nanos());
    num
}

#[pymodule]
fn pyo3_scratch(_py: Python, m: &PyModule) -> PyResult<()> {
    m.add_function(wrap_pyfunction!(rust_sleep, m)?)?;
    Ok(())
}

import time
from timeit import timeit
import pyo3_scratch

def py_sleep():
    s = time.time_ns()
    x = 1 + 1
    e = time.time_ns()
    print(e - s)
    return x

# run some warmups
pyo3_scratch.rust_sleep()
py_sleep()

# measure average duration of 1 million calls
N = 1_000_000
py = timeit("py_sleep()", setup="from __main__ import py_sleep", number=N) / N
rust = timeit("rust_sleep()", setup="from pyo3_scratch import rust_sleep", number=N) / N

# report final timings
print("py", py)
print("rust", rust)

Now, running this, I get the following output:

py 4.560636853999767e-06
rust 4.483858657999917e-06

There is still volatility in these numbers; sometimes Rust is a little slower than Python, sometimes Rust is a little faster. Overall, both are reporting around 4.5us on my machine, which to me makes me assume the work on both languages is dominated here by the system-level operations: timing measurements and writing to stdout.

Let's try to measure the call overhead more precisely by making both of these functions into noops:

use pyo3::prelude::*;

#[pyfunction]
fn rust_sleep() {}

#[pymodule]
fn pyo3_scratch(_py: Python, m: &PyModule) -> PyResult<()> {
    m.add_function(wrap_pyfunction!(rust_sleep, m)?)?;
    Ok(())
}

from timeit import timeit
import pyo3_scratch

def py_sleep():
    return

# run some warmups
pyo3_scratch.rust_sleep()
py_sleep()

# measure average duration of 1 million calls
N = 1_000_000
py = timeit("py_sleep()", setup="from __main__ import py_sleep", number=N) / N
rust = timeit("rust_sleep()", setup="from pyo3_scratch import rust_sleep", number=N) / N

print("py", py)
print("rust", rust)

Now on my machine I get less volatility, and pure-Python shows an edge:

py 1.899612299985165e-08
rust 2.5328653000087797e-08

What we see is that calling the noop Python function is measuring as taking 18ns, and 25ns to call into Rust. This 7ns slowdown is a fairer estimate of the slowdown which PyO3 currently exhibits over pure-Python function calls.

Finally, let's go a step further and estimate what PyO3 could look like with the overheads which we're working to remove in PyO3 0.21. I'll apply the following diff to current PyO3 main, which is a crude way to disable framework-level overheads we're working to remove in #3382:

diff --git a/src/impl_/trampoline.rs b/src/impl_/trampoline.rs
index 4b4eac17a..2664d7598 100644
--- a/src/impl_/trampoline.rs
+++ b/src/impl_/trampoline.rs
@@ -174,8 +174,9 @@ where
     R: PyCallbackOutput,
 {
     let trap = PanicTrap::new("uncaught panic at ffi boundary");
-    let pool = unsafe { GILPool::new() };
-    let py = pool.python();
+    // let pool = unsafe { GILPool::new() };
+    // let py = pool.python();
+    let py = unsafe { Python::assume_gil_acquired() };
     let out = panic_result_into_callback_output(
         py,
         panic::catch_unwind(move || -> PyResult<_> { body(py) }),

(NB do not attempt to apply the above diff and run this for real world code in production. Until PyO3 is fully transitioned to the new API, the GILPool is a fundamental part of correct operation of PyO3.)

This reverses the situation. For the noop function calls, once we've sorted out this framework-level overhead, calling a noop PyO3 function will be faster than calling a noop pure-Python one, by about ~4.5ns on my machine:

py 1.5801341000042156e-08
rust 1.1012634000508115e-08

Paulo-21 commented 8 months ago

Hello, I have the same performance issue with a lot of gap.

tic = time.perf_counter()
raw_features = af_reader_py.compute_features(af_path, page_rank, degree_centrality, in_degrees, out_degrees, 10000, 0.00001 )
print("Python wrappe the function : ", time.perf_counter()-tic, " sec")

we can see that the rust code take 6.8 sec and it take nearly 10 sec to return the output to python.

fn compute_features(file_path : &str, page_rank : Vec<f64>, degree_centrality: Vec<f64>,  in_degree: Vec<f64>, out_degree: Vec<f64>, iter:usize, tol : f64)-> PyResult<Vec<[f64;11]>> {
    let start = Instant::now();
    let edge = reading_cnf_for_rustworkx(file_path);
    let (hcat, card, noselfatt, maxbased, gr, ) = reading_cnf_with_semantics(file_path);
    let g = petgraph::graph::DiGraph::<u32, ()>::from_edges(&edge);
    let eig = eigenvector_centrality(&g,  |_| {Ok::<f64,f64>(1.)}, Some(iter), Some(tol)).unwrap().unwrap();
    let coloring = greedy_node_color(&g);
    let mut raw_features = Vec::with_capacity(g.node_count());
    for node in 0..page_rank.len() {
        raw_features.push([
            coloring[node] as f64,page_rank[node],
            degree_centrality[node], eig[node],
            in_degree[node],out_degree[node],
            hcat[node],card[node],noselfatt[node],
            maxbased[node],gr[node]
        ]);
    }
    println!("Inside Rust {} ms", start.elapsed().as_millis());
    //.into()
    Ok(raw_features)
}

Did i do somethings wrong or it due to the pyo3 implementation ? ty

birkenfeld commented 8 months ago

Did i do somethings wrong or it due to the pyo3 implementation ? ty

Keep in mind that the Vecs you pass in and out of the function have to be converted between Python and Rust representation. How many elements are in them, roughly?

davidhewitt commented 8 months ago

@Paulo-21 further to the above, you may want to consider using rust-numpy to convert Rust matrix types directly to numpy arrays and avoid the list-of-list-of-floats. Should be significantly faster.

Paulo-21 commented 8 months ago

Did i do somethings wrong or it due to the pyo3 implementation ? ty

Keep in mind that the Vecs you pass in and out of the function have to be converted between Python and Rust representation. How many elements are in them, roughly?

Yes i understand, it's roughly 2.5 Million elements.

@Paulo-21 further to the above, you may want to consider using rust-numpy to convert Rust matrix types directly to numpy arrays and avoid the list-of-list-of-floats. Should be significantly faster.

Thank you, i will give a try !

davidhewitt commented 3 weeks ago

With 0.22 (without the GIL Refs feature) and also on the upcoming 0.23, we now have the changes I mentioned above completed, and I consistently measure calling a noop PyO3 function as faster than calling a noop Python one.

I will close this issue, I'm sure we will yet find more cases to optimise in future, they can be new issues.

PyO3 / pyo3

Performance issue when calling rust function in python #3787