PyO3 / pyo3

Rust bindings for the Python interpreter
https://pyo3.rs
Apache License 2.0
12.29k stars 758 forks source link

Performance issue when calling rust function in python #3787

Closed richecr closed 3 weeks ago

richecr commented 9 months ago

I have created this simple function:

use pyo3::prelude::*;
use std::time::{Instant};

#[pyfunction]
fn rust_sleep() -> i32 {
    let start = Instant::now();
    let num = 1 + 1;
    let duration = start.elapsed();
    println!("{:?}", duration);
    num
}

#[pymodule]
fn pythonicsqlrust(_py: Python, m: &PyModule) -> PyResult<()> {
    m.add_function(wrap_pyfunction!(rust_sleep, m)?)?;
    Ok(())
}
import time
import pythonicsqlrust

def main():
    s = time.time_ns()
    print(pythonicsqlrust.rust_sleep())
    e = time.time_ns()
    print(e - s)

main()

In pure rust it takes ~60ns and when called in python ~22350ns. Can you help understand why this happens? (is slower than running in pure python)

davidhewitt commented 9 months ago

Thanks @richecr for the report.

TLDR:


First, I'll assume that when you say "is slower than running in pure Python" that you're testing with against this Python implementation:

def py_sleep():
    start = time.time_ns()
    num = 1 + 1
    duration = time.time_ns() - start
    print(duration)
    return num

There's several different factors coming into play. Let's try to break these down:

In the end, let's end up with this code:

use pyo3::prelude::*;
use std::time::Instant;

#[pyfunction]
fn rust_sleep() -> i32 {
    let start = Instant::now();
    let num = 1 + 1;
    let duration = start.elapsed();
    println!("{}", duration.as_nanos());
    num
}

#[pymodule]
fn pyo3_scratch(_py: Python, m: &PyModule) -> PyResult<()> {
    m.add_function(wrap_pyfunction!(rust_sleep, m)?)?;
    Ok(())
}
import time
from timeit import timeit
import pyo3_scratch

def py_sleep():
    s = time.time_ns()
    x = 1 + 1
    e = time.time_ns()
    print(e - s)
    return x

# run some warmups
pyo3_scratch.rust_sleep()
py_sleep()

# measure average duration of 1 million calls
N = 1_000_000
py = timeit("py_sleep()", setup="from __main__ import py_sleep", number=N) / N
rust = timeit("rust_sleep()", setup="from pyo3_scratch import rust_sleep", number=N) / N

# report final timings
print("py", py)
print("rust", rust)

Now, running this, I get the following output:

py 4.560636853999767e-06
rust 4.483858657999917e-06

There is still volatility in these numbers; sometimes Rust is a little slower than Python, sometimes Rust is a little faster. Overall, both are reporting around 4.5us on my machine, which to me makes me assume the work on both languages is dominated here by the system-level operations: timing measurements and writing to stdout.


Let's try to measure the call overhead more precisely by making both of these functions into noops:

use pyo3::prelude::*;

#[pyfunction]
fn rust_sleep() {}

#[pymodule]
fn pyo3_scratch(_py: Python, m: &PyModule) -> PyResult<()> {
    m.add_function(wrap_pyfunction!(rust_sleep, m)?)?;
    Ok(())
}
from timeit import timeit
import pyo3_scratch

def py_sleep():
    return

# run some warmups
pyo3_scratch.rust_sleep()
py_sleep()

# measure average duration of 1 million calls
N = 1_000_000
py = timeit("py_sleep()", setup="from __main__ import py_sleep", number=N) / N
rust = timeit("rust_sleep()", setup="from pyo3_scratch import rust_sleep", number=N) / N

print("py", py)
print("rust", rust)

Now on my machine I get less volatility, and pure-Python shows an edge:

py 1.899612299985165e-08
rust 2.5328653000087797e-08

What we see is that calling the noop Python function is measuring as taking 18ns, and 25ns to call into Rust. This 7ns slowdown is a fairer estimate of the slowdown which PyO3 currently exhibits over pure-Python function calls.


Finally, let's go a step further and estimate what PyO3 could look like with the overheads which we're working to remove in PyO3 0.21. I'll apply the following diff to current PyO3 main, which is a crude way to disable framework-level overheads we're working to remove in #3382:

diff --git a/src/impl_/trampoline.rs b/src/impl_/trampoline.rs
index 4b4eac17a..2664d7598 100644
--- a/src/impl_/trampoline.rs
+++ b/src/impl_/trampoline.rs
@@ -174,8 +174,9 @@ where
     R: PyCallbackOutput,
 {
     let trap = PanicTrap::new("uncaught panic at ffi boundary");
-    let pool = unsafe { GILPool::new() };
-    let py = pool.python();
+    // let pool = unsafe { GILPool::new() };
+    // let py = pool.python();
+    let py = unsafe { Python::assume_gil_acquired() };
     let out = panic_result_into_callback_output(
         py,
         panic::catch_unwind(move || -> PyResult<_> { body(py) }),

(NB do not attempt to apply the above diff and run this for real world code in production. Until PyO3 is fully transitioned to the new API, the GILPool is a fundamental part of correct operation of PyO3.)

This reverses the situation. For the noop function calls, once we've sorted out this framework-level overhead, calling a noop PyO3 function will be faster than calling a noop pure-Python one, by about ~4.5ns on my machine:

py 1.5801341000042156e-08
rust 1.1012634000508115e-08
Paulo-21 commented 8 months ago

Hello, I have the same performance issue with a lot of gap. image

tic = time.perf_counter()
raw_features = af_reader_py.compute_features(af_path, page_rank, degree_centrality, in_degrees, out_degrees, 10000, 0.00001 )
print("Python wrappe the function : ", time.perf_counter()-tic, " sec")

we can see that the rust code take 6.8 sec and it take nearly 10 sec to return the output to python.

fn compute_features(file_path : &str, page_rank : Vec<f64>, degree_centrality: Vec<f64>,  in_degree: Vec<f64>, out_degree: Vec<f64>, iter:usize, tol : f64)-> PyResult<Vec<[f64;11]>> {
    let start = Instant::now();
    let edge = reading_cnf_for_rustworkx(file_path);
    let (hcat, card, noselfatt, maxbased, gr, ) = reading_cnf_with_semantics(file_path);
    let g = petgraph::graph::DiGraph::<u32, ()>::from_edges(&edge);
    let eig = eigenvector_centrality(&g,  |_| {Ok::<f64,f64>(1.)}, Some(iter), Some(tol)).unwrap().unwrap();
    let coloring = greedy_node_color(&g);
    let mut raw_features = Vec::with_capacity(g.node_count());
    for node in 0..page_rank.len() {
        raw_features.push([
            coloring[node] as f64,page_rank[node],
            degree_centrality[node], eig[node],
            in_degree[node],out_degree[node],
            hcat[node],card[node],noselfatt[node],
            maxbased[node],gr[node]
        ]);
    }
    println!("Inside Rust {} ms", start.elapsed().as_millis());
    //.into()
    Ok(raw_features)
}

Did i do somethings wrong or it due to the pyo3 implementation ? ty

birkenfeld commented 8 months ago

Did i do somethings wrong or it due to the pyo3 implementation ? ty

Keep in mind that the Vecs you pass in and out of the function have to be converted between Python and Rust representation. How many elements are in them, roughly?

davidhewitt commented 8 months ago

@Paulo-21 further to the above, you may want to consider using rust-numpy to convert Rust matrix types directly to numpy arrays and avoid the list-of-list-of-floats. Should be significantly faster.

Paulo-21 commented 8 months ago

Did i do somethings wrong or it due to the pyo3 implementation ? ty

Keep in mind that the Vecs you pass in and out of the function have to be converted between Python and Rust representation. How many elements are in them, roughly?

Yes i understand, it's roughly 2.5 Million elements.

@Paulo-21 further to the above, you may want to consider using rust-numpy to convert Rust matrix types directly to numpy arrays and avoid the list-of-list-of-floats. Should be significantly faster.

Thank you, i will give a try !

davidhewitt commented 3 weeks ago

With 0.22 (without the GIL Refs feature) and also on the upcoming 0.23, we now have the changes I mentioned above completed, and I consistently measure calling a noop PyO3 function as faster than calling a noop Python one.

I will close this issue, I'm sure we will yet find more cases to optimise in future, they can be new issues.