PyO3 / pyo3

Rust bindings for the Python interpreter
https://pyo3.rs
Apache License 2.0
12.29k stars 758 forks source link

Avoiding struct method call overhead of `extract_pyclass_ref_mut` #3843

Open Thell opened 8 months ago

Thell commented 8 months ago

Firstly, thanks for PyO3, it's great!

This issue is being opened following a [chat post on gitter].(https://matrix.to/#/!AAhjIWoaKSExrkkhlG:gitter.im/$9gOL75NMgSzKVrKugcQUMBUFcy5QMpLvxuJafAHLieU?via=gitter.im&via=matrix.org&via=nitro.chat)

In short, consider these performance ordered cargo bench results:

running 5 tests
test tests::bench_xoshiro_struct       ... bench:  56,930,651 ns/iter (+/- 145,472)
test tests::bench_lcg_struct           ... bench:  86,424,781 ns/iter (+/- 337,028)
test tests::bench_lcg_static_lazy      ... bench:  86,614,389 ns/iter (+/- 781,944)
test tests::bench_lcg_static           ... bench:  87,029,922 ns/iter (+/- 353,495)
test tests::bench_xoshiro_static_lazy  ... bench: 108,713,707 ns/iter (+/- 1,500,070)

and contrast that to these timeit results:

lcg_static:            2.2382071590000123
lcg_static_lazy:       2.2475717499983148
xoshiro_static_lazy:   2.333924032998766
lcg_struct:            2.603333091999957
xoshiro_struct:        2.6484997750012553

See something odd there? 🤓 Hey structs! What is going on!?!? Sure we have overhead but all of these test functions are returning a simple type to python and, yeah, the structs have to be mutable but for xoshiro_struct to go from being about twice as fast as the lazy to about 14% slower is unexpected, and to see both of the 'struct' versions leading in the Rust benches to trailing in the Python timeits is unexpected.

Running some long pytest loops on these to get enough samples to see what's going on reveals that our two structs are getting penalized for being structs:

image

The question is can this be avoided through some hint/annotation to short-circuit that extract/type-info stack?

Here's one of the minimal structs:

#[pyclass]
pub struct XoshiroStruct {
    state: Xoshiro256Plus,
}

impl XoshiroStruct {
    fn next_state(&mut self) -> u64 {
        self.state.next_u64()
    }
}

#[pymethods]
impl XoshiroStruct {
    #[new]
    pub fn new() -> Self {
        Self {
            state: Xoshiro256Plus::from_entropy(),
        }
    }

    pub fn do_something(&mut self) -> u64 {
        do_it(self.next_state())
    }
}

Here's the profile: pytest 2024-02-15 22.59 profile.json.gz

And here's a repo with these examples: https://github.com/Thell/struct_perf

adamreichold commented 8 months ago

Generally speaking, some amount of overhead is unavoidable due pervasive shared mutability applied in Python which requires us to use interior mutability patterns to enable safe Rust code from obtaining a &mut Self at all.

You might be able to reduce that overhead by opting out of dynamic borrow check by marking your pyclass as frozen using #[pyclass(frozen)] (check the guide for details). This does imply that do_something must take &self and hence you need to manage the interior mutability yourself. But since you seem to be benchmarking small PRNG, using std::cell::Cell should be possible with doing reference checking, e.g.

let mut state = self.state.get();
let value = self.next_state();
self.state.set(state);
Thell commented 8 months ago

Thanks for the reply @adamreichold. Unfortunately that doesn't seem to have negated the "extract" cost, although it is now extract_pyclass_ref instead of extract_pyclass_ref_mut. :)

The image cut off the struct declaration:

#[pyclass(frozen)]
pub struct XoshiroStruct {
    state: Cell<Xoshiro256Plus>,
}

image

note: if anyone else tries to do Cell<Xoshiro256Plus> be warned you'll need to use a local rand_xoshiro and add the impl for Copy since it only has Clone in the official repo.

davidhewitt commented 8 months ago

So from your flamegraphs it looks like the time is dominated by retrieval of the type object, which in general is a necessary part of the type check. I've got a couple of thoughts here:

  1. @Thell can you try using lto = true in your Cargo.toml if you are not already doing so? I wonder if more aggressive inlining can help here.
  2. I wonder if this is a hint that we should be exploring storing these type objects within in the module state; the "lazy type object" machinery seems to be doing a lot of work. (This would be a bigger refactor of PyO3).
  3. We don't currently have a flag to opt out of the type check. Depending on the protection that CPython gives us, this type check may actually not be needed at all.

@samuelcolvin might have hit the same thing when observing "methods overhead" in #3827

Thell commented 8 months ago

@davidhewitt It's set already.