Open Thell opened 8 months ago
Generally speaking, some amount of overhead is unavoidable due to the pervasive shared mutability in Python, which requires us to use interior mutability patterns for safe Rust code to obtain a &mut Self at all.
You might be able to reduce that overhead by opting out of the dynamic borrow check by marking your pyclass as frozen using #[pyclass(frozen)] (check the guide for details). This does imply that do_something must take &self, and hence you need to manage the interior mutability yourself. But since you seem to be benchmarking a small PRNG, using std::cell::Cell should be possible without doing any runtime borrow checking, e.g.
let mut state = self.state.get();
let value = state.next_state();
self.state.set(state);
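To make the three-line pattern above concrete, here is a minimal stand-alone sketch. It is not the code from this issue: `SplitMix64` and `Generator` are hypothetical stand-ins, and the `#[pyclass(frozen)]` and `#[pymethods]` attributes are left out so that it builds without the pyo3 crate. It only illustrates how a method taking `&self` can still advance a small `Copy` state through `Cell`.

```rust
use std::cell::Cell;

// Hypothetical stand-in for a small Copy PRNG state (not the rand_xoshiro type).
#[derive(Clone, Copy)]
struct SplitMix64 {
    x: u64,
}

impl SplitMix64 {
    // Advances the generator and returns the next value.
    fn next_u64(&mut self) -> u64 {
        self.x = self.x.wrapping_add(0x9E37_79B9_7F4A_7C15);
        let mut z = self.x;
        z = (z ^ (z >> 30)).wrapping_mul(0xBF58_476D_1CE4_E5B9);
        z = (z ^ (z >> 27)).wrapping_mul(0x94D0_49BB_1331_11EB);
        z ^ (z >> 31)
    }
}

// In the PyO3 setting this struct would carry #[pyclass(frozen)]; the
// attribute is omitted here (assumption) to keep the sketch dependency-free.
struct Generator {
    state: Cell<SplitMix64>,
}

impl Generator {
    fn new(seed: u64) -> Self {
        Generator { state: Cell::new(SplitMix64 { x: seed }) }
    }

    // Takes &self, as a method on a frozen pyclass must.
    fn do_something(&self) -> u64 {
        let mut state = self.state.get(); // copy the state out of the Cell
        let value = state.next_u64(); // advance the local copy
        self.state.set(state); // write the advanced state back
        value
    }
}

fn main() {
    let generator = Generator::new(42);
    let a = generator.do_something();
    let b = generator.do_something();
    println!("{} {}", a, b);
}
```

Because `Cell::get` copies the value out, the state type has to be `Copy`, and no borrow flag is read or written at runtime.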
Thanks for the reply @adamreichold. Unfortunately that doesn't seem to have negated the "extract" cost, although it is now extract_pyclass_ref instead of extract_pyclass_ref_mut. :)
The image cut off the struct declaration:
#[pyclass(frozen)]
pub struct XoshiroStruct {
state: Cell<Xoshiro256Plus>,
}
Note: if anyone else tries to use Cell<Xoshiro256Plus>, be warned that you'll need a local copy of rand_xoshiro with an added impl of Copy, since the type only implements Clone in the official repo.
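For readers who would rather not patch the crate, one possible alternative (an assumption, not something tried in this thread) is to wrap the state in `Cell<Option<T>>` and move it out with `Cell::take`, which only needs `Default` on the `Option`, not `Copy` on the state. `CloneOnlyRng` and `Generator` below are hypothetical stand-ins, and the sketch does not measure whether the extra `Option` check matters for the benchmark.

```rust
use std::cell::Cell;

// Hypothetical Clone-only state, standing in for a type without a Copy impl.
#[derive(Clone)]
struct CloneOnlyRng {
    x: u64,
}

impl CloneOnlyRng {
    // One xorshift64 step; the state must be non-zero.
    fn next_u64(&mut self) -> u64 {
        self.x ^= self.x << 13;
        self.x ^= self.x >> 7;
        self.x ^= self.x << 17;
        self.x
    }
}

struct Generator {
    state: Cell<Option<CloneOnlyRng>>,
}

impl Generator {
    fn new(seed: u64) -> Self {
        Generator { state: Cell::new(Some(CloneOnlyRng { x: seed })) }
    }

    fn next_value(&self) -> u64 {
        // Cell::take moves the value out and leaves None behind.
        let mut state = self.state.take().expect("state is always restored");
        let value = state.next_u64();
        self.state.set(Some(state)); // put the advanced state back
        value
    }
}

fn main() {
    let generator = Generator::new(1);
    println!("{}", generator.next_value());
}
```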
So from your flamegraphs it looks like the time is dominated by retrieval of the type object, which in general is a necessary part of the type check. I've got a couple of thoughts here:
- Can you try setting lto = true in your Cargo.toml, if you are not already doing so? I wonder if more aggressive inlining can help here.
- @samuelcolvin might have hit the same thing when observing "methods overhead" in #3827.
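For reference, the LTO setting mentioned above normally goes in a profile section of Cargo.toml. This fragment is a sketch that assumes the extension is built with the release profile. It shows only the one key from the suggestion and makes no claim about how much it helps here.

```toml
# Cargo.toml (fragment)
[profile.release]
lto = true  # link-time optimization across all crates in the dependency graph
```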
Firstly, thanks for PyO3, it's great!
This issue is being opened following a [chat post on gitter](https://matrix.to/#/!AAhjIWoaKSExrkkhlG:gitter.im/$9gOL75NMgSzKVrKugcQUMBUFcy5QMpLvxuJafAHLieU?via=gitter.im&via=matrix.org&via=nitro.chat).
In short, consider these performance-ordered cargo bench results:
and contrast that to these timeit results:
See something odd there? 🤓 Hey structs! What is going on!?!? Sure, we have overhead, but all of these test functions return a simple type to Python. And yes, the structs have to be mutable, but for xoshiro_struct to go from being about twice as fast as the lazy version to about 14% slower is unexpected. So is seeing both of the 'struct' versions go from leading in the Rust benches to trailing in the Python timeits.
Running some long pytest loops on these to get enough samples to see what's going on reveals that our two structs are getting penalized for being structs:
The question is: can this be avoided through some hint/annotation that short-circuits that extract/type-info stack?
Here's one of the minimal structs:
Here's the profile: pytest 2024-02-15 22.59 profile.json.gz
And here's a repo with these examples: https://github.com/Thell/struct_perf