Closed enavarro51 closed 6 months ago
In general there is a tradeoff we make with NodeIndices
and the other custom return types. When we have the custom return types in rustworkx this is in an effort to return faster and optimize for workloads where we iterate over the object once vs places we access the elements in the results multiple times.
Basically when you return Vec<usize>
what that ends up doing is building a new Python list and then copying and converting each element in from usize -> Python int. Which depending on the size of the list can be a lot of overhead. While NodeIndices
avoids that by wrapping the output vec in a custom pyclass so there is no copy/conversion overhead on return. The tradeoff there is that when you access any element of NodeIndices
it has to do the usize -> python int conversion. So if you loop over the values multiple times that ends up being slower (and in those cases I'd just do a list(indices)
to convert it to a python list in a single iteration.
Now that all being said we added those custom return types are something we added years ago, and PyO3 was much less mature back then. It is entirely possible that the overhead of creating the Python List[int]
isn't as bad anymore and the custom return type's overhead on access isn't worth it now with newer versions of pyo3.
The DAGDependency
code calls predecessor_indices
about once for each node and then iterates over those indices once. I assume the overhead is in the __getitem__
for NodeIndices
.
I have a pretty good test environment set up, so I can do some checks on just a single iteration of the indices for a large number of items.
I ran a non-scientific experiment on my machine and I believe the optimizations Matthew introduced still hold true.
Here is the code for getting the times:
import rustworkx as rx
import timeit
N = 1_000_000
g = rx.PyGraph()
g.add_nodes_from(list(range(N)))
nodes = g.node_indexes()
print(
"Getting the node indices or list from Rust a thousand times: ", timeit.timeit("g.node_indexes()", globals=globals(), number=1_000)
)
print(
"Accessing a negative index a billion times: ", timeit.timeit("nodes[-10]", globals=globals(), number=1_000_000_000)
)
I left node_indexes
as the original:
#[pyo3(text_signature = "(self)")]
pub fn node_indexes(&self) -> NodeIndices {
self.node_indices()
}
The results were:
Getting the node indices or list from Rust a thousand times: 1.6359028699998817
Accessing a negative index a billion times: 674.1754225720001
I modified the node_indexes
code to be:
#[pyo3(text_signature = "(self)")]
pub fn node_indexes(&self) -> Vec<usize> {
self.graph.node_indices().map(|node| node.index()).collect()
}
The results were:
Getting the node indices or list from Rust a thousand times: 32.542972673999884
Accessing a negative index a billion times: 28.010249427000417
Returning a Python list from Rust is slow. So to make a function return faster it makes sense.
However, accessing a specific element of the Rust object is very slow comparing to accessing a specific element of a Python list.
I think we should definitely profile the __getitem__
method though: https://github.com/Qiskit/rustworkx/blob/52a4d05afe4aa2dffa22f928cf89ba368c64abd6/src/iterators.rs#L540
We need to see where it spends time and maybe work with the maintainers of PyO3 to see if we can make it faster. The current overhead is very high
It turns out that having
[derive(FromPyObject)]
enum SliceOrInt {
Slice(PySlice),
Int(...),
}
means that PyO3 attempts to downcast to Slice
first, but like 99.9% of all __getitem__
calls are going to be with an int
(especially since the Python implicit-iterator behaviour is being used here).
Flipping the order in the enum for me switched nodes[-10]
(from Ivan's code block above) from taking 205ns to taking 61ns with no other changes (for comparison, indexing into the equivalent Python list took 25ns, but PyO3 will have some springboard cost associated with it too, some of which will be improved by PyO3 0.21).
I think another change to potentially consider is adding custom direct Iterator
structs into the custom_vec_iter_impl
that look like
#[pyclass]
struct NodeIndicesIter {
base: Py<NodeIndices>,
index: usize,
}
#[pymethods]
impl NodeIndicesIter {
fn __next__(&mut self, py: Python) -> PyResult<Py<PyAny>> {
// ...
}
fn __iter__(&self) -> Self { self }
which avoids any need to convert Python object inputs during the function inputs. I haven't timed that, but I could do if it's something you'd want to consider.
I think another change to potentially consider is adding custom direct Iterator structs into the custom_vec_iter_impl that look like which avoids any need to convert Python object inputs during the function inputs. I haven't timed that, but I could do if it's something you'd want to consider.
We have support like that with the mapping type macro right now (to get keys, values, and items iterators), so I think it'd be a good idea to add it to the vec ones too. I expect that will be a bit of a speed boost for a lot of cases since we avoid more python/pyo3 overhead by going through __getitem__
My unrelated idea to also speed things up here was maybe to cache the output PyObject
s on __getitem__
so that subsequent accesses don't eat the conversion cost. The tradeoff there is we're essentially doubling the memory overhead for these classes because we'd also need to store a Vec<Option<PyObject>>
of the same length in each object. So I'm not sure if it's worth it or not.
While working on
DAGDependencyV2
in Qiskit, I was usingdigraph.predecessor_indices
anddigraph.predecessors
and noticedpredecessors
was significantly faster thanpredecessor_indices
. Looking at thedigraph
code, it appearedpredecessor_indices
was simpler and therefore should have been faster.I noticed that
predecessors
returned aVec
andpredecessor_indices
returnedNodeIndices
. I modified the latter function to returnVec<usize>
and the tests I was running forDAGDependencyV2
went from 74 sec forNodeIndices
to 12 sec forVec<usize>
constructing an equivalentDAGDependencyV2
.Would appreciate any comments on why this is happening and whether this might be a wider problem.
Main code
Modified code