Qiskit / rustworkx

A high performance Python graph library implemented in Rust.
https://www.rustworkx.org
Apache License 2.0

Performance issue with `custom_vec_iter_impl!` `NodeIndices` #1090

Closed. enavarro51 closed this issue 6 months ago.

enavarro51 commented 7 months ago

While working on DAGDependencyV2 in Qiskit, I was using `digraph.predecessor_indices` and `digraph.predecessors` and noticed that `predecessors` was significantly faster than `predecessor_indices`. Looking at the digraph code, `predecessor_indices` appeared simpler and therefore should have been the faster of the two.

I noticed that `predecessors` returned a `Vec` and `predecessor_indices` returned `NodeIndices`. I modified the latter function to return `Vec<usize>`, and the tests I was running for DAGDependencyV2 went from 74 s with `NodeIndices` to 12 s with `Vec<usize>` when constructing an equivalent DAGDependencyV2.

Would appreciate any comments on why this is happening and whether this might be a wider problem.

Main code

    #[pyo3(text_signature = "(self, node, /)")]
    pub fn predecessor_indices(&self, node: usize) -> NodeIndices {
        NodeIndices {
            nodes: self
                .graph
                .neighbors_directed(NodeIndex::new(node), petgraph::Direction::Incoming)
                .map(|node| node.index())
                .collect(),
        }
    }

Modified code

    #[pyo3(text_signature = "(self, node, /)")]
    pub fn predecessor_indices(&self, node: usize) -> Vec<usize> {
        self.graph
            .neighbors_directed(NodeIndex::new(node), petgraph::Direction::Incoming)
            .map(|node| node.index())
            .collect()
    }
mtreinish commented 7 months ago

In general, there is a tradeoff we make with `NodeIndices` and the other custom return types. The custom return types in rustworkx are an effort to return faster and to optimize for workloads that iterate over the result once, versus workloads that access the elements of the result multiple times.

Basically, when you return `Vec<usize>`, what that ends up doing is building a new Python list and copying in each element, converting each `usize` to a Python int. Depending on the size of the list, that can be a lot of overhead. `NodeIndices` avoids that by wrapping the output vec in a custom pyclass, so there is no copy/conversion overhead on return. The tradeoff is that any time you access an element of `NodeIndices`, it has to do the `usize` -> Python int conversion. So if you loop over the values multiple times, that ends up being slower (and in those cases I'd just do a `list(indices)` to convert it to a Python list in a single iteration).
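To make the tradeoff concrete, here is a minimal sketch of the pattern (hypothetical names, not the actual `custom_vec_iter_impl!` expansion): a pyclass wraps the `Vec<usize>` so returning is cheap, and the conversion cost is deferred to `__getitem__`:

    use pyo3::exceptions::PyIndexError;
    use pyo3::prelude::*;

    // Hypothetical sketch: returned to Python as a single object, so no
    // per-element conversion happens when the function returns.
    #[pyclass]
    struct NodeIndicesSketch {
        nodes: Vec<usize>,
    }

    #[pymethods]
    impl NodeIndicesSketch {
        fn __len__(&self) -> usize {
            self.nodes.len()
        }

        // The usize -> Python int conversion cost is paid here, on every access.
        fn __getitem__(&self, idx: isize) -> PyResult<usize> {
            let len = self.nodes.len() as isize;
            let i = if idx < 0 { idx + len } else { idx };
            if i < 0 || i >= len {
                return Err(PyIndexError::new_err("index out of range"));
            }
            Ok(self.nodes[i as usize])
        }
    }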

Now, all that being said, those custom return types are something we added years ago, and PyO3 was much less mature back then. It is entirely possible that the overhead of creating the Python `List[int]` isn't as bad anymore, and that the custom return types' overhead on access isn't worth it with newer versions of PyO3.

enavarro51 commented 7 months ago

The DAGDependency code calls `predecessor_indices` about once for each node and then iterates over those indices once. I assume the overhead is in the `__getitem__` for `NodeIndices`.

I have a pretty good test environment set up, so I can do some checks on just a single iteration of the indices for a large number of items.

IvanIsCoding commented 7 months ago

I ran a non-scientific experiment on my machine and I believe the optimizations Matthew introduced still hold true.

Here is the code for getting the times:

    import rustworkx as rx
    import timeit

    N = 1_000_000
    g = rx.PyGraph()
    g.add_nodes_from(list(range(N)))
    nodes = g.node_indexes()

    print(
        "Getting the node indices or list from Rust a thousand times: ",
        timeit.timeit("g.node_indexes()", globals=globals(), number=1_000),
    )

    print(
        "Accessing a negative index a billion times: ",
        timeit.timeit("nodes[-10]", globals=globals(), number=1_000_000_000),
    )

Attempt 1

I left `node_indexes` as the original:

    #[pyo3(text_signature = "(self)")]
    pub fn node_indexes(&self) -> NodeIndices {
        self.node_indices()
    }

The results were:

    Getting the node indices or list from Rust a thousand times:  1.6359028699998817
    Accessing a negative index a billion times:  674.1754225720001

Attempt 2

I modified the `node_indexes` code to be:

    #[pyo3(text_signature = "(self)")]
    pub fn node_indexes(&self) -> Vec<usize> {
        self.graph.node_indices().map(|node| node.index()).collect()
    }

The results were:

    Getting the node indices or list from Rust a thousand times:  32.542972673999884
    Accessing a negative index a billion times:  28.010249427000417

Corollary

Returning a Python list from Rust is slow, so wrapping the result in a custom return type makes sense if the goal is making the function return quickly.

However, accessing a specific element of the Rust object is very slow compared to accessing a specific element of a Python list (roughly 674 ns vs. 28 ns per access in the numbers above).

IvanIsCoding commented 7 months ago

I think we should definitely profile the `__getitem__` method though: https://github.com/Qiskit/rustworkx/blob/52a4d05afe4aa2dffa22f928cf89ba368c64abd6/src/iterators.rs#L540

We need to see where it spends its time and maybe work with the maintainers of PyO3 to see if we can make it faster. The current overhead is very high.

jakelishman commented 6 months ago

It turns out that having

    #[derive(FromPyObject)]
    enum SliceOrInt {
        Slice(PySlice),
        Int(...),
    }

means that PyO3 attempts to downcast to `Slice` first, but like 99.9% of all `__getitem__` calls are going to be with an int (especially since the Python implicit-iterator behaviour is being used here: with no `__iter__` defined, Python iterates by calling `__getitem__` with 0, 1, 2, ... until `IndexError`).

Flipping the order in the enum switched `nodes[-10]` (from Ivan's code block above) from taking 205 ns to taking 61 ns for me, with no other changes (for comparison, indexing into the equivalent Python list took 25 ns, but PyO3 will have some springboard cost associated with it too, some of which will be improved by PyO3 0.21).
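The fix is just swapping the variant declaration order, since the `FromPyObject` derive tries variants from top to bottom. A sketch of the flipped enum (the field types here assume the pre-0.21 GIL-ref API, so details may differ from the actual rustworkx code):

    use pyo3::prelude::*;
    use pyo3::types::PySlice;

    // The common integer case is now tried first, so the vast majority of
    // __getitem__ calls skip the failed downcast to PySlice entirely.
    #[derive(FromPyObject)]
    enum SliceOrInt<'a> {
        Int(isize),
        Slice(&'a PySlice),
    }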

I think another change to potentially consider is adding custom direct iterator structs into `custom_vec_iter_impl!` that look like

    #[pyclass]
    struct NodeIndicesIter {
        base: Py<NodeIndices>,
        index: usize,
    }

    #[pymethods]
    impl NodeIndicesIter {
        fn __next__(&mut self, py: Python) -> PyResult<Py<PyAny>> {
            // ...
        }
        fn __iter__(slf: PyRef<'_, Self>) -> PyRef<'_, Self> {
            slf
        }
    }
which avoids any need to convert the Python index argument on each step of the iteration. I haven't timed that, but I could do so if it's something you'd want to consider.
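For concreteness, one way the elided `__next__` body could go is the sketch below (my guess at an implementation, not rustworkx code; it assumes `NodeIndices` exposes its inner `nodes: Vec<usize>`):

    #[pymethods]
    impl NodeIndicesIter {
        // Returning Option makes PyO3 raise StopIteration when it sees None.
        fn __next__(&mut self, py: Python<'_>) -> Option<Py<PyAny>> {
            // Fetch the next raw index from the shared NodeIndices object.
            let item = self.base.borrow(py).nodes.get(self.index).copied();
            self.index += 1;
            // The usize -> Python int conversion happens once per step, and
            // there is no Python index argument to extract on the way in.
            item.map(|v| v.into_py(py))
        }
    }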

mtreinish commented 6 months ago

> I think another change to potentially consider is adding custom direct iterator structs into `custom_vec_iter_impl!` that look like [...] which avoids any need to convert the Python index argument on each step of the iteration. I haven't timed that, but I could do so if it's something you'd want to consider.

We have support like that in the mapping type macro right now (to get the keys, values, and items iterators), so I think it'd be a good idea to add it to the vec ones too. I expect that will be a bit of a speed boost for a lot of cases, since we avoid more Python/PyO3 overhead by not going through `__getitem__`.

My unrelated idea to also speed things up here was maybe to cache the output PyObjects on `__getitem__` so that subsequent accesses don't eat the conversion cost. The tradeoff there is that we're essentially doubling the memory overhead for these classes, because we'd also need to store a `Vec<Option<PyObject>>` of the same length in each object. So I'm not sure if it's worth it or not.
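A minimal sketch of that caching idea (hypothetical type name; negative-index handling omitted for brevity, and not a concrete proposal):

    use pyo3::exceptions::PyIndexError;
    use pyo3::prelude::*;

    #[pyclass]
    struct CachedNodeIndices {
        nodes: Vec<usize>,
        // Same length as `nodes`: this is the doubled memory overhead.
        cache: Vec<Option<Py<PyAny>>>,
    }

    #[pymethods]
    impl CachedNodeIndices {
        fn __getitem__(&mut self, py: Python<'_>, idx: usize) -> PyResult<Py<PyAny>> {
            if idx >= self.nodes.len() {
                return Err(PyIndexError::new_err("index out of range"));
            }
            // Pay the usize -> Python int conversion on first access only;
            // subsequent accesses reuse the cached object.
            if self.cache[idx].is_none() {
                self.cache[idx] = Some(self.nodes[idx].into_py(py));
            }
            Ok(self.cache[idx].as_ref().unwrap().clone_ref(py))
        }
    }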