Performance regression caused by dagnode op property

mtreinish commented 3 years ago

Information

Qiskit Terra version: main after https://github.com/Qiskit/qiskit-terra/commit/3c979ebd
Python version: any
Operating system: any

What is the current behavior?

After PR #6199 merged a regression was flagged on our nightly benchmarks for certain transpiler passes where we're accessing a dag node's op frequently. For example:

https://qiskit.github.io/qiskit/#passes.PassBenchmarks.time_cx_cancellation?machine=dedicated-benchmarking-softlayer-baremetal&os=Linux%204.15.0-46-generic&ram=16GB&p-n_qubits=5&p-depth=1024&commits=3c979ebd

As was pointed out in #6433 this is caused by python function call overhead from using @property especially compared to a slotted attribute access that was there before..

Steps to reproduce the problem

Run a transpiler pass or any dagnode operation that frequently uses dagnode.op

What is the expected behavior?

No performance regression.

Suggested solutions

Either revert the change or change our internal dagnode usage to use the _op attribute on access for performance critical code (which is part of what #6433 does for one pass).

enavarro51 commented 3 years ago

After a quick check, it looks like there are about 400 instances of some form of node.op in the code. I assume changing those to node._op is preferable to reverting #6199. If the change is preferred, I could take this on.

kdk commented 3 years ago

After a quick check, it looks like there are about 400 instances of some form of node.op in the code. I assume changing those to node._op is preferable to reverting #6199. If the change is preferred, I could take this on.

If possible, I'd prefer to only access node._op in the cases where we know node.op is causing a performance problem. Even then, this might be better fixed by restructuring the interface so access to a private attribute isn't needed, even for performance. (Maybe redefine node.op to be Optional[Instruction] that's not None if and only if node.type == 'op', or separate sub-classes for IODAGNode and OpDagNode. This still isn't great, but is maybe a step in the right direction.)

That said, I'm a little confused. node.op as an @property has existed since DAGNode was introduced in #1815 . #6199 added an @property for DAGNode.name.

mtreinish commented 3 years ago

That's my fault for a lack of precision in opening the issue. #6199 introduced a regression around DAGNode.name access, if you look at the benchmark I linked in the issue it's for cx cancellation which will be calling DAGNode.name for each node in the dag as part of: https://github.com/Qiskit/qiskit-terra/blob/main/qiskit/transpiler/passes/optimization/cx_cancellation.py#L30 which is why we see the regression there in the nightly benchmarks

In the issue I was conflating that with the changes made in #6433 which were about DAGNode.op and DAGNode.qargs which were the same root cause function call overhead and were probably potential bottlenecks we've had sitting around for some time and just never noticed before.

Maybe @IvanIsCoding can comment here about his profiling on the collect_2q_blocks pass to show the overhead from the function call.

I agree with @kdk that we probably only want to do that in performance critical parts, but not generally. For example, in the circuit drawers (assuming it's used there, which I think it is) it wouldn't make sense to change anything.

enavarro51 commented 3 years ago

So in #6199, we deprecated the use of passing a name when instantiating a DAGNode. We then added the property approach to accessing the name to point it to _op.name. It looks like there are fewer than 100 cases of node.name used, mostly in the passes. How about if we change those node.name's to node._op.name since that's where it's pointing to anyway.

IvanIsCoding commented 3 years ago

That's my fault for a lack of precision in opening the issue. #6199 introduced a regression around DAGNode.name access, if you look at the benchmark I linked in the issue it's for cx cancellation which will be calling DAGNode.name for each node in the dag as part of: https://github.com/Qiskit/qiskit-terra/blob/main/qiskit/transpiler/passes/optimization/cx_cancellation.py#L30 which is why we see the regression there in the nightly benchmarks

In the issue I was conflating that with the changes made in #6433 which were about DAGNode.op and DAGNode.qargs which were the same root cause function call overhead and were probably potential bottlenecks we've had sitting around for some time and just never noticed before.

Maybe @IvanIsCoding can comment here about his profiling on the collect_2q_blocks pass to show the overhead from the function call.

I agree with @kdk that we probably only want to do that in performance critical parts, but not generally. For example, in the circuit drawers (assuming it's used there, which I think it is) it wouldn't make sense to change anything.

I can jump in an say that in my particular case at #6433, @property is used just for the setter and not for the getter. Hence, doing node.op and node._op are equivalent. This might not be as straightfoward in this case.

The speedup by replacing op/_op and qargs/_qargs at collect_2q_blocks comes from:

Almost every if/else checked those two properties, often multiple times per if statement
_qargs/_op are heavily optimized because of the use of __slots__ . That is the fastest attribute look up you can do in Python!
qargs/op are not optimized because there is @property, hence there is a function call which slows things

My advice is to benchmark and see if the gains in th specific case are worth it. In collect_2q_blocks, they were. Counting appearance of node.name is a good starter guess, but mind you those don't account if they're in loops that executed many times or if statements that are never reaached.

kdk commented 3 years ago

Out of curiosity, I ran the following through timeit:

class C():
    def __init__(self):
        self.foo = 'bar'

    @property
    def property_foo(self):
        return self.foo

    @property
    def cond_property_foo(self):
        if not True or 'true' != 'true':
            raise ValueError()
        return self.foo

time per call	without slots	with slots
`c.foo`	35.4 ns ± 1.5 ns	26 ns ± 0.197 ns
`c.property_foo`	119 ns ± 3.56 ns	107 ns ± 2.5 ns
`c.cond_property_foo`	159 ns ± 24 ns	140 ns ± 2.86 ns

(N.B. Python 3.6 on OSX 10.15 )

At least for this microbenchmark, it seems in general like __slots__ saves 10-15 ns per call, @property costs ~80 ns, and the condition checking another 30-40 ns per call. If these are attributes that are accessed frequently enough in a loop (and if these numbers are representative), this could add up to the observed regression.

That said, my preference here would be to either:

Use the _ attributes only in the cases where we know there is an active performance concern (with a comment explaining why we're using the private attribute),
or to restructure DAGNode (either via subclasses for IO and Op types, or by allowing .name,.op,... to be None) so that the @property and explicit type check aren't necessary.

That we have a performance concern over convenience code that wraps and re-raises an AttributeError as a QiskitError suggests to me that there's room for improvement in the design. In general, it would be good to avoid building interfaces that are fast only if the consumer knows how to hold them in the right way, when we can make interfaces that are intentionally fast (as fast as python can be) by design.

mtreinish commented 3 years ago

Personally I think the subclass approach makes the most sense. That way we can avoid the property altogether. Having the explicit type attribute always seemed a bit weird.

Qiskit / qiskit