Runtime recursion depth reached in classical expresssion while implementing stress test

Environment

Qiskit version: 1.0
Python version: 3.9
Operating system: OSX

What is happening?

While running a stress test for classical expressions I am running into a runtime recursion depth issue generating a circuit.

    yield from node.right.accept(self)
../../../../opt/anaconda3/envs/primitives/lib/python3.11/site-packages/qiskit/circuit/classical/expr/visitors.py:73: in visit_binary
    yield from node.right.accept(self)
../../../../opt/anaconda3/envs/primitives/lib/python3.11/site-packages/qiskit/circuit/classical/expr/visitors.py:73: in visit_binary
    yield from node.right.accept(self)
../../../../opt/anaconda3/envs/primitives/lib/python3.11/site-packages/qiskit/circuit/classical/expr/visitors.py:73: in visit_binary
    yield from node.right.accept(self)
../../../../opt/anaconda3/envs/primitives/lib/python3.11/site-packages/qiskit/circuit/classical/expr/visitors.py:73: in visit_binary
    yield from node.right.accept(self)
../../../../opt/anaconda3/envs/primitives/lib/python3.11/site-packages/qiskit/circuit/classical/expr/visitors.py:73: in visit_binary
    yield from node.right.accept(self)
../../../../opt/anaconda3/envs/primitives/lib/python3.11/site-packages/qiskit/circuit/classical/expr/visitors.py:72: in visit_binary
    yield from node.left.accept(self)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

self = Var(Clbit(ClassicalRegister(1000, 'c0'), 42), Bool()), visitor = <qiskit.circuit.classical.expr.visitors._VarWalkerImpl object at 0x115376aa0>

    def accept(self, visitor, /):
>       return visitor.visit_var(self)
E       RecursionError: maximum recursion depth exceeded

../../../../opt/anaconda3/envs/primitives/lib/python3.11/site-packages/qiskit/circuit/classical/expr/expr.py:138: RecursionError
---------------------------------------------------------------------------- Captured stderr setup ---------------------------------

How can we reproduce the issue?

        n_qubits = self.backend.num_qubits
        n_comparisons = 1000
        # n_comparisons = 200 works for me

        n_measures_per_round = n_qubits - 1

        qr = QuantumRegister(n_qubits)
        cr = ClassicalRegister(n_comparisons)
        final_cr = ClassicalRegister(1)

        qc_expr_test = QuantumCircuit(qr, cr, final_cr)
        # Should measure all 1s except the final qubit
        qc_expr_test.x(qr[:-1])
        measurements_remaining = n_comparisons
        while (measurements_remaining > 0):
            total_measures_performed = n_comparisons - measurements_remaining
            measures_this_round = min(measurements_remaining, n_measures_per_round)
            qc_expr_test.barrier(qr)

            qc_expr_test.measure(qr[:measures_this_round], cr[total_measures_performed:total_measures_performed+measures_this_round])
            qc_expr_test.barrier(qr)
            measurements_remaining = measurements_remaining - measures_this_round

        # Complicated expression which should evaluate to 1

        # Or together all results
        # for a complicated expression
        or_bit_expr = None
        for cbit in cr:
            if or_bit_expr is None:
                or_bit_expr = cbit
            else:
                or_bit_expr = expr.bit_or(cbit, or_bit_expr)
        # Should evaluate to all 1's ideally

        # Map final classical result onto qubit to be measured
        with qc_expr_test.if_test(or_bit_expr):
            qc_expr_test.x(qr[-1])

        qc_expr_test.barrier(qr)
        qc_expr_test.measure(qr[-1], final_cr)

What should happen?

Should build the circuit successfully

Any suggestions?

No response

Posting this one as well. It doesn't error but takes very long to evaluate to the point I stopped at 200 operations and gave up (20 worked)

        n_qubits = self.backend.num_qubits
        n_comparisons = 200

        n_measures_per_round = n_qubits - 1

        qr = QuantumRegister(n_qubits)
        # Store each bit in a register to avoid packing in the compiler
        # in an effort to make less LLVM optimization and more
        # computations happen
        crs = [ClassicalRegister(1) for _ in range(n_comparisons)]
        final_cr = ClassicalRegister(1)

        qc_expr_test = QuantumCircuit(qr, *crs, final_cr)
        # Should measure all 1s except the final qubit
        qc_expr_test.x(qr[:-1])
        measurements_remaining = n_comparisons
        while (measurements_remaining > 0):
            total_measures_performed = n_comparisons - measurements_remaining
            measures_this_round = min(measurements_remaining, n_measures_per_round)
            qc_expr_test.barrier(qr)

            qc_expr_test.measure(qr[:measures_this_round], [crs[cr][0] for cr in range(total_measures_performed, total_measures_performed+measures_this_round)])
            qc_expr_test.barrier(qr)
            measurements_remaining = measurements_remaining - measures_this_round

        # Complicated expression which should evaluate to 1

        # Or together all results
        # for a complicated expression
        or_bit_expr = None
        for cr in crs:
            cbit = cr[0]
            if or_bit_expr is None:
                or_bit_expr = cbit
            else:
                tmp_expr = expr.bit_xor(cbit, or_bit_expr)
                or_bit_expr = expr.bit_or(tmp_expr, tmp_expr)
        # Should evaluate to all 1's ideally

        # Map final classical result onto qubit to be measured
        with qc_expr_test.if_test(or_bit_expr):
            qc_expr_test.x(qr[-1])

For the first comment:

Thanks for the report, it's really good to be doing stress testing of this, and I really hadn't thought about 1000+ element expressions when I made the expression tree system, or wrote everything in terms of visitors. It's not immediately clear to me what the strategy should be here to fixing this for all expression-handling logic; anything that uses the visitors is going to be subject to Python's recursion limiter. Fwiw, I'd naively expect the recursion failure to occur at a little under 1500 nodes deep on Linux or macOS, and a little under 500 on Windows (the visitor uses two function calls for each node - Expr.accept and a corresponding Visitor.visit - and the default sys.getrecursionlimit() is 3000 on Linux/macOS, 1000 on Windows iirc).

That said, the actual failure point here is a context-free visit, so I can pretty trivially reimplement that in terms of an arbitrary-order iterative node walker. The problem is that it'll likely still fail somewhere else down the line, in something like trying to compare circuit equality (which again tbf, I suppose could be context free if the arbitrary-order walker is deterministic, which ofc I would do). Then Aer also uses this visitor pattern to walk expressions (though I argued a small amount for it to linearise to some RPN stack evaluator internally, I don't think we went that way), so it'll choke attempting the simulation, and I think the OQ3 exporter does too. Rewriting the general visitor machinery into something like a deque-based iterative method is a larger task, and I guess probably needs a completely alternate visitor helper base (not impossible, just fiddly).

Fwiw, Python's built-in ast has the exact same pattern with its NodeVisitor (though they implement the visitor slightly differently), so at least on the positive side we've got company haha:

import ast

class Visit(ast.NodeVisitor):
    pass

expr = ast.parse(" + ".join(["x"]*1500))
Visit().visit(expr)

[ ... snip ...]

File ~/python/3.10.6/lib/python3.10/ast.py:254, in iter_fields(node)
    252 for field in node._fields:
    253     try:
--> 254         yield field, getattr(node, field)
    255     except AttributeError:
    256         pass

RecursionError: maximum recursion depth exceeded while calling a Python object

How critical is this for your stress testing? Would the need for humongous binary expressions be mitigated by Qiskit including manual stores to memory and/or bitshift operations in its expression tree?

For the second comment:

Just to point out, this code block:

or_bit_expr = None
for cr in crs:
    cbit = cr[0]
    if or_bit_expr is None:
        or_bit_expr = cbit
    else:
        tmp_expr = expr.bit_xor(cbit, or_bit_expr)
        or_bit_expr = expr.bit_or(tmp_expr, tmp_expr)

produces an expression with something like $2^n - 1$ binary terms in it (or something like that - I kind of ball-parked the algebra).

Creating the expression should be fine, because that's linear - the way you've done it has explicit Python-object re-use. The problem for us comes in walking the expression; the visitor doesn't attempt to do on-the-fly subexpression elimination or anything, so we're now attempting to walk a $2^{200}$ node expression (something like $10^{60}$) to verify that it's safe to add to the circuit.

Questions for this comment:

would us implementing manual stores to memory (so there's a linear-space workaround) be enough for us to say we're just not going to handle this?
alternatively / in addition, would us adding bitshift operations alleviate the problems for you?
if neither of those, are you thinking that something along the lines of a Python-object-identity-based subexpression elimination trick might be enough for expression-handlers in Python that can get away with it? For example, "walk all used variables" can easily do that, as can "are these two expressions structurally identical?".

Fundamentally, I think this giant expression just isn't going to be supported without allowing stores to memory though (which I'm working on, I promise) - if nothing else, we'd have to output it to OQ3 at some point.

Qiskit / qiskit