Qiskit / qiskit

Qiskit is an open-source SDK for working with quantum computers at the level of extended quantum circuits, operators, and primitives.
https://www.ibm.com/quantum/qiskit
Apache License 2.0

memory usage increases significantly when using a coupling map and cz instead of cx as basis #9372

Open nonhermitian opened 1 year ago

nonhermitian commented 1 year ago

Environment

What is happening?

Transpiling circuits seems to take an excessive amount of memory, to the point where doing things in parallel can consume all of the memory and freeze one's system. In this example, I am transpiling 20Q circuits over many repeated applications of a unitary. When using 4 CPUs, each process continues to take up memory until all of it is consumed and swap space starts to be used, after which point the system freezes.

@mtreinish pointed out that this might be a Jupyter variable-caching issue, but it is reproducible from the script below run directly from Python. One could argue for turning off parallel execution, or reducing it, but as the depth of the circuits grows this would not solve anything, and such issues would likely be hit with a single process at some point (which I have yet to explore).
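Aside: parallel dispatch can be turned off or capped while debugging. A minimal sketch, assuming the documented QISKIT_PARALLEL / QISKIT_NUM_PROCS environment variables; they need to be set before qiskit is imported, since the defaults are read at import time.

import os

# Run transpile() serially while debugging memory use; set before importing qiskit.
os.environ["QISKIT_PARALLEL"] = "FALSE"
# os.environ["QISKIT_NUM_PROCS"] = "2"   # ...or just cap the worker count

from qiskit import transpile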

The script below runs at O2, and there is a big difference between O1 and O2 in terms of memory: the former easily executes within the confines of a 64 GB machine, whereas the latter kills the machine every time.

How can we reproduce the issue?

The script and circuits are given below. One can replace the call to ibm_prague with the newly added fake-backend version (#9369) and hopefully get the same result; a sketch of that substitution follows the script.

20Q_DTC_100cycles.zip


from qiskit import transpile, qpy

from qiskit_ibm_provider import IBMProvider
provider = IBMProvider()
ibmq_backend = provider.get_backend('ibm_prague')

# Load the 101 circuits from the attached QPY file
with open('20Q_DTC_100cycles.qpy', 'rb') as fd:
    circs = qpy.load(fd)

# Candidate 20-qubit chains on the device, used as initial layouts
cov_layouts = [[4, 1, 2, 3, 5, 8, 11, 14, 16, 19, 22, 25, 24, 23, 21, 18, 15, 12, 10, 7], 
               [31, 30, 3, 2, 1, 4, 7, 10, 12, 13, 14, 16, 19, 22, 25, 24, 23, 27, 28, 29], 
               [0, 1, 2, 3, 5, 8, 11, 14, 16, 19, 22, 25, 24, 23, 21, 18, 15, 12, 10, 7], 
               [3, 2, 1, 4, 7, 10, 12, 15, 18, 21, 23, 24, 25, 22, 19, 16, 14, 11, 8, 9], 
               [1, 2, 3, 5, 8, 11, 14, 16, 19, 22, 25, 24, 23, 21, 18, 15, 12, 10, 7, 6], 
               [7, 4, 1, 2, 3, 5, 8, 11, 14, 13, 12, 15, 18, 21, 23, 24, 25, 22, 19, 20], 
               [8, 5, 3, 2, 1, 4, 7, 10, 12, 13, 14, 16, 19, 22, 25, 24, 23, 21, 18, 17], 
               [7, 10, 12, 15, 18, 21, 23, 24, 25, 22, 19, 16, 14, 11, 8, 5, 3, 30, 31, 32], 
               [19, 16, 14, 11, 8, 5, 3, 2, 1, 4, 7, 10, 12, 15, 18, 21, 23, 24, 25, 26]]

# Transpile the full set of circuits once per candidate layout
mapped_circs = []
for layout in cov_layouts:
    temp_circs = transpile(circs, backend=None,
                           coupling_map=ibmq_backend.configuration().coupling_map,
                           basis_gates=ibmq_backend.configuration().basis_gates,
                           initial_layout=layout, optimization_level=2)
    mapped_circs.append(temp_circs)
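
For reference, the fake-backend substitution mentioned above might look like the following. This is a sketch assuming the FakePrague snapshot from #9369 is importable from qiskit.providers.fake_provider; passing the backend object directly lets transpile() pull the coupling map and basis gates itself, whether the snapshot is BackendV1 or BackendV2.

from qiskit import qpy, transpile
from qiskit.providers.fake_provider import FakePrague  # added in #9369

with open('20Q_DTC_100cycles.qpy', 'rb') as fd:
    circs = qpy.load(fd)

backend = FakePrague()
mapped_circs = [
    transpile(circs, backend=backend, initial_layout=layout, optimization_level=2)
    for layout in cov_layouts  # cov_layouts as defined in the script above
]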

What should happen?

Memory usage at 20Q should not be blocking me from testing circuits on hardware on a 64 GB machine.

Any suggestions?

No response

nonhermitian commented 1 year ago

As a follow-up, if I use the exact same circuits and backend but change the entangling gate from cz to cx, it no longer takes up all of my memory. This is also in line with other results, where I can transpile larger circuits, e.g. width 50 or 100, on systems with cx or ecr gates and not run into problems.

nonhermitian commented 1 year ago

If I leave out the coupling_map and initial_layout while keeping cz in the basis gates, then optimization_level=2 takes only ~400 MB per process versus 8+ GB in the original example. The transpilation time is also dramatically reduced. In my case, the circuits are already linearly mapped, so I can use the mapomatic inflate routine to do the actual layout. Others are obviously not so lucky.


trans_circs = transpile(circs, backend=None, basis_gates=['cz', 'rz', 'sx', 'x'], optimization_level=2)

mtreinish commented 8 months ago

I confirmed there is still a ~10x increase in max memory usage when running the example with the current (as of yesterday morning) main branch. Running with a cx basis had a max RSS of 1944372 KB, and with cz the max RSS reported was 11025292 KB. I was assuming this is primarily a function of the swap -> cz translation being less efficient, but 10x growth in memory is higher than I expected if it were just that (especially now that everything is a singleton). There is probably some intermediate state in the transpiler passes that temporarily consumes much more memory (max RSS is the peak usage over the life of the process), even though it's not that high at rest.

This is also transpiling all 101 circuits for the full loop, so some linear growth in used memory is expected: after each layout iteration we're storing another 101 circuit copies. But I don't think each circuit copy is 10x bigger, which is what would be needed to account for the difference.
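
For anyone wanting to reproduce the measurement, the peak figures can be read from the standard-library resource module; a sketch, reusing circs/cov_layouts/ibmq_backend from the original script (ru_maxrss is in KB on Linux and bytes on macOS, and /usr/bin/time -v reports the same number without any code changes):

import resource

from qiskit import transpile

trans = transpile(circs, backend=None,
                  coupling_map=ibmq_backend.configuration().coupling_map,
                  basis_gates=['cz', 'rz', 'sx', 'x'],
                  initial_layout=cov_layouts[0],
                  optimization_level=2)

# RUSAGE_SELF covers this process; RUSAGE_CHILDREN covers reaped parallel workers.
peak_self = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
peak_kids = resource.getrusage(resource.RUSAGE_CHILDREN).ru_maxrss
print(f"max RSS: self={peak_self} KB, children={peak_kids} KB")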

jakelishman commented 8 months ago

swap -> cz translation involves 6 rz and 7 sx on Torino even at O3 (12 rz and 6 sx at O1), plus the 3 cz. That's 2x the number of singleton instances and 6 non-singleton instances per swap, so I think 10x in memory doesn't sound totally out there.

edit: but your numbers actually say 6x - which is even more believable to me.
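
The per-swap overhead is easy to check directly; a quick sketch (exact counts vary with Qiskit version and optimization level):

from qiskit import QuantumCircuit, transpile

qc = QuantumCircuit(2)
qc.swap(0, 1)

# Translate one SWAP into each basis and compare the gate counts.
for basis in (['cx', 'rz', 'sx', 'x'], ['cz', 'rz', 'sx', 'x']):
    out = transpile(qc, basis_gates=basis, optimization_level=1)
    print(basis[0], dict(out.count_ops()))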

jakelishman commented 8 months ago

ofc the point being that (a) we should reduce the footprint of rz, especially when it's all multiples of pi/2, and (b) when the synthesis is going to be identical in a bunch of places, we're possibly going to need to come up with an output-format extension that lets us put something like swap_type_0 on the circuit as a singleton 2q gate and supply its definition along with the output somewhere. For the cz-ish backends that would be an immediate 16x reduction in memory for routing (though it will be tricky to define output formats for, and to get people to thread through backend stacks).
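
Point (a) is easy to see: since the singleton work, parameterless standard gates share one instance, while every rz carries its own object. A sketch against a recent, post-singleton Qiskit:

import math

from qiskit.circuit.library import RZGate, SXGate

print(SXGate() is SXGate())                        # True: one shared singleton
print(RZGate(math.pi / 2) is RZGate(math.pi / 2))  # False: a fresh object every time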