mtreinish commented 2 years ago

What is the expected enhancement?

In #7213 we added the vf2layout pass to all the preset passmanagers. For the instantiation of that pass on each optimization level we had to set limits on the pass, both in number of internal vf2 state visits (which gets passed directly to retworkx), as well as other options to limit the amount of time we spend searching for an ideal solution. For example:

in each optimization level we set an increasing limit to roughly model the maximum amount of time we want to spend on the pass: 100ms for level 1, 10sec for level 2, and 60 sec for level 3. For a first approach this makes sense as it was easy to implement and will improve the quality of the transpiler significantly when there is a perfect mapping available. However, the limitation with this approach is it doesn't take into account the complexity of the input problem. As both the size of the interaction graph and the coupling graph increase the amount of time it takes for vf2 to find a potential solution can increase significantly. To accommodate this we should come up with a scaling factor for the hard limits we place on the pass so that we can scale these limits to something appropriate for inputs that will require more time. This shouldn't be a problem as the other passes typically scale linearly with the number of qubits (and routing passes scale exponentially) so spending extra time up front to potentially avoid the slower passes later is a decent tradeoff as long as we don't spend too much time.

To implement this we'll need to do some benchmarking of retworkx's vf2_mapping() function to figure out the scaling properties of how long it takes to find a result. Once we have a good model of that we can come up with a formula on how to scale the limits appropriately for each optimization level and apply that in the preset passmanager based on the input target backend.

TheGupta2012 commented 2 years ago

Hey @mtreinish, as a starting point in bench marking of vf2_mapping, what would be the a limit of the coupling graph size if we are trying to find sub graph isomorphism?

mtreinish commented 2 years ago

Well there is technically no limit to the size of the coupling graph size. For benchmarking to start I would probably do a sweep from like 10 node to 10000 nodes (which yeah is really big) and see if we can find a trend, but if 10000 is too big and takes too long you can start smaller. Using some of the retworkx generator functions: https://qiskit.org/documentation/retworkx/api.html#generators (especially https://qiskit.org/documentation/retworkx/apiref/retworkx.generators.directed_heavy_square_graph.html and https://qiskit.org/documentation/retworkx/apiref/retworkx.generators.directed_heavy_hex_graph.html ) will be useful here to quickly create large graphs.

I expect that the performance of the vf2_mapping() is probably going to be a function of both the size of the coupling graph and the size of the 2q interaction graph as well. I would expect that as the size of the interaction graph approaches the size of the coupling graph it will take longer for retworkx to find an isomorphic mapping.

nonhermitian commented 2 years ago

Isn't it most time consuming closer to half the total graph size? There are only so many options for placing an N-1 size circuit

mtreinish commented 2 years ago

Well I was thinking more about time to find that initial isomorphic mapping, When the size of the graphs are closer the search space to verify a mapping is valid is a lot bigger because you have to check more nodes. Yeah, there are less overall posibilities of valid mappings if the subgraph when the graphs are closer in size, but to find that first mapping you have to do a lot more work. I did a very small benchmark to test this with vf2_mapping() basically something like:

import csv
import time

import numpy as np
import retworkx as rx

with open("vf2_times.csv", "w", newline="") as csvfile:
    times_writer = csv.writer(csvfile)
    times_writer.writerow(["cmap_size", "imap_size", "time"])
    cmap = rx.generators.directed_grid_graph(7, 7, bidirectional=True, multigraph=False)
    for i in rang(5, 45):
        imap = rx.generators.directed_path_graph(i)
        times = []
        for _ in range(5):
            start = time.perf_counter()
            next(rx.vf2_mapping(cmap, imap, subgraph=True, id_order=False, induced=False))
            stop = time.perf_counter()
            run_time = stop - start
            times.append(run_time)
        avg_run_time = float(np.mean(times))
        times_writer.writerow([len(cmap), len(imap), avg_run_time])

graphing that output was:

vf2_times

HM0880 commented 2 years ago

@mtreinish

Have you thought about using McKay's nauty [1] to produce a canonical labeling for the isomorphism problem?

[1] https://users.cecs.anu.edu.au/~bdm/nauty/

mtreinish commented 2 years ago

I wasn't familiar with nauty before so I just took a look at it and skimmed the paper on the website. It's an interesting idea and maybe worth playing a bit, but I'm not sure it really would be applicable here. Primarily because nauty seems to about being able quickly test to find whether 2 graphs are isomorphic or not. The way that we're using the vf2 here is to find as many isomorphic mappings as we can and then picking the best performing/lowest noise mapping. I also didn't see anything about solving subgraph isomorphism problems like have here.It had support for induced subgraph isomorphism, but that won't work here (see the discussion around this when we initially added the pass: https://github.com/Qiskit/qiskit-terra/pull/6620#issuecomment-866774729).

There's also a more practical concern around integrating an external c library/standalone tool as a dependency which we really don't have a good way to do directly in terra. If we did decide to start leveraging it we'd probably need to make a standalone python library that wrapped nauty and published precompiled binaries for nauty and then add that as a dependency for terra.

HM0880 commented 2 years ago

Given one graph, nauty produces an ASCII string that can be converted into a graph adjacency matrix with some simple rules [1]. The ASCII string is the canonical labeling of the graph; if two graphs are each canonically labeled, then isomorphism is a simple string comparison. I am not sure if nauty does anything with subgraphs. I have been thinking about a way to use the canonical labeling to handle subgraph isomorphism, but if retworkx already does subgraph isomorphism, there is probably no point in redoing the problem.

[1] http://users.cecs.anu.edu.au/~bdm/data/formats.txt

That said, I am interested in looking at benchmarks for the vf2_mapping function.

HM0880 commented 2 years ago

Misc comments

The vf2_mapping function was next(rx.vf2_mapping(G, H, subgraph=True, id_order=False, induced=False))
The path graph H is the graph for the circuit qubits.
The running time seems to saturate starting with a 20x20 grid graph G (Figure 1). I guess if H is a path graph, then it's not very hard to map a small H onto a large G (related to @nonhermitian 's comment). Unsurprisingly (per Figures 2 and 3), as G gets larger and H gets smaller, the running time gets faster.
In Figure 1, the graphs that finish (e.g., |G| = 6x8) have a logistic curve trend, but very quickly (starting around |G| = 20x20), the times just ramp up and look basically linear.

Proposal

@mtreinish says [via Slack chat], "It [call_limit time] is definitely primarily a function of the number of circuit qubits which means we need to do it [set call_limit time] at run time inside the pass."

HM0880: The best predictor of run time is the size of the subgraph H. |H| = 21 runs under 0.1 seconds for up to the 100x100 grid graph G.

For the three levels of the preset pass managers, I propose the following settings for the call_limit parameter:

Table 1: Proposed `call_limit` times as a function of the size of H.	level	size of H
1	[01, 22)	0.1
2	[22, 24)	1.0
3	[24, 27)	10.0
4 (does not exist)	[27, ∞)	ask the user to set `call_limit`

Plots

allplots Figure 1: Running time vs. the size of the path subgraph H. The 7x7 grid graph had an outlier in one of the 5 trials, which explains the spike. The larger graphs have only 1 trial per data point since the runtimes were so long.

ratio-of-edges-2022-06-13 Figure 2: Running time vs. the ratio of number of edges in G / number of edges in H. Since G is a grid graph, the number of edges in G is G_edges = (m-1)*(n) + (n-1)*(m).

ratio-of-vertices-2022-06-13 Figure 3: Running time vs. the ratio of the size of G / size of H. The size of G is m*n.

mtreinish commented 2 years ago

For the three levels of the preset pass managers, I propose the following settings for the call_limit parameter:

Table 1: Proposed call_limit times as a function of the size of H. level size of H call_limit time 1 [01, 22) 0.1 2 [22, 24) 1.0 3 [24, 27) 10.0 4 (does not exist) [27, ∞) ask the user to set call_limit

Sorry I wasn't clear in my offline comment, the call_limit parameter is the number of internal state visits that retworkx performs before it gives up. Right now we have this value hard coded in the preset pass managers as something roughly approximating a fixed time. Although I just benchmarked a random run locally and this obviously will vary based on the graphs and the local system. The values are currently set to:

level 1: 5e4/~100ms - https://github.com/Qiskit/qiskit-terra/blob/main/qiskit/transpiler/preset_passmanagers/level1.py#L159 level 2: 5e6/~10 sec - https://github.com/Qiskit/qiskit-terra/blob/main/qiskit/transpiler/preset_passmanagers/level2.py#L157 level 3: 3e7/~60 sec - https://github.com/Qiskit/qiskit-terra/blob/main/qiskit/transpiler/preset_passmanagers/level3.py#L160

which was our baseline times for running this pass. The open question now is based on the data that you've collected do you think that dynamically scaling that call_limit value based on the size of H during the run() method of the pass makes sense? If so how would you propose scaling it.

HM0880 commented 2 years ago

Oh, now I understand how “3e7” corresponds to ~60 seconds of run time; thanks. Hopefully, the text below answers your question.

The open question now is based on the data that you’ve collected do you think that dynamically scaling that call_limit value based on the size of H during the run() method of the pass makes sense?

Yes.

If so how would you propose scaling it.

In Figure 1 from yesterday, the running times start to saturate when |G| is 20x20, so it is reasonable to treat |G| = 100x100 as a limiting size. I tried fitting the exponential curve time = a*10^(b*|H|) to the times for 100x100 (see Figure 4). The fit in Figure 4 is good, but the coefficients are specific to the (classical) computer on which this routine runs, so Figure 4 does not give a general solution.

Figure 4: Fit of the exponential curve time = a*10^(b*|H|) the running times for the 100x100 graph G.

allplots-fit-2022-06-16c

To be independent of the classical device computing power, the best method would be to run similar benchmarks, but instead of reporting the time, report the value of the internal vf2 algorithm counter when the algorithm finishes. retworkx uses call_limit in the lines at [1]; how can I get Rust to return the value of self._counter from the routine?

[1] https://github.com/Qiskit/retworkx/blob/71e34d111623d1de2e4870a8227eddacfb3ade4c/src/isomorphism/vf2.rs#L944-L947

mtreinish commented 2 years ago

Well it depends on how you want to return it. The easiest way would probably be to just print it with println!("{}", self._counter) which will print to stdout (but a separate buffer than python's). But if you want to return the value, to interact with probably the easiest way is to change the return type of next() there to PyResult<(usize, Option<NodeMap>)> and then update all the callers and returns to work with the tuple instead of the just Option<NodeMap> (the rust compiler will error for mismatched types if you don't anyway). Once you've updated everything to use the tuple it should be returned to python on each iteration on the return from vf2_mapping() since the python interface just returns the value here: https://github.com/Qiskit/retworkx/blob/71e34d111623d1de2e4870a8227eddacfb3ade4c/src/isomorphism/vf2.rs#L1010-L1017 (you'd have to update that code as part of changing the return type)

HM0880 commented 2 years ago

Here are call_limit counts as a function of |H|.

I ended up using a combination of println in retworkx, Bash redirects, and log files to get the counts. self._counter was printed every 1000 counts, which means that the call_limit counts are rounded down to nearest 1000, but since vf2 reaches ~1 million counts for |H|=26, we do not need to look at every single count. There is only one trial since I tried a small run with 3 trials per |H|, but each trial had exactly the same counts, which makes sense.

The line that goes through the green circles is the best-fit line ( A 10^B*|H| ) to those points. The other line (slightly above the fit line) is a scaled version of the A coefficient. It might be a good idea to round up the A and B coefficients to a nice value, so the scaling of A is my first attempt.

call_limit-counts-2022-06-20

Figure 5: Log plot of call_limit counts to first isomorphic submapping from a path subgraph H onto a grid graph G. The points drawn as x were excluded from the fit. The horizontal lines show counts of 5e4, 5e6, and 3e7 from the current hard-coded counts in Levels 1, 2, and 3 of the pass manager.

Qiskit / qiskit

Add adaptive limits for VF2Layout in preset passmanagers #7705

What is the expected enhancement?

Misc comments

Proposal

Plots