alonkukl opened this issue 1 month ago
Marc, Ethan, and I discussed this way back when we first started writing BQSKit. The problem with this is that any attempt to parallelize past the next batch of the frontier can "overrule" the search heuristic. For example, if a suboptimal branch of the search tree returns results quickly, the algorithm may report a suboptimal result simply because that branch finished first. We decided against this at the time because we were already saturating the cluster through the partitioning workflows.
Two things:
We can definitely do something to this effect, but we should be careful how we "select" the solution. We cannot just select the first solution if we parallelize past one batch of the frontier; we must be careful that no potentially better solution is still waiting to be processed. We did this thought experiment in the past, and we worried this might nullify any speedups you would get from it. We should, however, revisit this, as there may be ways we can submit all nodes from the first k layers of the tree to bootstrap the standard search process. Let's talk if you'd like to explore this.
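A minimal sketch of that bootstrapping idea might look like the following; `expand`, `instantiate_and_score`, and the `executor` are placeholders for illustration, not BQSKit's actual API:

```python
import heapq
from itertools import count


def expand(node):
    """Placeholder: return the child circuits of a search node."""
    return []


def instantiate_and_score(node):
    """Placeholder: instantiate a node and return (heuristic_cost, node)."""
    return (0.0, node)


def bootstrap_frontier(root, k, executor):
    """Enumerate every node in the first k layers and submit them as one batch."""
    layer = [root]
    all_nodes = []
    for _ in range(k):
        layer = [child for node in layer for child in expand(node)]
        all_nodes.extend(layer)
    # One large batch keeps the cluster busy, but we wait for *all* results
    # before any selection, so the search heuristic is never overruled.
    results = list(executor.map(instantiate_and_score, all_nodes))
    tiebreak = count()
    frontier = []
    for cost, node in results:
        heapq.heappush(frontier, (cost, next(tiebreak), node))
    return frontier  # standard best-first search continues from this frontier
```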
Another thing on the back burner is how to budget processing resources for each pass. In a partitioning workflow with many blocks, it may be better for the overall workflow if we don't parallelize much at the lower levels (synthesis + instantiation). However, if it is just a synthesis workflow, then maybe we should parallelize as much as possible. I have thought about having a number in the pass data representing the number of workers at the current pass's disposal. It starts at the whole cluster, but the ForEachBlockPass can reduce it for each block. Something to that effect. We should also chat about this; it goes hand in hand with your other issue on GPU scheduling.
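Roughly what I have in mind, as a sketch only (the `workers_available` key, the even split, and the helper names are assumptions, not current BQSKit behavior):

```python
# Hypothetical: carry a worker budget through the pass data and split it
# across blocks, so lower-level passes know how much parallelism to use.
def run_foreach_block(blocks, pass_data):
    total_workers = pass_data.get("workers_available", 1)
    # Give each block an equal slice of the cluster, but never less than 1.
    per_block = max(1, total_workers // max(1, len(blocks)))
    for block in blocks:
        block_data = dict(pass_data)            # child scope for this block
        block_data["workers_available"] = per_block
        run_block_passes(block, block_data)     # placeholder for the sub-workflow


def run_block_passes(block, block_data):
    """Placeholder: run synthesis/instantiation passes on one block,
    limiting their parallelism to block_data['workers_available']."""
    pass
```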
I think we can overcome the "overrule" issue if we wait for the current batch of frontier nodes to finish and only then create a new batch.
I think the above flow will also eliminate the selection issue. Maybe we should talk about it today.
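Roughly, the flow I mean is something like this sketch (all helpers and the executor are placeholders, not BQSKit's API):

```python
def expand_all(batch, expand):
    return [child for node in batch for child in expand(node)]


def synchronized_search(root, expand, instantiate_and_score, success, executor, batch_size=4):
    batch = [root]
    while batch:
        children = expand_all(batch, expand)
        # Barrier: wait for *every* result in the batch before deciding,
        # so a fast-but-suboptimal branch cannot "overrule" the heuristic.
        results = list(executor.map(instantiate_and_score, children))
        results.sort(key=lambda pair: pair[0])              # order by heuristic cost only
        done = [node for cost, node in results if success(cost)]
        if done:
            return done[0]                                  # best solution found in this batch
        batch = [node for _, node in results[:batch_size]]  # best nodes seed the next batch
    return None
```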
The GPU issue needs a much broader solution, as we need to define workers with GPU resources and submit batches of GPU-only jobs to exactly those workers. On the same node, we might have a set of workers with GPUs and a set without.
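Conceptually, something like the following; this is purely illustrative and not an existing BQSKit runtime feature, so the `Worker` fields and `submit_batch` routine are assumptions:

```python
from dataclasses import dataclass


@dataclass
class Worker:
    worker_id: int
    has_gpu: bool


def submit_batch(jobs, workers, needs_gpu):
    """Round-robin a batch of jobs over only the workers that can run them."""
    eligible = [w for w in workers if w.has_gpu or not needs_gpu]
    if not eligible:
        raise RuntimeError("no worker satisfies the GPU requirement")
    return [(job, eligible[i % len(eligible)]) for i, job in enumerate(jobs)]


# Example: one node hosting a mix of GPU and CPU-only workers.
workers = [Worker(0, True), Worker(1, True), Worker(2, False), Worker(3, False)]
assignments = submit_batch(["inst-a", "inst-b", "inst-c"], workers, needs_gpu=True)
```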
Currently, BQSKit does parallelize when adding new circuits to its search frontier. I think this may be what @alonkukl was talking about, because when synthesizing a 7-qubit, all-to-all topology circuit, there will be 21 new circuits added to the search frontier at each step. Each of these circuits is instantiated in parallel before being analyzed and added to the frontier.
@edyounis There is a solution to this that worked well enough that it was the default in the original qsearch for a while: "beam search". You simply pop the top N nodes off the search tree at each layer instead of just the top node. It generally speeds up the search with little to no effect on the quality of the solution (for small values of N). You get diminishing returns on the speed improvement, and more risk of affecting the result, as you increase N.
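For reference, a minimal sketch of one beam step over a cost-ordered heap; the names here are illustrative, not the qsearch/BQSKit API:

```python
import heapq
from itertools import count

_tiebreak = count()


def beam_step(frontier, beam_width, expand, instantiate_and_score, success):
    """Pop the best `beam_width` nodes, expand them all in one batch,
    and push every scored child back onto the frontier."""
    batch = [heapq.heappop(frontier)[2] for _ in range(min(beam_width, len(frontier)))]
    children = [child for node in batch for child in expand(node)]
    # In practice this map would be dispatched to the runtime's workers.
    for cost, child in map(instantiate_and_score, children):
        if success(cost):
            return child                      # found a solution at this layer
        heapq.heappush(frontier, (cost, next(_tiebreak), child))
    return None                               # keep searching
```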
In the context of partitioning and resynthesis of larger circuits, I think the current system is better than using beams: you still have plenty of parallelism to take advantage of many cores (even more so if you also use multistart instantiation!), but you never spend any compute time that isn't strictly necessary.
In situations where the runtime has many workers (on 10 nodes of an HPC cluster we can have 1,280 workers), QSearch and LEAP underutilize it, as they only submit a single batch from the frontier. For example, if we have a 7-qubit circuit and 2-qubit gates, there will be 7 choose 2 (= 21) jobs in the batch.
I suggest that these passes be able to submit multiple frontier batches at once, thus reducing the time to solution.
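For concreteness, the arithmetic behind the example above (assuming the 1,280 workers split evenly as 128 per node):

```python
from math import comb

jobs_per_batch = comb(7, 2)        # 21 candidate two-qubit placements
workers = 10 * 128                 # 10 nodes x 128 workers each = 1280 (assumed even split)
print(jobs_per_batch)              # 21
print(jobs_per_batch / workers)    # ~0.016, i.e. under 2% of the workers are busy
```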