CQCL / hugr

Hierarchical Unified Graph Representation for quantum and classical programs

https://crates.io/crates/hugr

Apache License 2.0

17 stars 5 forks source link

Give Hugr edges unique IDs #1535

Open lmondada opened 1 week ago

lmondada commented 1 week ago

I believe defining hugr subgraphs in generality requires a way to uniquely identify Hugr edges. In SimpleReplacement we solved the lack of edge IDs by using (Node, IncomingPort)s as identifiers, which makes the assumption that every incoming port is connected to at most one outgoing port. This is not true for edges in CFG graphs (thanks @zrho for pointing that out), and so the implementation is actually incorrect.

Why not use hyper-edges?

@zrho made a good point that for the purposes of pattern matching, all that is required is a way to evaluate predicates of the type "node n1 at port p1 is linked to node n2 at port p2". This does not require unique edge IDs. Instead, could one identify an edge simply by the smallest port it is attached to?

This in effect gives IDs (not to edges but) to hyper-edges: assuming p1 < p2, p3, the two edges below would be assigned the same ID:

flowchart TB

subgraph n1
p1
end
subgraph n2
p2
end
subgraph n3
p3
end

p1 --> p2
p1 --> p3

This has advantages for pattern matching: in the case of a float wire, to check that both inputs to n2 and n3 are identical could be done directly, without having to match n1. However, by symmetry this would mean that it would also be desirable for the two edges in the following diagram to be part of the same hyper-edge:

flowchart TB

subgraph n1
p1
end
subgraph n2
p2
end
subgraph n4
p4
end

p1 --> p2
p4 --> p2

Should this be transitive? I don't think so... Thinking of the case in which all wires are floats, we'd obtain that n1 = n4.

More importantly: hyper-edge IDs are not sufficient to define sub-hugrs

It would be desirable that given a Hugr embedding $\phi : P \to H$ obtained from pattern matching, the image of the embedding $\phi(P) \subseteq H$ determines the subgraph of $H$ uniquely. This is not the case if the embedding is defined by its action on nodes and hyper edge IDs. To see why, consider the following two patterns

EDIT: I've now added ports to make the patterns more explicit, but the mermaid rendering is now somewhat confusing. If you look long enough you'll see this is just how tket2 would represent a sequence of rotations (by the same angle) on 1 qubit.

flowchart LR
subgraph a_f64
out1
end

subgraph rx1
in1_f64
in1_qb
out1_qb
end

subgraph rx2
in2_f64
in2_qb
out2_qb
end

subgraph rx3
in3_f64
in3_qb
out3_qb
end

out1 --> in1_f64
out1 --> in2_f64
out1 --> in3_f64
IN1 --> in1_qb
out1_qb --> in2_qb
out2_qb --> in3_qb
out3_qb --> OUT1

Now consider matching the pattern below against the graph above:

flowchart LR

subgraph b_f64
out2
end

subgraph rx4
in4_f64
in4_qb
out4_qb
end

subgraph rx5
in5_f64
in5_qb
out5_qb
end

subgraph rx6
in6_f64
in6_qb
out6_qb
end

out2 --> in4_f64
out2 --> in5_f64

IN2 --> in4_qb
out4_qb --> in5_qb
out5_qb--> in6_qb
out6_qb --> OUT2

Clearly, pattern 2 also matches in the graph of pattern 1. However, both embeddings (of pattern 1 into pattern 1 and of pattern 2 into pattern 1) are identical, when viewed as embeddings on nodes and hyper-edges. Thus one cannot define a hugr subgraph by maintaining a set of nodes and hyperedges. Additional information is required (such as the exact graph that was matched). Edge IDs would solve this.

zrho commented 6 days ago

We don't really have binary edges at the moment

The term edge might not be entirely ideal since the connections between ports do not quite map onto (binary) edges in the classical sense. Rather, there exists a partition of the set of ports into disjoint subsets. It might be more instructive to draw your copy example in this way: Untitled-2024-09-13-1404

The difference of this to binary edges that connect ports becomes clearer in this situation: Untitled-2024-09-13-1404(1) While the semantics excludes this currently both in dataflow and control flow, the data structure allows it. If we used binary edges to express this port connectivity, it would look like this: Untitled-2024-09-13-1404(2) What is not expressible in the notion of graph that we are using is the modified variant: Untitled-2024-09-13-1404(3) So while you can represent every "multiportgraph" with binary edges, not every graph expressed with binary edges is also a valid multiportgraph.

I'd be very reluctant to change that model. It might not be immediately natural when thought of as a graph, but I think it is a great fit for an IR. The hyperedges correspond to variables. An output port produces a value for that variable while an input port consumes a value for that variable. It also has a very natural description as a database so that pattern matching can be done via conjunctive queries. It's basically a directed version of a hypergraph category.

Embeddings vs Subgraphs

Let's denote your two patterns as $P_1$ and $P_2$. I agree that you can't distinguish between the identity $P_1 \to P_1$ and $P_2 \to P_1$ just by their image on nodes and hyperedges. But I don't quite see why that is so important: when you match a pattern, you get the embedding and not only its image; indeed the image isn't even sufficient for rewriting in any case since it does not identify the nodes and ports on the boundary that would need to be reconnected with the outside after the performing the rewrite.

There's a separate point that the map $P_2 \to P_1$ is not even an embedding: in $P_2$ the ports out2 and in6_f64 are not part of the same hyperedge, but their images out1 and in3_f64 are in the same hyperedge of $P_1$. Therefore the induced map on hyperedges is not injective and so $P_2 \to P_1$ not an embedding.

What do you want to do?

Perhaps it would help to use pairs of connected ports as your "edge id"? Could you write down a rewrite that would be hard to express? I'm struggling to help you without knowing the problem you're ultimately trying to solve.

lmondada commented 5 days ago

I absolutely understand your points and agree with them. I am simply trying to figure out how these considerations and properties of the hugr IR should be turned into a practical interface for rewriting. I'm by now convinced that our current code in Hugr is wrong. Assigning edge IDs was my simplest, and somewhat naive, solution.

There's a separate point that the map $P_2 \to P_1$ is not even an embedding

That is a matter of definition: the ports out2 and in6_f64 as far as I am concerned are not part of any hyperedge, and thus it is injective. I am guessing from your point that you view them as "singleton hyperedges" but for my use cases requiring injectivity on unbound ports is too restrictive a definition for patterns: unbound/open ports are boundary ports in patterns, which should match independently of whether or not they are attached to a same hyperedge outside the pattern.

What do you want to do?

I want to know what is the minimum amount of data that defines a rewrite. DPO rewriting in the context of Set (and thus Graph) in essence tells us that a rewrite can always be expressed by two sets $R^-$ and $R^+$, specifying the elements (edges) that must be removed and added respectively.

You are right that the combination of the pattern $P$ alongside with the embedding $P \hookrightarrow G$ (plus replacement graph and a boundary map) is also sufficient, but only as long as $P$ is a concrete pattern. If instead we consider more expressive pattern matching where a pattern represents an entire (possibly infinite) language $F$ of concrete patterns, then how do I know which concrete pattern $P \in F$ was matched? My pattern matcher must be able to express exactly the subgraph that was matched, or otherwise it must trace the graph traversal and construct the concrete pattern that has been matched -- the latter just sounds like an equivalent but more complex and error-prone way of achieving the former.

Proposal

Is it safe to make the following two assumptions? If so it sounds like I can define rewrites by three sets of ports (along with data on how they are to be rewired): $R^-$ ports to be removed by the rewrite $R^+$ ports to be added by the rewrite and boundary ports $B$.

Every port is attached to a unique hyperedge. This should follow from a hyperedge <=> SSA equivalence
Hyper-edges are associative? Fan-outs are associative (copy): for a hyperedge a -> (b, c, d) (output a used in inputs b, c, d), I can rewrite this as a -> (b, x -> (c, d)) (x is some temporary noop).

The latter is what allows me to match a pattern with a binary copy on a ternary copy at the boundary. I don't know that this holds in other cases (CFG?) -- in which case defining the conditions of what is a valid set of boundary ports becomes dependent on node types (or maybe just edge direction?). How would this be specified?

zrho commented 5 days ago

That is a matter of definition

Indeed. I was working with the following definition, which most closely matches what the implementation does:

a set of nodes $N$,
a set of ports $P$,
a set of hyperedges $E$,
functions $\text{in}: N \to P^*$ and $\text{out}: N \to P^*$ of input and output port lists,
a function $\text{edge} : P \to E$,

so that each $p \in P$ occurs exactly once in an input or output port list. This is essentially a hypergraph category, modified to be directed and have ordered port lists. Every port is attached to a unique hyperedge, as you assumed.

With respect to the associativity assumption: yes. For dataflow graphs, hyperedges are essentially the spiders that arise from a cocommutative copy comonoid. For control flow graphs, they are the spiders for the commutative monoid on branches, where multiplication is control flow merges and the unit is a branch that is never taken. Overall, it'd probably be safe to work with it as if each hyperedge was the spider for a special commutative Frobenius algebra. Valid rewrites of data flow should preserve the comonoid fragment, and valid rewrites of control flow the monoid fragment.

(This would be significantly easier if we used monogamous hypergraphs with explicit copies and control flow merges, which is one of the reasons why I'd prefer that model. But so far I've not managed to convince the others of this.)

Boundaries

A boundary of a pattern should not consist of ports, but of hyperedges, I think.

Why the subsets?

It appears important to you that matches are expressed uniquely by subsets, instead of embeddings or maps in general. Why? Any pattern matching algorithm that I can imagine, applied to some family of patterns $F$ in a multiportgraph $Q$, already records sufficient information to (in principle) construct the pattern $P \in F$ together with the map $P \to Q$. Of course you don't need to realise $P$ as an actual graph, but you have information on what matches where. From that you can determine whatever subsets of $Q$ you want, but not the other way around.

(I haven't checked but the structure described at the top of this comment does not look like it'd form an adhesive category. So I'd be careful in trying to use too much intution from DPO for graphs here. I suspect it might be quasiadhesive.)

acl-cqc commented 4 days ago

While the semantics excludes this currently both in dataflow and control flow, the data structure allows it

Indeed - normally we have an implicit/invisible "copy" node, for dataflow edges only, with a single inport but arbitrary outports; and an implicit/invisible "merge" node for controlflow edges only, with a single outport but arbitrary inports. We could have an implicit "something" node with arbitrary inports and outports, even if it doesn't make sense for any current edge type.

So while you can represent every "multiportgraph" with binary edges, not every graph expressed with binary edges is also a valid multiportgraph. I'd be very reluctant to change that model. It might not be immediately natural when thought of as a graph, but I think it is a great fit for an IR.

Love this! :)

acl-cqc commented 4 days ago

A boundary of a pattern should not consist of ports, but of hyperedges, I think.

These are equivalent (any port is part of exactly one hyperedge), so this is not for correctness, but conceptual neatness? (In which case, yes that works for me!)

acl-cqc commented 4 days ago

@lmondada is there actually a Hugr API that takes "an edge" (e.g. there is no remove-edge)? I mean, if you identify Hugr edges by (OutgoingPort, IncomingPort) in portmatching, does hugr need to do anything?

zrho commented 4 days ago

A boundary of a pattern should not consist of ports, but of hyperedges, I think.

These are equivalent (any port is part of exactly one hyperedge), so this is not for correctness, but conceptual neatness? (In which case, yes that works for me!)

I don't think they're equivalent: Consider the he pattern which matches a single hyperedge but no nodes. The source and target boundary of this is the hyperedge, but there are no nodes and therefore no ports in the pattern. This might be a bit degenerate, so here's another: The pattern which matches a node with two input ports, but requires the input ports to be of the same hyperedge: Untitled-2023-12-22-1619 Essentially this pattern matches the operation together with a copy. It appears to me that you can't express this just using ports.

lmondada commented 4 days ago

My problem all along can be reformulated by saying that if edges can have more than two ends, then we need to specify which ends of hyper edges are part of a match. It is strictly more general than matching hyperedges (corresponds to matching all ports attached to that edge), but can also specify matches that only apply to a subset of the ports of a hyperedge: think of an output that is copied three times, matching an input boundary that takes two copies.

This might be equivalent to matching hyperedges but expanding them into several edges using the commutative Frobenius algebra whenever necessary.

zrho commented 4 days ago

(I haven't checked but the structure described at the top of this comment does not look like it'd form an adhesive category. So I'd be careful in trying to use too much intution from DPO for graphs here. I suspect it might be quasiadhesive.)

I have meditated a bit more on this, and it now appears to me that you can make our multiportgraphs form an adhesive category. Consider the category $\mathbb{M}$ with the following objects and morphisms:

An object $N_{i, j}$ for every $i, j \geq 0$.
An object $E$.
A morphism $\text{in}_k : N_{i, j} \to E$ for every $0 \leq k < i$.
A morphism $\text{out}_k : N_{i, j} \to E$ for every $0 \leq k < j$.

Then (a candidate for) the category of multiportgraphs is the category of functors $X : \mathbb{M} \to \text{Set}$. $X(N_{i, j})$ is the set of nodes with input arity $i$ and output arity $j$, $X(E)$ is the set of hyperedges, and the functions tell you which hyperedge the ports are connected to. The result is a topos and therefore adhesive; perhaps spelling out DPO rewriting in there would be helpful?

zrho commented 4 days ago

think of an output that is copied three times, matching an input boundary that takes two copies.

So let's say we have a rewrite that takes an int that is copied and then added to itself, and rewrites it into a doubling: Untitled-2023-12-22-1619(3) Note that this pattern relies on the corner case I mentioned above where the boundary is given by a hyperedge but can't be given by ports alone.

We can then have a program that produces an int, copies it three times. The first two copies are added, the third does something else: Untitled-2023-12-22-1619(1)

Applying our rewrite to this program should result in this modified program: Untitled-2023-12-22-1619(2)

So far, is this the type of scenario you had in mind?

As far as I can tell, the following is a perfectly valid double pushout rewrite (in the category that I have sketched above): Untitled-2023-12-22-1619(4)

Does that address your issue for that example? And do you have other potentially problematic cases?

zrho commented 4 days ago

Another interesting example: Say we can rewrite f followed by a into the identity, but happen to have a copy of the intermediate value: Untitled-2023-12-22-1619(6) I think there is no way to fill in the box marked with ??? such that the square becomes a pushout square. Any candidate to fill in that box would have to include b and therefore also the middle hyperedge, but then the pushout would copy that hyperedge.

This is great! The rewrite arguably shouldn't be possible, since b still needs the output of f but the rewrite would remove that. A similar case occurs when the rewrite is a fusion of f and a: Untitled-2023-12-22-1619(7) This could be vertical loop fusion, for instance, in which case it would be a likely performance win to join the operations if the intermediate result is not used otherwise but potentially a big performance hit if it is.

acl-cqc commented 4 days ago

The pattern which matches a node with two input ports, but requires the input ports to be of the same hyperedge:

Nice example! :). I think here I'd have said that the boundary was the unique outport connected to both inports; I guess that'd mean the (hyper)edge was in the pattern, but the source node was not. (Really I've always thought of the pattern containing only nodes, not edges, but that only works with explicit copies - which I'm not against myself.) There are more how-to-define questions here - does a pattern contain its boundary, or is the boundary expressed in terms of a different type of thing (e.g. edge or port) to the things in the pattern (e.g. just nodes, or nodes+edges but not ports).

Say we can rewrite f followed by a into the identity, but happen to have a copy of the intermediate value....[snip] ....the rewrite arguably shouldn't be possible, since b still needs the output of f but the rewrite would remove that.

What if we wanted to argue the rewrite was possible - keeping 'f' but removing 'a' - as this would potentially enable far more rewriting downstream (nodes after the 'a' are now connected directly to those before the 'f') ?

(Also, I should stress that we should be thinking in terms of a second pass to select which applicable rewrites to perform, rather than greedily applying anything that matches.)

zrho commented 4 days ago

What if we wanted to argue the rewrite was possible - keeping 'f' but removing 'a' - as this would potentially enable far more rewriting downstream (nodes after the 'a' are now connected directly to those before the 'f') ?

I've just worked this out in the fusion example, but it would work similarly in the rewrite to identity scenario. To be able to keep f in the double pushout scenario, you would have to include it in the rewrite: Untitled-2023-12-22-1619(9) If the intermediate result turns out no to be used, f is still left in, but dead code elimination could then get rid of it: Untitled-2023-12-22-1619(10)

For the sake of completeness, rewriting f followed by a to the identity while allowing to reuse the result of f would look like this: Untitled-2023-12-22-1619(11) The horizontal maps on the right merge the two orange hyperedges into one.