dominikbraun / graph

A library for creating generic graph data structures and modifying, analyzing, and visualizing them.
https://graph.dominikbraun.io
Apache License 2.0
1.82k stars 96 forks source link

stable topological sort #61

Closed warpfork closed 1 year ago

warpfork commented 1 year ago

Sometimes it is desirable to have a stable output for operations such as topological sort.

There are two ways of doing this (depending on if the goal is stability, or merely determinism):

I tend to favor the former approach, because it provides both properties at once (and also because it feels more general: edge order is a nice tiebreaker to have in a lot of scenarios, not just this one). But either works for producing deterministic outputs.

The current implementation can produce random reordering among nodes that aren't constrained. (I assume this is because a golang map is used somewhere.) Sometimes this is fine! But in some scenarios, less fune: unstable output ordering is a big bummer if you have application level predictability desires... Or even just are doing text fixtures which depend on consistency :)

geoah commented 1 year ago

Hey @warpfork :)

Yeah, this is something I need as well but unfortunatelly it's not just one map that's causing the issue. - Ignoring the underlying structure the edges/vertices are stored, the AdjacencyMap() and PredecessorMap() methods that are used by the various graph methods both return map[K]map[K]Edge[K].

Maybe something like the following could work? I think it should be enough to satisfy the needs of the various callers. Only reason this is an interface is to allow transparently refactoring the implementation in the future.

// placeholder names for struct/methods, suggestions welcome
type Edges interface {
  Range(func(K, []K) bool)
  GetPredecessors(K) []K
}

@dominikbraun I wonder if this could be tackled at the same time as #62?

dominikbraun commented 1 year ago

@dominikbraun I wonder if this could be tackled at the same time as #62?

@geoah Sure, there's no need to refactor the edge storage twice.

geoah commented 1 year ago

I was taking a quick look at this and since comparable only allows for equality and not ordering, we can only go as far as making the sorting stable given the order the edges were added is the same.

Contraints like like constraints.Ordered would significantly limit what we could use for hashes, same with specifying an interface for hashes that could enable ordering.

Is there an obvious way of sorting comparables or something similar that I might be missing?

deitch commented 1 year ago

Ah, I just came across this. I too am using TopologicalSort(), and couldn't figure out why I was getting slightly different ordering each time, until I dug into it.

In my case, I would perfectly happy to do something like, "nodes at the same level, sort lexicographically by key", anything as long as it is consistent.

Has anyone figured a workaround to it?

dominikbraun commented 1 year ago

@deitch I'm not sure if there is a workaround with the existing functions other than implementing it yourself, but this issue is gaining more and more significance and hence will be tackled in one of the very next releases.

deitch commented 1 year ago

other than implementing it yourself

That seems like a bit of a waste.

I have been thinking a bit about the UI. Are we thinking that we would have some additional property to each vertex, e.g. SortWeight, such that it would be considered between various vertices, all else being equal? Or maybe SortFunc, the same way as sortSlice() takes a less func?

Actually, I rather like that last option. It keeps you from having to predetermine what algorithms work for whom, and just let people set their own sort priority. If you want consistently stable, you always could have a default func that can get overridden.

How can I help?

dominikbraun commented 1 year ago

@deitch I'd also prefer a less function for direct comparisons. That would be one way. The other way would be to keep track of the order in which the edges have been added to the graph as proposed by @warpfork. Would this approach work for you, too?

deitch commented 1 year ago

keep track of the order in which the edges have been added to the graph

I am concerned that it can be arbitrary. For example, let's say I have a graph with 5 nodes with no dependencies, and 5 direct dependencies, one of each. Something like:

A -> 1
B -> 2
C -> 3
D -> 4
E -> 5

I would think a normal expectation would be that it wouldn't matter what order I added them, the graph is the same, and so the sort order should not depend if I added A->1 and later B->2 or the reverse.

However, if time does matter for some people, we could track when the nodes where added, and then they can use the less style function to track it to meet their needs.

dominikbraun commented 1 year ago

Yes, I think you're right. A less-function-style or lexicographical-style ordering of vertices is a saner default than a time-based ordering, at least when it comes to reproducibility.

When you are given a mere adjacency map or a representation like in your example, you can't make any assumptions about the order in which the edges have been added but only about the ordering of vertices that makes sense for your case.

We could provide a new function like StableTopologicalSort (or something similar, to maintain backwards compatibility) which takes a less function. If the user needs a time-based ordering, they could do this themselves within the less function.

deitch commented 1 year ago

Yup, answers all use cases.

I don't know that you need a new function. I would just use the existing one. Is there any case where someone might want to have non-stable sort intentionally?

Ah, or are you thinking, if we want to be able to sort according to multiple algorithms, and so we should pass the less function, rather than make it a property of the vertex? I can see that. One time I might call sort with one function, another time with another, so it is important to have an argument, and hence we need an argument and therefore changing the func or adding a new one? Got it.

dominikbraun commented 1 year ago

Ah, or are you thinking, if we want to be able to sort according to multiple algorithms, and so we should pass the less function, rather than make it a property of the vertex?

Exactly. The less function doesn't need to be a vertex property because it's going to be the same for all vertices, and it could be different for functions other than TopologicalSort, so accepting it as an argument would make sense.

deitch commented 1 year ago

If I can help, let me know.

dominikbraun commented 1 year ago

If you want, you can either help by implementing the new function, writing tests for the new function, or verify/test the new function once it is implemented. In any case, feedback is appreciated :-) I'm going to start with the implementation this weekend if nobody else wants to do it.

deitch commented 1 year ago

I wish I could tell you I would get to it this weekend, but I likely won't have time for another week. But post here as you go through it, and I will jump in as I can. I probably can be more helpful with test cases anyways, as that does not require knowing the internal guts of the graph implementation.

dominikbraun commented 1 year ago

@warpfork @tonglil @vito @deitch @williamfzc

This change has been released in graph v0.22.0.

deitch commented 1 year ago

And a beautiful thing it is. Enjoying it right now.

deitch commented 1 year ago

@dominikbraun I am seeing major performance degradation with the new sort, even if I do not use the stable function. A program that used to take 42s, with the sort func not even showing up in the flame graph, is now taking ~222s, with the sort the single largest factor. I will open a new issue and try to replicate it in a sample.