dominikbraun / graph

A library for creating generic graph data structures and modifying, analyzing, and visualizing them.
https://graph.dominikbraun.io
Apache License 2.0
1.8k stars 95 forks source link

stable sort has major performance degradation #129

Open deitch opened 1 year ago

deitch commented 1 year ago

The new ToplogicalSort(), based on its StableTopologicalSort(), has really large performance degradation.

A program that used to take 42s, now takes 220-300s (sometimes more), with the vast majority of the time in StableToplogicalSort(). The current usage is the OSS wolfictl, but I will try to create a reproduction that is focused just on this.

The flame graphs below are from running that with the exact same inputs. The only difference is if I use v0.21.0 or v0.22.0

The first graph is calling StableToplogicalSort() with v0.22.0:

graph-stable-sort-v0220

The second is calling TopologicalSort() with v0.22.0 (unchanged, which isn't surprising, as it just wraps StableTopologicalSort():

graph-topological-v0220

The third is calling TopologocialSort() with v0.21.0:

graph-v0210

In the last, the sort doesn't even take enough time to be bothered into the list.

It might not be the sort itself, as other things might have changed between v0.21.0 and v0.22.0 🤷‍♂️

deitch commented 1 year ago

In the meantime, it should be easy to replicate with a graph that is [string, string], create a graph with maybe 600 nodes and a few edges among them, make sure it isn't cyclical, and then call TopologicalSort(). But I will try to recreate it.

dominikbraun commented 1 year ago

Thank you for filing this issue. I did some research and decided to proceed as follows:

I will create a patch release that reverts TopologicalSort to the old implementation so that it remains unaffected by the performance issues of StableTopolocialSort.

Then I'm going to take a look at the implementation of StableTopologicalSort and its performance penalties. I'm pretty sure the frequent calls to sort.Slice are the problem, and I will have to dig into its performance characteristics and time complexity first. Maybe there is a more time-efficient solution.

I've created a program that will create a disconnected directed graph consisting of simple vertex chains and run StableTopologicalSort on it.

package main

import (
    "github.com/dominikbraun/graph"
)

const (
    numberOfVertices = 12
    lengthOfChains   = 4
)

func main() {
    if numberOfVertices%lengthOfChains != 0 {
        panic("numberOfVertices must be divisible by lengthOfChains")
    }

    numberOfRows := numberOfVertices / lengthOfChains

    g := graph.New(graph.IntHash, graph.Directed())

    for i := 1; i <= numberOfVertices; i++ {
        _ = g.AddVertex(i)
    }

    vertex := 1

    for i := 1; i <= numberOfRows; i++ {
        for j := 1; j <= lengthOfChains; j++ {
            if j < lengthOfChains {
                if err := g.AddEdge(vertex, vertex+1); err != nil {
                    panic(err)
                }
            }
            vertex++
        }
    }

    _, _ = graph.StableTopologicalSort(g, func(a, b int) bool {
        return a < b
    })
}

The number of vertices and the chain length can be set using numberOfVertices and lengthOfChains. Note that numberOfVertices has to be a multiple of lengthOfChains. Maybe this is enough for testing the performance.

Once this is fixed, I'll release another patch release containing the optimized implementation.

deitch commented 1 year ago

I created a slight variant that has a much larger graph, and includes generating a profile file:

package main

import (
    "fmt"
    "log"
    "os"
    "runtime/pprof"

    "github.com/dominikbraun/graph"
)

const (
    numberOfVertices = 500
    lengthOfChains   = 125
    profFile         = "/tmp/cpu.prof"
)

func main() {
    if numberOfVertices%lengthOfChains != 0 {
        panic("numberOfVertices must be divisible by lengthOfChains")
    }

    numberOfRows := numberOfVertices / lengthOfChains

    g := graph.New(graph.IntHash, graph.Directed())

    for i := 1; i <= numberOfVertices; i++ {
        _ = g.AddVertex(i)
    }

    vertex := 1

    for i := 1; i <= numberOfRows; i++ {
        for j := 1; j <= lengthOfChains; j++ {
            if j < lengthOfChains {
                if err := g.AddEdge(vertex, vertex+1); err != nil {
                    panic(err)
                }
            }
            vertex++
        }
    }

    f, err := os.Create(profFile)
    if err != nil {
        log.Fatal(err)
    }
    defer f.Close()
    pprof.StartCPUProfile(f)
    defer pprof.StopCPUProfile()

    _, _ = graph.StableTopologicalSort(g, func(a, b int) bool {
        return a < b
    })
    fmt.Printf("saved output profile to %s\n", profFile)
}

Even with a 500-node graph, it still is giving only 250ms to sort the graph. How strange that mine give orders of magnitude more. And I was careful here to profile only starting before the sort, not before the creation.

I wonder if it could have to do with the node type? I will try that next.

deitch commented 1 year ago

FYI, I tried adding graph.Acyclic() and graph.PreventCycles(), didn't have a material impact

deitch commented 1 year ago

I modified it to give different string sizes, and different chain sizes. Here are some interesting outputs.

number of vertices: 500
length of chains: 125
string 10 sort time was 1.972775917s
string 20 sort time was 1.175261542s
int sort time was 278.591834ms

number of vertices: 500
length of chains: 25
string 10 sort time was 1.518682292s
string 20 sort time was 1.608444458s
int sort time was 1.466619s

number of vertices: 500
length of chains: 25
string 10 sort time was 1.618567583s
string 20 sort time was 2.099522875s
int sort time was 1.544102917s

number of vertices: 500
length of chains: 10
string 10 sort time was 1.839373541s
string 20 sort time was 1.667635875s
int sort time was 3.910998084s

We can try and get that into a table, but basically:

No idea what is going on. I can try and convert this into a general testing regimen program.

here is the basic one I am using:

package main

import (
    "fmt"
    "log"
    "math/rand"
    "os"
    "runtime/pprof"
    "time"

    "github.com/dominikbraun/graph"
)

const (
    numberOfVertices = 500
    lengthOfChains   = 10
    profFile         = "/tmp/cpu.prof"
)

var letters = []rune("abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ")

func randSeq(n int) string {
    b := make([]rune, n)
    for i := range b {
        b[i] = letters[rand.Intn(len(letters))]
    }
    return string(b)
}

func main() {
    if numberOfVertices%lengthOfChains != 0 {
        panic("numberOfVertices must be divisible by lengthOfChains")
    }

    stringg10, stringComp10 := stringGraph(10)
    stringg20, stringComp20 := stringGraph(20)
    intg, intComp := intGraph()

    f, err := os.Create(profFile)
    if err != nil {
        log.Fatal(err)
    }
    defer f.Close()
    pprof.StartCPUProfile(f)
    defer pprof.StopCPUProfile()

    string10Start := time.Now()
    _, _ = graph.StableTopologicalSort(stringg10, stringComp10)
    string10Duration := time.Since(string10Start)
    string20Start := time.Now()
    _, _ = graph.StableTopologicalSort(stringg20, stringComp20)
    string20Duration := time.Since(string20Start)
    intStart := time.Now()
    _, _ = graph.StableTopologicalSort(intg, intComp)
    intDuration := time.Since(intStart)

    fmt.Printf("number of vertices: %d\n", numberOfVertices)
    fmt.Printf("length of chains: %d\n", lengthOfChains)
    fmt.Printf("string 10 sort time was %s\n", string10Duration)
    fmt.Printf("string 20 sort time was %s\n", string20Duration)
    fmt.Printf("int sort time was %s\n", intDuration)
    fmt.Printf("saved output profile to %s\n", profFile)
}

func stringGraph(size int) (graph.Graph[string, string], func(a, b string) bool) {
        rand.Seed(time.Now().UnixNano())

        numberOfRows := numberOfVertices / lengthOfChains

        g := graph.New(graph.StringHash, graph.Directed(), graph.Acyclic(), graph.PreventCycles())

        var vertices []string
        for i := 0; i < numberOfVertices; i++ {
                vertex := randSeq(size)
                vertices = append(vertices, vertex)
                _ = g.AddVertex(vertex)
        }

        vertex := 0

        for i := 0; i < numberOfRows; i++ {
                for j := 0; j < lengthOfChains; j++ {
                        if vertex < len(vertices) - 1 {
                                if err := g.AddEdge(vertices[vertex], vertices[vertex+1]); err != nil {
                                        panic(err)
                                }
                        }
                        vertex++
                }
        }
    return g, func(a, b string) bool {
        return a < b
    }
}

func intGraph() (graph.Graph[int, int], func(a, b int) bool) {
        numberOfRows := numberOfVertices / lengthOfChains

        g := graph.New(graph.IntHash, graph.Directed(), graph.Acyclic(), graph.PreventCycles())

        for i := 1; i <= numberOfVertices; i++ {
                _ = g.AddVertex(i)
        }

    vertex := 1

    for i := 1; i <= numberOfRows; i++ {
        for j := 1; j <= lengthOfChains; j++ {
            if j < lengthOfChains {
                if err := g.AddEdge(vertex, vertex+1); err != nil {
                    panic(err)
                }
            }
            vertex++
        }
    }

    return g, func(a, b int) bool {
        return a < b
    }
}
deitch commented 1 year ago

FWIW, I tried replacing the less argument with a sort.Interface arg so I could call sort.Sort() instead of sort.Slice(). It seems to shave off about 15%, not insignificant, but not a ton either.

Here is part of the updated dag.go, in case it is helpful.

func TopologicalSort[K comparable, T any](g Graph[K, T]) ([]K, error) {
        return StableTopologicalSort(g, nil)
}

func StableTopologicalSort[K comparable, T any](g Graph[K, T], sorter func([]K)  sort.Interface) ([]K, error) {
        if !g.Traits().IsDirected {
                return nil, fmt.Errorf("topological sort cannot be computed on undirected graph")
        }

        predecessorMap, err := g.PredecessorMap()
        if err != nil {
                return nil, fmt.Errorf("failed to get predecessor map: %w", err)
        }

        queue := make([]K, 0)

        for vertex, predecessors := range predecessorMap {
                if len(predecessors) == 0 {
                        queue = append(queue, vertex)
                }
        }

        order := make([]K, 0, len(predecessorMap))
        visited := make(map[K]struct{})

        if sorter != nil {
                sort.Stable(sorter(queue))
        }

        for len(queue) > 0 {
                currentVertex := queue[0]
                queue = queue[1:]

                if _, ok := visited[currentVertex]; ok {
                        continue
                }

                order = append(order, currentVertex)
                visited[currentVertex] = struct{}{}

                for vertex, predecessors := range predecessorMap {
                        delete(predecessors, currentVertex)

                        if len(predecessors) == 0 {
                                queue = append(queue, vertex)
                                if sorter != nil {
                                        sort.Stable(sorter(queue))
                                }
                        }
                }
        }
       // skipping lots of lines
}

You can replace sort.Stable() with sort.Sort() for the other implementation.

In case the updated graphs are helpful, here they are:

sort.Stable:

graph-sort-stable

sort.Sort:

graph-sort-sort

sort.Slice (current implementation):

graph-sort-slice

Finally, I decided to try the experimental new generic sorting package at https://pkg.go.dev/golang.org/x/exp/slices#SortFunc

graph-slices-new

The signatures for the last one aren't changed at all, just the content of StableToplogicalSort():

import (
        "golang.org/x/exp/slices"
)

// ...
        if less != nil {
                slices.SortFunc(queue, less)
        }

instead of sort.Slice() in the 2 places is it called.

These all seem to help. It isn't too bad getting it from 208s to 173-179s, and then down to 158s. For a true performance test, I would need to run each one hundreds of times. I hope this helps.

dominikbraun commented 1 year ago

graph v0.22.1 reverts the implementation of TopologicalSort.

deitch commented 1 year ago

Do we want to pursue any of those Sort alternatives?

dominikbraun commented 1 year ago

@deitch Sorry for the late feedback, I've been busy in my new job the last couple of days. I've re-opened the issue because I'd like to keep track of the performance and maybe go for a ideal or close-to-ideal implementation using a binary heap for the internal queue, as stated in #131.

deitch commented 1 year ago

Congratulations. That's a good reason. Share your LinkedIn profile here for a minute (can delete later), and I'll connect.

dominikbraun commented 1 year ago

@deitch https://www.linkedin.com/in/dominikbraun-io/

deitch commented 1 year ago

Connection sent.