Parallel graph algorithms

gdalle commented 10 months ago

Some resources to learn more about parallel graph algorithms and pick a topic of interest

Reading list:

Chapters 1-6 of Guide to graph algorithms: sequential, parallel, distributed (look for the PDF on LibGen). You don't have to read all of it in detail
Theoretically Efficient Parallel Graph Algorithms Can Be Fast and Scalable

Assignment:

Review the state of parallel algorithms in Graphs.jl

gdalle commented 8 months ago

26.02.2024

Presented a proposition (see below)

Student tasks:

Create a package called ParallelGraphAlgorithms.jl following https://modernjuliaworkflows.github.io/sharing/ that depends on Graphs.jl
Implement useful parallel primitives: reduce, scan, filter, vertexmap, edgemap
Possibly use https://github.com/JuliaFolds2/OhMyThreads.jl
Find exhaustive description of (sequential) low diameter decomposition
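
A minimal sketch of what three of these primitives (reduce, filter, vertexmap) could look like with only `Base.Threads`, before reaching for OhMyThreads.jl. The names `chunk_ranges`, `parallel_reduce`, `parallel_filter` and `vertex_map!` are hypothetical and not taken from any package; `init` is assumed to be a neutral element for `op`.

```julia
using Base.Threads: @spawn, nthreads

# Split 1:n into at most `nchunks` contiguous ranges (hypothetical helper).
function chunk_ranges(n::Integer, nchunks::Integer)
    nchunks = max(1, min(nchunks, n))
    bounds = [div(i * n, nchunks) for i in 0:nchunks]
    return [(bounds[i] + 1):bounds[i + 1] for i in 1:nchunks]
end

# Reduce each chunk in its own task, then combine the partial results.
# `init` must be a neutral element for `op` (0 for +, 1 for *, ...).
function parallel_reduce(op, xs::AbstractVector; init, ntasks::Integer=nthreads())
    tasks = map(chunk_ranges(length(xs), ntasks)) do r
        @spawn reduce(op, view(xs, r); init=init)
    end
    return reduce(op, fetch.(tasks); init=init)
end

# Filter each chunk in its own task, then concatenate in chunk order.
function parallel_filter(pred, xs::AbstractVector; ntasks::Integer=nthreads())
    tasks = map(chunk_ranges(length(xs), ntasks)) do r
        @spawn filter(pred, view(xs, r))
    end
    return reduce(vcat, fetch.(tasks); init=eltype(xs)[])
end

# Apply `f` to every vertex index and store the result in `out`.
# Each task writes to a disjoint slice, so no atomics are needed.
function vertex_map!(f, out::AbstractVector, nvertices::Integer; ntasks::Integer=nthreads())
    @sync for r in chunk_ranges(nvertices, ntasks)
        @spawn for v in r
            out[v] = f(v)
        end
    end
    return out
end
```

Chunking keeps the number of tasks close to the number of threads, which limits scheduling overhead compared to one task per element.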

Instructor tasks:

Read "Theoretically Efficient Parallel Graph Algorithms Can Be Fast and Scalable" in detail
Brush up on low diameter decomposition

If interested:

Graphs community call https://julialang.org/community/ (see calendar)

Next meeting: Friday, to discuss package creation

AntoineBut commented 8 months ago

Parallel Graphs in Julia: proposed plan

- Implement some primitives needed by many parallel algorithms, potentially in utils.jl:
  - Reduce (already implemented)
  - Scan
  - Filter
  - EdgeMap & VertexMap

- Implement a first complete algorithm:
  - Low-Diameter Decomposition (used in many other algorithms)
  - Parallel speed-up benchmark (in addition to correctness tests)

- Implement a second, more complex algorithm:
  - Graph Coloring (NP-hard)
  - Benchmarks: random input, worst-case input (a large difference can be expected)
  - Comparison of the results obtained with the theoretical guarantees, statistical analysis

AntoineBut commented 8 months ago

The implementation of "Low Diameter Decomposition" mentioned in the research paper "Theoretically Efficient Parallel Graph Algorithms Can Be Fast and Scalable" seems to be based on this paper: Parallel Graph Decompositions Using Random Shifts. The algorithm is described fairly clearly there in about ten pages; I think it is a good starting point.

gdalle commented 8 months ago

Cool, I'll take a look before our meeting on Friday!

gdalle commented 8 months ago

01.03.2024

Work done

Set up repo

Tasks

Check that GitHub actions work and finish set up (Documenter, CodeCov)
Implement primitives in 2 ways:
- atomic: naive or with Base.Threads.Atomic
- parallel: sequential or with OhMyThreads.jl
Implement BFS as a first algo, try to measure performance in sequential and parallel
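
To make the "naive vs atomic" distinction concrete, here is a small sketch of a shared counter updated from several threads. The function name `count_atomic` is hypothetical; `Atomic`, `atomic_add!` and `@threads` are the `Base.Threads` names. The naive variant is only shown as a comment because it has a data race.

```julia
using Base.Threads: Atomic, atomic_add!, @threads

# Naive (racy) variant, for comparison only:
#     counter = 0
#     @threads for x in xs
#         pred(x) && (counter += 1)   # unsynchronized read-modify-write
#     end
# Several threads can read the same old value, so increments may be lost.

# Atomic variant: every increment is a single indivisible operation.
function count_atomic(pred, xs)
    counter = Atomic{Int}(0)
    @threads for x in xs
        pred(x) && atomic_add!(counter, 1)
    end
    return counter[]
end
```

Atomics keep the result correct, but every update contends for the same memory location, so per-chunk partial counts followed by a final sum are usually worth comparing against.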

gdalle commented 8 months ago

06.03.2024

Work done

First implems in https://github.com/KassFlute/ParallelGraphs.jl

Tasks

BFS running with tests and benchmarks

Specification

BFS should return the exploration tree: a vector of parents and possibly a vector of distances
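
A sequential reference sketch of this specification, useful as a baseline for tests. It works on a plain adjacency list (`Vector{Vector{Int}}`) rather than a Graphs.jl graph to stay self-contained. The conventions used here are assumptions of the sketch: the source is its own parent, unreached vertices get parent `0` and distance `-1`.

```julia
# Sequential BFS returning the exploration tree.
# Assumed conventions: parents[source] == source, unreached => (0, -1).
function bfs_tree(adj::Vector{Vector{Int}}, source::Int)
    n = length(adj)
    parents = zeros(Int, n)
    dists = fill(-1, n)
    parents[source] = source
    dists[source] = 0
    frontier = [source]
    while !isempty(frontier)
        next = Int[]
        for u in frontier, v in adj[u]
            if parents[v] == 0          # first time we reach v
                parents[v] = u
                dists[v] = dists[u] + 1
                push!(next, v)
            end
        end
        frontier = next
    end
    return parents, dists
end
```

For example, on the path 1 - 2 - 3 with an isolated vertex 4, `bfs_tree([[2], [1, 3], [2], Int[]], 1)` gives parents `[1, 1, 2, 0]` and distances `[0, 1, 2, -1]`.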

Performance

Use better priority queue in https://github.com/JuliaCollections/DataStructures.jl
Use Base.Threads for atomic operations and either Base.Threads or OhMyThreads.jl for high-level (map, reduce, etc => EdgeMap and VertexMap)

Testing

Add Graphs.jl to test dependencies
Test special cases:
- empty graph
- complete graph
- graph with 2 connected components
- directed and undirected graphs
- weighted graphs (doesn't change the algo)

Benchmarking

For the benchmark instances, some graph properties are important:
- no isolated vertices, ideally connected
- sparse graph (constant expected degree as the size grows)
Avoid benchmarking with globals: read beginning of https://juliaci.github.io/BenchmarkTools.jl/stable/manual/
Use https://github.com/tkf/BenchmarkCI.jl to run benchmarks for each pull request (with optional label) => work on branches of the repo, not forks
Skip benchmarking tuning by setting evals=1 after each @benchmarkable (the functions are slow enough)

Good practices

Style: https://github.com/JuliaDiff/BlueStyle?tab=readme-ov-file#module-imports
JET is misbehaving: update the Julia version and the JET version

Questions

Atomic seems deprecated? Investigate

AntoineBut commented 8 months ago

Parallel weighted BFS :

Implementation is significantly more complicated than for the unweighted variant. It requires a "Bucket" data structure as introduced in https://www.cs.umd.edu/~laxman/papers/Bucketing.pdf .

I will look into this for the next meeting.

gdalle commented 8 months ago

Interesting, but I would start by getting the unweighted version up and running!

gdalle commented 8 months ago

15.03.2024

Work done

Make BFS return tree
Parallel BFS runs
Parallel version is still slower
CI hard to run with multiple threads: see here for the config: https://github.com/gdalle/HiddenMarkovModels.jl/blob/f7cf63b48fb4853376071772ce35c55a73f57e5c/.github/workflows/benchmark.yml#L23-L31
Set the number of threads without restarting Julia? Was possible in the JuliaFolds ecosystem (https://juliafolds.github.io/data-parallelism/howto/faq/#set-nthreads-at-run-time), not sure it can be done with JuliaFolds2 and in particular OhMyThreads

During meeting

Learned to profile with @profview in VSCode
Learned to dive into a function with Cthulhu.jl
No need to parallelize over both sources and neighbors
Profile the function once storage (atomic) has been allocated
Overhead of launching a thread is too big for what is done with each source: look into tasks and spawning instead

Tasks

Read stuff
Two parallel operations:
- parent updates (atomic)
- queue pushes (can be separate with a final merge)
Benchmark without parents and queue allocations
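
The two parallel operations can be sketched as a level-synchronous BFS: parent updates go through a compare-and-swap so that exactly one task claims each vertex, and each task pushes to its own queue, merged at the end of the level. This is a sketch under assumptions, not the implementation in the repo: the graph is a plain adjacency list, `parallel_bfs` is a hypothetical name, and `0` marks an unvisited vertex.

```julia
using Base.Threads: Atomic, atomic_cas!, @spawn, nthreads

function parallel_bfs(adj::Vector{Vector{Int}}, source::Int; ntasks::Int=nthreads())
    # One atomic cell per vertex; 0 means "not visited yet" (assumed convention).
    parents = [Atomic{Int}(0) for _ in 1:length(adj)]
    parents[source][] = source
    frontier = [source]
    while !isempty(frontier)
        nchunks = min(ntasks, length(frontier))
        # Strided chunks of the current frontier, one task per chunk.
        chunks = [frontier[i:nchunks:end] for i in 1:nchunks]
        tasks = map(chunks) do chunk
            @spawn begin
                local_queue = Int[]              # task-local, no locking needed
                for u in chunk, v in adj[u]
                    # atomic_cas! returns the old value: 0 means we won the race.
                    if atomic_cas!(parents[v], 0, u) == 0
                        push!(local_queue, v)
                    end
                end
                local_queue
            end
        end
        # Final merge of the task-local queues into the next frontier.
        frontier = reduce(vcat, fetch.(tasks); init=Int[])
    end
    return [p[] for p in parents]
end
```

Which neighbor ends up as the parent of a vertex can vary between runs when several frontier vertices reach it in the same level, so tests should compare distances or check tree validity rather than exact parents on such graphs.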

gdalle commented 7 months ago

21.03.2024

Remarks

Benchmarking suite: setup phase to initialize mutated arguments, keep evals=1 to avoid starting the function with unclean inputs

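using BenchmarkTools, Graphs
g = SimpleGraph(100, 200)
# setup runs before every sample, so parents is always fresh; evals=1 means one call per sample
@btime bfs!(parents, $g) setup=(parents=zeros(Int, nv($g))) evals=1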

When no samples are collected in the profiling phase, run the function several times

repetitions = 10
@profview for r in 1:repetitions; bfs(g); end

Be careful when profiling a function that mutates its input: from the second run onwards, the input is no longer clean. Either pick an input large enough that a single run collects samples, or preallocate one clean input per repetition:

parents_several = [zeros(Int, nv(g)) for r in 1:repetitions]
@profview for r in 1:repetitions; bfs!(parents_several[r], g); end

Use thread-local storage instead of task-local for the queues at each BFS iteration. Put nthreads() queue objects inside a Channel, as done in https://juliafolds2.github.io/OhMyThreads.jl/stable/literate/tls/tls/#The-safe-way:-Channel
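
A sketch of the Channel pattern from that page, using only `Base`: a fixed pool of buffers lives in a `Channel`, each task borrows one, appends to it, and gives it back. The function name `collect_with_buffers` and the `Vector{Int}` buffer type are assumptions made for the example.

```julia
using Base.Threads: @spawn, nthreads

# `f(item)` is assumed to return a Vector{Int} of values to enqueue.
function collect_with_buffers(f, items; nbuffers::Int=nthreads())
    buffers = Channel{Vector{Int}}(nbuffers)
    for _ in 1:nbuffers
        put!(buffers, Int[])
    end
    @sync for item in items
        @spawn begin
            buf = take!(buffers)        # blocks until some buffer is free
            try
                append!(buf, f(item))   # only this task holds `buf` right now
            finally
                put!(buffers, buf)      # always return the buffer to the pool
            end
        end
    end
    close(buffers)                      # remaining buffers can still be drained
    return reduce(vcat, collect(buffers); init=Int[])
end
```

A buffer is owned by at most one task at a time, so no locking is needed on the buffers themselves. The order of the merged result is not deterministic.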

To do

Objective: 2x speedup for parallel BFS, at least on certain graphs
Plots showing how the speedup evolves as a function of graph size, average degree

gdalle commented 7 months ago

2024.03.27

Work done

More clever handling of queues and tasks
Improving benchmark suite
Sometimes faster than the existing version in Graphs.jl (task-local queues vs shared thread-safe queue)

Debugging

run(SUITE; verbose=true)
Channels close by default with the function-based constructor!
Speed up benchmarks by setting default BenchmarkTools parameters https://juliaci.github.io/BenchmarkTools.jl/stable/manual/#Benchmark-Parameters

Question

How to merge task-local queues more efficiently?
- sizehint! the global to_visit at the beginning
- Queue relies on internal storage that is close to a Vector: allows flushing all at once
- Delineate blocks in to_visit and then parallelize the flushing
- Merge queues recursively and in parallel with treduce
Get rid of the global queue and swing between two task-local queues? Imbalance to correct
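
One way to picture the "delineate blocks, then flush in parallel" idea: a prefix sum over the queue lengths gives each queue its own block in the global vector, and each block is filled by an independent `copyto!`. This sketch assumes the task-local queues are plain `Vector{Int}`; `merge_queues` is a hypothetical name.

```julia
using Base.Threads: @spawn

function merge_queues(queues::Vector{Vector{Int}})
    offsets = cumsum(length.(queues))   # end position of each block
    total = isempty(offsets) ? 0 : offsets[end]
    merged = Vector{Int}(undef, total)  # allocated once, at the right size
    @sync for (i, q) in enumerate(queues)
        start = offsets[i] - length(q) + 1
        # Blocks are disjoint, so the copies can run concurrently.
        @spawn copyto!(merged, start, q, 1, length(q))
    end
    return merged
end
```

Allocating the destination once replaces repeated growth of the global vector, and the result keeps the queues in their original order.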

gdalle commented 7 months ago

2024.04.10

Work done

Everything we ever did was wrong: the queues don't need to be in a thread-safe channel. Launch one task per chunk instead of one per vertex
More tasks than threads for dynamic load-balancing?
Now the runtime is dominated by atomic parent updates, queue merging is marginal
Benchmarks on many more graphs

Todo

Parse benchmarking results into a DataFrame with code like https://github.com/gdalle/HiddenMarkovModels.jl/blob/6bfb23a7684f3fcddfa51989716fdd88ed67c46f/libs/HMMBenchmark/src/suite.jl#L28-L49
Profile several iterations of the algorithm, cleaning used memory every time
Plot speedup as a function of key graph parameters (average degree, diameter)
Switch to one task per thread (max) with randomized chunking => this allows artificially restricting the number of threads without restarting Julia by giving fewer tasks
Shuffle to_visit to avoid locality effects?

Remarks

Check terminology of parallel algorithms to name things right

Presentation

Benchmark against the actual versions in Graphs.jl
Compare the implementation philosophies

gdalle commented 7 months ago

17.04.2024

Work done

Graphs for benchmarking against Graphs.jl: positive results
Very good presentation in front of the lab
Scaling to much larger graphs is possible by not putting them all in the RAM at the same time (tens of millions of vertices)

Todo

Avoid copying files, instead use Graphs.Parallel.Queue if it has the same name
Compare with networkx for the final report
Start implementing a BFS in the GraphBLAS framework with SuiteSparseGraphBLAS.jl
1. Return just distances
2. Try returning parents too

gdalle commented 6 months ago

2024.04.24

Work done

Started looking at GraphBLAS
First BFS implementation based on linear algebra that gives parents
Adjacency matrix makes more sense transposed, play between row and column storage?
- Don't care too much, GraphBLAS does it in the background
Multithreading doesn't work / show up in the profiler?
- for really large matrices, GraphBLAS starts using more threads
- https://docs.julialang.org/en/v1/manual/performance-tips/#man-multithreading-linear-algebra
- ask on Discourse
Still slower than sequential
- Not overly worrying
- When do we expect a speedup?
Adjacency matrix has integer elements?
- Switch to Bool
- Default Graphs.adjacency_matrix returns integers even for unweighted graph, which sucks => open an issue
GraphBLAS doesn't like multiplying a Bool matrix with an Int vector (no mixed types)
- use two vectors, one Bool and one Int, and let them communicate
- change the semiring?
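
To illustrate the linear-algebra view of BFS without depending on GraphBLAS, here is a sketch using the `SparseArrays` standard library as a stand-in (so none of this is the SuiteSparseGraphBLAS.jl API). It returns distances only. The assumption is that `A[u, v] != 0` encodes an edge from `u` to `v`, which is why the transposed matrix appears in the product; both the frontier and the distances are kept as `Int` vectors to avoid mixed element types.

```julia
using SparseArrays

# BFS distances via repeated sparse matrix-vector products.
# Assumption: A[u, v] != 0 means there is an edge u -> v.
function bfs_distances_linalg(A::SparseMatrixCSC, source::Int)
    n = size(A, 1)
    dist = fill(-1, n)                  # -1 marks unreached vertices
    frontier = zeros(Int, n)
    frontier[source] = 1
    level = 0
    while any(!iszero, frontier)
        dist[frontier .!= 0] .= level
        # reached[v] counts the frontier vertices with an edge into v.
        reached = transpose(A) * frontier
        # Keep only newly discovered vertices (mask out the visited ones).
        frontier = Int.((reached .!= 0) .& (dist .== -1))
        level += 1
    end
    return dist
end
```

With the ordinary (+, *) arithmetic used here, the product counts incoming frontier edges and only the zero/nonzero pattern is kept afterwards; a Boolean or min-based semiring would express the same step more directly.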

AntoineBut commented 6 months ago

Graph BLAS algorithms descriptions :

http://mit.bme.hu/~szarnyas/grb/graphblas-introduction.pdf

gdalle commented 6 months ago

2024.05.01

Work done

Slow benchmarks due to bug in Graphs.jl graph generator => contributed a fix
Now scales to several million vertices
GraphBLAS scales nicely
Threaded BFS slows down on really large graphs
On densely connected graphs, queue merging is the bottleneck
Oversubscribing with more queues than tasks to balance the load
Sequential greedy coloring
Benchmark of networkx with Python

Todo

For benchmark plotting
- Ensure homogeneous nesting (same depth everywhere)
- Manual parsing of BenchmarkGroup, each nesting level is a column
- Use https://dataframes.juliadata.org/stable/ for data manipulation and plotting
Revisit efficient merging of queues
- Leverage their internal structure as chain of vectors
- Use copyto! from each vector to the right chunk of the global one
Give Bool type argument to Graphs.adjacency_matrix directly
Performance of coloring:
- Vector of Bool for available colors updated in place
Heuristics for coloring:
- Largest Degree First
- Smallest Last: page 691 of https://epubs.siam.org/doi/10.1137/S0036144504444711
Guarantees:
- Guarantee that you have at most max degree + 1 colors for LDF?
- More colors than size of the maximum clique
Parallelize coloring with atomics ?
Benchmark networkx from Julia with https://github.com/JuliaPy/PythonCall.jl
- Graphs from CSV
- Converter to networkx
Don't put the CSVs on GitHub, use https://github.com/oxinabox/DataDeps.jl or https://github.com/tecosaur/DataToolkit.jl
Start working on linalg coloring based on https://people.eecs.berkeley.edu/~aydin/coloring.pdf
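
For the sequential coloring items above, a sketch of greedy coloring with a `Bool` vector of forbidden colors that is updated in place and reset after each vertex. The graph is a plain adjacency list without self-loops, and `greedy_coloring`, `largest_degree_first` and `is_proper_coloring` are hypothetical names. Since a vertex has at most max-degree colored neighbors, greedy coloring never needs more than max degree + 1 colors, whatever the vertex order.

```julia
# Greedy coloring: visit vertices in `order`, give each the smallest free color.
function greedy_coloring(adj::Vector{Vector{Int}}; order=eachindex(adj))
    colors = zeros(Int, length(adj))             # 0 means "not colored yet"
    maxdeg = maximum(length, adj; init=0)
    forbidden = fill(false, maxdeg + 1)          # reused for every vertex
    for v in order
        for u in adj[v]
            c = colors[u]
            c != 0 && (forbidden[c] = true)
        end
        colors[v] = findfirst(!, forbidden)      # smallest color not forbidden
        for u in adj[v]                          # reset only what was touched
            c = colors[u]
            c != 0 && (forbidden[c] = false)
        end
    end
    return colors
end

# Largest Degree First ordering: vertices sorted by decreasing degree.
largest_degree_first(adj) = sortperm(length.(adj); rev=true)

# Check that no edge joins two vertices of the same color.
is_proper_coloring(adj, colors) =
    all(colors[u] != colors[v] for v in eachindex(adj) for u in adj[v])
```

Usage: `greedy_coloring(adj; order=largest_degree_first(adj))` applies the Largest Degree First heuristic; Smallest Last would only change how `order` is computed.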

Bonus

Nice blog post https://viralinstruction.com/posts/hardware/

gdalle commented 6 months ago

08.05.2024

Work done

Fixed sequential coloring
First BLAS coloring is running
Boolean adjacency matrix was good for BFS, but for coloring we need a sum and promotion to Int is not supported => Keep Int, it's promoted to Float anyway

Todo

Clean up benchmarking and testing by writing functions and putting things in subfiles
Clarify random perturbation of the weight vector
See if you can use Int32
First draft of the report with structure and discussion of BFS

gdalle commented 6 months ago

17.05.2024

Work done

Not much, students busy this week
Coloring based on max indep set is not working
Better benchmarks

gdalle commented 5 months ago

22.05.2024

Work done

Benchmark plots
Coloring works with better indep set

Discussion

Output of the function cannot be recovered from the benchmark result
Benchmark on one side, measure number of colors on the other
Results can be aggregated (minimum or median), but it is better to plot a confidence interval: median as a line, 25-75 quantiles as a colored ribbon or a small bar
Log axes
Vary line styles and marker styles (colorblind-friendly)
Sparse matrices but dense vectors: don't try to store the latter as GBVector
Random number generation takes half the time
Try Float32? Inconclusive
Replace sparse rand with dense rand
Parametrize everything by the random number generator rng
For operations on dense vectors, use LinearAlgebra and https://docs.julialang.org/en/v1/stdlib/LinearAlgebra/#Standard-functions
Rand for dense and then switch to GB
Stick to GBVector
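
For the plotting advice above (median line plus a 25-75 quantile ribbon), the per-configuration summary can be computed with the `Statistics` standard library. A minimal sketch, assuming the raw sample times have already been extracted into a vector; `summarize_times` is a hypothetical name.

```julia
using Statistics

# Summarize a vector of sample times for one benchmark configuration.
# Returns the median (line) and the 25% / 75% quantiles (ribbon bounds).
function summarize_times(times::AbstractVector{<:Real})
    q25, q50, q75 = quantile(times, [0.25, 0.5, 0.75])
    return (; median=q50, lower=q25, upper=q75)
end
```

Most plotting packages expect ribbon widths relative to the line, so `median - lower` and `upper - median` are typically what gets passed on.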

gdalle commented 5 months ago

29.05.2024

Work done

Best algorithm for coloring on GPU uses scatter that doesn't belong to GraphBLAS
GraphBLAS coloring slower than (bugfixed) implementation in Graphs.jl

Discussion

Is it reasonable to implement a sequential version of the indepset-based coloring as a comparison point?
Does the indepset-based coloring return the same thing on CPU and GPU if the random seed is the same?
Replace random weights by a predefined ranking to leverage Int operations instead of Float64
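
A possible sequential comparison point for the last idea, sketched under assumptions: each vertex gets a distinct integer rank from a random permutation (instead of a Float64 weight), and in every round the uncolored vertices whose rank beats all their uncolored neighbors form an independent set that receives a new color. The graph is a plain adjacency list without self-loops, and `indepset_coloring` is a hypothetical name; this is not the GraphBLAS implementation discussed above.

```julia
using Random

function indepset_coloring(adj::Vector{Vector{Int}}; rng=Random.default_rng())
    n = length(adj)
    rank = randperm(rng, n)             # distinct integer priorities
    colors = zeros(Int, n)              # 0 means "not colored yet"
    color = 0
    while any(iszero, colors)
        color += 1
        # Uncolored local maxima of `rank`: two adjacent vertices cannot both
        # be selected because their ranks differ, so the set is independent.
        selected = [v for v in 1:n if colors[v] == 0 &&
                    all(u -> colors[u] != 0 || rank[u] < rank[v], adj[v])]
        colors[selected] .= color
    end
    return colors
end
```

The uncolored vertex with the highest rank is always selected, so every round makes progress. Passing the same `rng` seed reproduces the same ranking, which makes the output comparable across runs.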

Todo

Report v1 for Monday
