DrTimothyAldenDavis / SuiteSparse

The official SuiteSparse library: a suite of sparse matrix algorithms authored or co-authored by Tim Davis, Texas A&M University.
https://people.engr.tamu.edu/davis/suitesparse.html

New paper that yields 12-24x faster runtime for BFS #116

Closed: alexandergunnarson closed this issue 2 years ago

alexandergunnarson commented 2 years ago

Is your feature request related to a problem? Please describe.
Not a problem; I just came across a recent paper (April 2021) that claims to dramatically improve BFS performance over current GraphBLAS implementations. We're going to be using RedisGraph's BFS functionality, and RedisGraph uses SuiteSparse.

Describe the solution you'd like
Potentially implement the paper in question and thereby improve BFS performance.

Describe alternatives you've considered
N/A

Additional context
While I'm a software engineer, I'm not at all a specialist in BLAS or C programming, so if this paper isn't implementable as written, or if it's too much of a refactoring effort, I understand. Just wanted to bring this to your attention in case it's helpful!

Also, let me know if this is a feature discussion that belongs upstream (e.g. the GraphBLAS C API). Again, not being a specialist in this, I'm not sure whether the paper suggests an overhaul in current GraphBLAS implementations or in the GraphBLAS API itself.

DrTimothyAldenDavis commented 2 years ago

Yes, I know about the paper. Its benchmarking is flawed, however: it compared against an older version of GraphBLAS, and even then it used a demo BFS from GraphBLAS, not the latest (faster) version in LAGraph. The paper is out of date. GraphBLAS is now many times faster than the BFS described in the paper.

alexandergunnarson commented 2 years ago

Incredibly quick response — thank you! And thanks for the context. Makes sense.

DrTimothyAldenDavis commented 2 years ago

No problem. Glad to help.

Another flaw in the paper ... the demo BFS method in GraphBLAS that was used for the comparison had "this is for illustration only" written in its comments. I've since deleted that BFS, and the BFS in LAGraph now serves both purposes: it's a polished, elegant BFS that is easy to read, and it's also fast.

The latest BFS in LAGraph + GraphBLAS is nearly as fast as the fastest known BFS in the GAP Benchmark by Scott Beamer. For example, on a graph of about 4 billion edges, the GAP BFS finishes in about 0.33 seconds, while LAGraph+GraphBLAS takes 0.5 seconds. LAGraph+GraphBLAS uses 64-bit integers throughout, because GraphBLAS needs to be able to handle huge problems. The GAP BFS uses 32-bit integers and gains a performance boost as a result. The GAP test graphs have about 2^32 minus 500 edges ... in other words, any larger and 64-bit indices would be needed for the GAP as well.

I'd like to compare the GAP with 64-bit integers someday to see how much difference this makes.

The paper on arXiv should have compared against a polished BFS method with more stable performance, such as the one in the GAP benchmark by Scott Beamer. That's the method to beat, not mine. The LAGraph+GraphBLAS BFS was under intense development as that paper was being written, and the current BFS is perhaps 50x to 100x faster than my first try at the problem. The demo BFS wasn't in LAGraph at all, just in my GraphBLAS/Demo folder.
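For reference, calling that polished BFS from user code looks roughly like this. This is a minimal sketch using the LAGraph v1.x API as I recall it (check LAGraph.h for the exact signatures); it builds a tiny triangle graph and runs the BFS from node 0:

```c
#include "LAGraph.h"

int main (void)
{
    char msg [LAGRAPH_MSG_LEN] ;
    LAGraph_Init (msg) ;

    // tiny 3-node triangle: edges (0,1), (1,2), (0,2), stored symmetrically
    GrB_Matrix A = NULL ;
    GrB_Matrix_new (&A, GrB_BOOL, 3, 3) ;
    GrB_Matrix_setElement_BOOL (A, true, 0, 1) ;
    GrB_Matrix_setElement_BOOL (A, true, 1, 0) ;
    GrB_Matrix_setElement_BOOL (A, true, 1, 2) ;
    GrB_Matrix_setElement_BOOL (A, true, 2, 1) ;
    GrB_Matrix_setElement_BOOL (A, true, 0, 2) ;
    GrB_Matrix_setElement_BOOL (A, true, 2, 0) ;

    LAGraph_Graph G = NULL ;
    LAGraph_New (&G, &A, LAGraph_ADJACENCY_UNDIRECTED, msg) ; // takes ownership of A
    LAGraph_Cached_OutDegree (G, msg) ;   // enables the push/pull heuristic

    GrB_Vector level = NULL, parent = NULL ;
    LAGr_BreadthFirstSearch (&level, &parent, G, 0, msg) ;

    GrB_Vector_free (&level) ; GrB_Vector_free (&parent) ;
    LAGraph_Delete (&G, msg) ;
    LAGraph_Finalize (msg) ;
    return (0) ;
}
```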

alexandergunnarson commented 2 years ago

Really interesting notes — thanks! Agreed with your assessment. Yes, 64-bit integers are important for our use case, as we'll likely exceed 4B edges in the next few months (I run an ecommerce startup). We're happy to eat the ~35% performance hit to support large graphs! And in the grand scheme of things, 4B edges is not even that large — there are several companies I'm aware of running graph DBs with trillions of edges.

Unrelated question that I'm happy to post as a separate issue, or just ask the RedisGraph team directly: given that RedisGraph uses SuiteSparse under the hood, and that (from what I've read) SuiteSparse supports GPU acceleration, will RedisGraph get GPU acceleration automatically if we run it on a GPU instance? I assume not, since ostensibly the burden would be on RedisGraph to move graphs (or portions thereof) to and from GPU memory, but I'm just curious.

DrTimothyAldenDavis commented 2 years ago

Once GraphBLAS exploits the GPU, it will speed up RedisGraph with no extra effort on its part, in general. It might help for RedisGraph to give GraphBLAS some optional hints, though. We'll see how it works out.

alexandergunnarson commented 2 years ago

> it will speed up RedisGraph with no extra effort on its part, in general.

Awesome! This is exciting to hear.

> Once GraphBLAS exploits the GPU

Curious — is this in the roadmap for SuiteSparse? Wondering if it would be straightforward to adapt parts of GraphBLAST for the purpose. I'm assuming it's very much non-trivial.

DrTimothyAldenDavis commented 2 years ago

We're working on GPU acceleration for SuiteSparse:GraphBLAS now.

alexandergunnarson commented 2 years ago

Exciting — can't wait to try it out once it's ready! (I'm assuming it'll be a year or two?)

In any case, I'm sure once you're done, the Redis team will write up a cool-looking blog post saying "RedisGraph is now 10x faster". Hope you and any other SuiteSparse contributors get featured and get the spotlight you deserve 😊

DrTimothyAldenDavis commented 2 years ago

Thanks! We should have a first cut done well before then. We already have some draft CUDA kernels working (see the GraphBLAS/CUDA folder, which lives not in this stable SuiteSparse meta-library but in the GraphBLAS repo: https://github.com/DrTimothyAldenDavis/GraphBLAS/tree/master/CUDA ), in collaboration with @jeaton32 and @cjnolet at NVIDIA.

We are extending the process by which we create those kernels to use a JIT. That way, we can build kernels for arbitrary types, operators, and semirings at run time, rather than precompiling thousands of CUDA kernels.

Our plan is to introduce the CUDA kernels incrementally, which will work fine because we're using unified memory management. Then some calls to GrB methods will exploit the GPU, and some will not. Over time, more will use the GPU. Our first goal is to finish the specific use of GrB_mxm needed by triangle counting, followed by GrB_reduce (also for triangle counting), to see how fast we can make triangle counting. Then we will branch out and write CUDA kernels for more of the GrB methods.
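For context, the pattern in question looks roughly like this on the CPU today. This is a sketch of one standard masked-mxm triangle-count formulation, not our exact kernel code; it assumes A is a symmetric boolean adjacency matrix with no self-edges:

```c
#include "GraphBLAS.h"

// Sketch: C<struct(A)> = A*A with the PLUS_PAIR semiring counts, for each
// edge (i,j), the common neighbors of i and j; summing C then counts each
// triangle six times.
int64_t ntriangles (GrB_Matrix A)
{
    GrB_Index n ;
    GrB_Matrix_nrows (&n, A) ;

    GrB_Matrix C = NULL ;
    GrB_Matrix_new (&C, GrB_INT64, n, n) ;

    // C<struct(A)> = A*A over the PLUS_PAIR semiring (pair(x,y) = 1)
    GrB_mxm (C, A, NULL, GxB_PLUS_PAIR_INT64, A, A, GrB_DESC_S) ;

    // sum all entries of C, then divide by 6
    int64_t sum = 0 ;
    GrB_Matrix_reduce_INT64 (&sum, NULL, GrB_PLUS_MONOID_INT64, C, NULL) ;
    GrB_Matrix_free (&C) ;
    return (sum / 6) ;
}
```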

This will all be transparent to the user application. For example, the LAGraph triangle count will be just the same, only faster for large problems. For best performance, we might need to add optional hints via GxB_set, so the user application can tell GraphBLAS to use, or not use, the GPU, or how many GPUs to use, or where to place the data (on the CPU or GPU), and so on.
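As a sketch, such hints might end up looking like the existing GxB_set options. The GPU option names below are hypothetical and do not exist yet; only the GxB_set mechanism itself is real today:

```c
// The GxB_set mechanism is real; for example, this existing global option:
GxB_set (GxB_NTHREADS, 8) ;

// Hypothetical GPU hints (these option names do not exist yet):
// GxB_set (GxB_GPU_CONTROL, GxB_GPU_ALWAYS) ;   // global: prefer the GPU
// GxB_set (A, GxB_GPU_CONTROL, GxB_GPU_NEVER) ; // per-matrix: keep A on the CPU
```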

User-defined types and operators can use the new GxB_Type_new and related methods, which take strings that define the types; see the latest v6.1.3 release for this. That way, we can create and compile CUDA kernels at run time, using these strings to generate the user-defined functions and typedefs.
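For example, roughly (a sketch based on the v6.x GxB extensions; check GraphBLAS.h for the exact signatures):

```c
#include "GraphBLAS.h"

// Sketch: register a user-defined type along with its C definition as a
// string, so kernels compiled at run time can reproduce the typedef.
typedef struct { double re ; double im ; } mycomplex ;

GrB_Type MyComplex = NULL ;
GxB_Type_new (&MyComplex, sizeof (mycomplex), "mycomplex",
    "typedef struct { double re ; double im ; } mycomplex ;") ;
```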

alexandergunnarson commented 1 year ago

Sorry I never got back to you — thanks so much for explaining all that! I took a cursory look at https://github.com/DrTimothyAldenDavis/GraphBLAS/blob/master/CUDA/jitify.hpp. Fascinating you're able to use JIT compilation for CUDA kernels!

On your point that "the user application [should be able to] tell GraphBLAS ... where to place the data (on the CPU or GPU)", that flexibility is really appealing to me. Bandwidth between CPU and GPU increases every year, but it's always best to bring the compute as close to the data as possible. I wonder whether, much like in-memory DBs seemed extravagant in the age of magnetic disks but are now relatively commonplace, we'll see the rise of on-GPU graph DBs that are finally able to exploit all those cores with mmults. Return of Moore's Law, in a way!

Also, related to GPUs but unrelated to the discussion above: not sure if you're familiar with the papers below, but it's interesting that they claim to outperform NSPARSE and spECK (and cuSPARSE too, though I see that outperformed much more often). AFAIK you're still experimenting with CUDA support for SuiteSparse:GraphBLAS, so they might be interesting to take a look at, FWIW:

DrTimothyAldenDavis commented 1 year ago

The CUDA kernels in SuiteSparse:GraphBLAS are still a work in progress, but it's coming along nicely. Our plan is to select between the CPU and GPU automatically, based on where the data is and how big the problem is. We'll also provide hints down to the matrix level, like "this matrix A should go to the GPU if possible", via GxB_set (A, ..., ...). Those hints will be optional, however. Using the GPU will be very simple: just use LAGr_Init (GxB_NONBLOCKING_GPU, ...), also passing in the wrappers for the RAPIDS Memory Manager (rmm_wrap_malloc and friends). The rest will be automatic.
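Concretely, the setup would be something like this (a sketch; the exact names could still change before release):

```c
#include "LAGraph.h"
#include "rmm_wrap.h"   // RAPIDS Memory Manager wrappers, in GraphBLAS/rmm_wrap

// Sketch of the planned GPU initialization (subject to change): start
// LAGraph/GraphBLAS in the GPU-enabled nonblocking mode, handing GraphBLAS
// the RMM allocation wrappers.
char msg [LAGRAPH_MSG_LEN] ;
LAGr_Init (GxB_NONBLOCKING_GPU,
    rmm_wrap_malloc, rmm_wrap_calloc, rmm_wrap_realloc, rmm_wrap_free, msg) ;

// ... the rest of the application is unchanged; GrB methods that have GPU
// kernels will use the GPU automatically.
```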

I wasn't aware of those papers but I'll take a look.

alexandergunnarson commented 1 year ago

Thanks so much for your prompt response! Always a pleasure to hear back from you so quickly, the fountain of knowledge that you are :)

Very exciting that there's progress there!! I'm curious — what state of readiness is it in, and do you have a timeline you're targeting? I'm presuming we could use SuiteSparse's sparse boolean matrix multiplication on CUDA right now if we compiled from stable, but that there are additional GraphBLAS features that have yet to be fully implemented.
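(For concreteness, by "sparse boolean matrix multiplication" I mean the standard GrB_mxm over a boolean semiring, something like this sketch:)

```c
#include "GraphBLAS.h"

// Sketch: C = A (OR.AND) B over the boolean LOR-LAND semiring, where A and B
// are existing boolean GrB_Matrix objects with compatible dimensions.
GrB_Matrix bool_mxm (GrB_Matrix A, GrB_Matrix B)
{
    GrB_Index nrows, ncols ;
    GrB_Matrix_nrows (&nrows, A) ;
    GrB_Matrix_ncols (&ncols, B) ;

    GrB_Matrix C = NULL ;
    GrB_Matrix_new (&C, GrB_BOOL, nrows, ncols) ;
    GrB_mxm (C, NULL, NULL, GrB_LOR_LAND_SEMIRING_BOOL, A, B, NULL) ;
    return (C) ;
}
```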

Some graph DB architectures I've seen (e.g. https://github.com/SJTU-IPADS/wukong-g) ferry data back and forth between GPU and main memory — using the GPU as a cache of sorts — but I can envision there being significant speedups if the data can fit in GPU memory.