GraphBLAS / LAGraph

This is a library plus a test harness for collecting algorithms that use the GraphBLAS. For test coverage reports, see https://graphblas.org/LAGraph/ . Documentation: https://lagraph.readthedocs.org

Unable to use parallelism in bfs #241

Closed prateek22sri closed 10 months ago

prateek22sri commented 10 months ago

Hi, I am trying to establish a baseline using LAGraph's BFS with OpenMP; however, I don't think I am getting the bfs function itself to run in parallel. When I profile, no matter how many OMP threads I request, it never uses more than 4 threads, and those might be used only for reading the graph. I used the following script to run the code. Please let me know if I am doing anything wrong.

```shell
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=64

source=2
dataPath=/N/u/pratsriv/BigRed200/Datasets_IPDPS/Real/amazon0302.edgelist_modified

export OMP_PLACES=cores
export OMP_PROC_BIND=TRUE

for i in $(seq 0 7) ; do
    threads=$((2**i))
    export OMP_NUM_THREADS=$threads
    srun -N 1 -n 1 -c $threads ./bfs_demo $dataPath $source
done
```

mcmillan03 commented 10 months ago

A couple of questions: how large are the graphs you are testing, and which GraphBLAS implementation are you using?

If you are using SuiteSparse:GraphBLAS then Tim Davis will have to explain how threading is controlled by the underlying GraphBLAS library (answers to the above questions are important). I do know that in earlier versions of SuiteSparse:GraphBLAS, it would throttle the use of threads if a heuristic indicated there was not enough parallelism. I believe it also has a way of reporting low level decisions like this (look for references to burble).

DrTimothyAldenDavis commented 10 months ago

Several issues:

(1) the bfs_demo program has its own threading control. I don't know how that would interact with the environment variable. The program itself runs the BFS with lots of different thread counts. It would be better to create your list of thread counts inside bfs_demo.c, not outside.

(2) are you timing the whole bfs_demo program? It calls a single-threaded function to read in the matrix, which is rather slow as well (LAGraph_MMRead, I think).

(3) the parallelism inside LAGraph_BreadthFirstSearch itself is within each level of the BFS. So if the graph has a high diameter, then each level is small and there is very little work to do. I have a heuristic that uses fewer threads than the max # of threads when the work is small: with a default "chunk" parameter of about 1 million and an amount of work w, I use at most roughly w / chunk threads (also limited by omp_get_max_threads).

(4) I think I entirely ignore the OMP_NUM_THREADS environment setting, unless that revises the return value of omp_get_max_threads (which I think it doesn't). All of my parallel for loops have their own "num_threads(nthreads)" setting, where nthreads is computed as in (3). OMP_NUM_THREADS sets the "nthreads-var" variable in OpenMP. This discussion: https://www.openmp.org/spec-html/5.0/openmpsu35.html#x55-880002.6.1 states (I think) that OMP_NUM_THREADS only changes the # of threads used for parallel regions that do not have a "num_threads" clause. But ALL of my parallel regions have just such a clause.

prateek22sri commented 10 months ago

Thank you @mcmillan03 and @DrTimothyAldenDavis for your responses. @mcmillan03, the graphs I am using aren't very big; I am using various sizes, but I am limiting them to at most about 20K vertices. I am using the SuiteSparse implementation by Dr. Davis. I haven't modified bfs_demo.c yet; instead I was using VTune to profile it.

In response to @DrTimothyAldenDavis reply,

(1) Does that mean if I run the code on a single node of an MPP, the program will itself optimize the no. of threads without needing any environment variables?

(2) I am profiling the entire run using VTune; however, I am filtering on the function bfs_demo, and I am not sure how good a job I am doing at that. Do you have a recommendation, as I need to baseline against the state-of-the-art implementation for my research?

(3) That's good to know; I will keep an eye on the diameter of the graphs I am trying to profile.

(4) I see.

By any chance, do you have a modified bfs_demo.c file that you used for such benchmarking, one that reports some sort of performance metric?

DrTimothyAldenDavis commented 10 months ago

(1) I think so. Check the output of bfs_demo.c. It lists the # of threads it will use for the tests. You can also revise the bfs_demo.c code to change NTHREAD_LIST and THREAD_LIST to select the experiments to run, each with a different # of threads. Internally, when given some max # of threads it can use, GraphBLAS itself is able to select fewer threads if the work in a parallel region is small.

(2) I'm not sure what you mean by filtering.

For your last question: this bfs_demo.c is what I used for the GAP benchmark. See for example LAGraph/src/benchmark/do_gap_all. I used a binary file format, which you can create using the mtx2bin_demo.c program from a Matrix Market file. See the README.txt file in that folder. Just be sure to use the latest version of GraphBLAS.

The demo programs (which are not well named; they are really benchmark programs) time the underlying functions. They do not include the time for loading the graphs.

20K nodes is a very tiny graph. You won't see much parallelism in graphs that small, particularly in the BFS. The LAGraph BFS takes about half a second to run on a graph with 4 billion edges (GAP/kron, https://sparse.tamu.edu/GAP/GAP-kron), using 20 or 40 threads on a 20-core Intel Xeon. All of my recent timing results are in the LAGraph/src/benchmark/GAP_2023.txt file. You would need to try some much larger graphs to see decent parallelism.

DrTimothyAldenDavis commented 10 months ago

The LAGraph/src/benchmark/GAP_2023.txt file is in the v1.1 branch:

https://github.com/GraphBLAS/LAGraph/blob/v1.1/src/benchmark/GAP_2023.txt

The timing results there are for 40 threads on a 20-core Xeon. I haven't put the single-thread results in that file yet.

prateek22sri commented 10 months ago

Thanks so much @DrTimothyAldenDavis for all your responses. This solves my issue. Thanks.