Closed prateek22sri closed 10 months ago
A couple of questions.
If you are using SuiteSparse:GraphBLAS then Tim Davis will have to explain how threading is controlled by the underlying GraphBLAS library (answers to the above questions are important). I do know that in earlier versions of SuiteSparse:GraphBLAS, it would throttle the use of threads if a heuristic indicated there was not enough parallelism. I believe it also has a way of reporting low level decisions like this (look for references to burble).
Several issues:
(1) the bfs_demo program has its own threading control. I don't know how that would interact with the environment variable. The program itself runs a BFS with lots of different # of threads. It would be better to create your list of thread counts inside bfs_demo.c, not outside.
(2) are you timing the whole bfs_demo program? It calls a single-threaded function to read in the matrix, and that is rather slow as well (LAGraph_MMRead, I think).
(3) the parallelism inside the LAGraph_BreadthFirstSearch itself is inside each level of the BFS. So if the graph has a high diameter, then each level is small, and there's very little work to do. I have a heuristic that uses fewer threads than the max # of threads when the work is small. With a default "chunk" parameter of about 1 million, and an amount of work w, then I roughly use at most w / chunk threads (also limited by omp_get_max_threads).
(4) I think I entirely ignore the OMP_NUM_THREADS environment setting, unless that revises the return value of omp_get_max_threads (which I think it doesn't). All of my parallel for loops have their own "num_threads(nthreads)" setting, where nthreads is computed as in (3). OMP_NUM_THREADS sets the "nthreads-var" variable in OpenMP. This discussion: https://www.openmp.org/spec-html/5.0/openmpsu35.html#x55-880002.6.1 states (I think) that OMP_NUM_THREADS only changes the # of threads used for parallel regions that do not have a "num_threads" clause. But ALL of my parallel regions have just such a clause.
Thank you @mcmillan03 and @DrTimothyAldenDavis for your responses. @mcmillan03, the graphs I am using aren't very big; I am using various sizes, but I am limiting them to at most about 20K vertices. I am using Dr. Davis's SuiteSparse. I haven't modified bfs_demo.c yet; instead, I have been using VTune to profile it.
In response to @DrTimothyAldenDavis's reply:
(1) Does that mean that if I run the code on a single node of an MPP, the program will itself optimize the number of threads, without needing any environment variables?
(2) I am profiling the entire thing using VTune, but filtering on the bfs_demo function. I am not sure how good a job I am doing with that. Do you have a recommendation, since I need to baseline the state-of-the-art implementation for my research?
(3) That's good to know; I will keep an eye on the diameter of the graphs I am trying to profile.
(4) I see.
By any chance, do you have a modified bfs_demo.c file that you used for such benchmarking, one that reports some sort of performance metric?
(1) I think so. Check the output of bfs_demo.c. It lists the # of threads it will use for the tests. You can also revise the bfs_demo.c code to change NTHREAD_LIST and THREAD_LIST to select the experiments to run, each with a different # of threads. Internally, when given some max # of threads it can use, GraphBLAS itself is able to select fewer threads if the work in a parallel region is small.
(2) I'm not sure what you mean by filtering.
For your last question: this bfs_demo.c is what I used for the GAP benchmark. See for example LAGraph/src/benchmark/do_gap_all. I used a binary file format, which you can create using the mtx2bin_demo.c program from a Matrix Market file. See the README.txt file in that folder. Just be sure to use the latest version of GraphBLAS.
The demo programs (which are not well named; they are really benchmark programs) time the underlying functions. They do not include the time for loading the graphs.
20K nodes is a very tiny graph. You won't see much parallelism in graphs that small, particularly in the BFS. The LAGraph BFS takes about half a second to run on a graph with 4 billion edges (GAP/kron, https://sparse.tamu.edu/GAP/GAP-kron), using 20 or 40 threads on a 20-core Intel Xeon. All of my recent timing results are in the LAGraph/src/benchmark/GAP_2023.txt file. You would need to try some much larger graphs to see decent parallelism.
The LAGraph/src/benchmark/GAP_2023.txt file is in the v1.1 branch:
https://github.com/GraphBLAS/LAGraph/blob/v1.1/src/benchmark/GAP_2023.txt
The timing results there are for 40 threads on a 20-core Xeon. I haven't put the single-thread results in that file yet.
Thanks so much @DrTimothyAldenDavis for all your responses. This solves my issue. Thanks.
Hi, I am trying to baseline using LAGraph's BFS with OpenMP, but I don't think the BFS function itself is running in parallel. When I profile, no matter how many OMP threads I give it, it never uses more than 4 threads, and those might only be for the graph reads. I used the following script to run the code. Please let me know if I am doing anything wrong.
```shell
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=64

source=$((2))
dataPath=/N/u/pratsriv/BigRed200/Datasets_IPDPS/Real/amazon0302.edgelist_modified

export OMP_PLACES=cores
export OMP_PROC_BIND=TRUE

for i in `seq 0 7`; do
    threads=$((2**$i))
    export OMP_NUM_THREADS=$threads
    srun -N 1 -n 1 -c $threads ./bfs_demo $dataPath $((source))
done
```