IntelligentSoftwareSystems / Galois

Galois: C++ library for multi-core and multi-node parallelization
http://iss.ices.utexas.edu/?p=projects/galois

Performance degradation for distributed PageRank #402

Closed OneMoreProblem closed 2 years ago

OneMoreProblem commented 2 years ago

Good day. Could you please explain some details about the distribution process in Galois? I am facing performance degradation when using mpirun; maybe you can clarify it for me with my example.

I am processing the twitter-2010 graph with the PageRank algorithm implemented in Galois. I ran PageRank with 80 threads, without MPI:

~/Galois/Galois/lonestar/analytics/distributed/pagerank/pagerank-pull-dist twitter-2010.gr -graphTranspose=twitter-2010.tgr --exec=Sync --runs=1 --maxIterations=20 --output=1 --outputLocation=/home/user/calc/ --statFile=/home/user/calc/stat.log -t=80

It works fast, with these stats:

STAT_TYPE, HOST_ID, REGION, CATEGORY, TOTAL_TYPE, TOTAL
STAT, 0, dGraph_Generic, EdgeLoading, HMAX, 388
STAT, 0, dGraph_Generic, CuSPStateRounds, HOST_0, 100
STAT, 0, dGraph_Generic, EdgeInspection, HMAX, 762
STAT, 0, dGraph_Generic, GraphReading, HMAX, 7252
STAT, 0, DistBench, GraphConstructTime, HMAX, 14383
STAT, 0, ResetGraph_0, Time, HMAX, 39
STAT, 0, InitializeGraph_0, Time, HMAX, 3040
STAT, 0, Gluon, ReduceNumMessages_InitializeGraph_0, HSUM, 0
STAT, 0, Gluon, BroadcastNumMessages_InitializeGraph_0, HSUM, 0
STAT, 0, Gluon, Sync_InitializeGraph_0, HMAX, 1
STAT, 0, Gluon, BroadcastNumMessages_PageRank_0, HSUM, 0
STAT, 0, Gluon, Sync_PageRank_0, HMAX, 1
STAT, 0, PageRank_delta_0, Time, HMAX, 571
STAT, 0, PageRank_0, Time, HMAX, 9794
STAT, 0, PageRank, NumWorkItems_0, HSUM, 29367303640
STAT, 0, PageRank, NumIterations_0, HMAX, 20
STAT, 0, PageRank, Timer_0, HMAX, 10403
STAT, 0, PageRank, TimerTotal, HMAX, 27895
STAT, 0, PageRankSanity, Time, HMAX, 22
STAT, 0, Gluon, ReplicationFactor, HOST_0, 1
PARAM, 0, DistBench, CommandLine, HOST_0, /home/user/Galois/Galois/lonestar/analytics/distributed/pagerank/pagerank-pull-dist twitter-2010.gr -graphTranspose=twitter-2010.tgr --exec=Sync --runs=1 --maxIterations=20 --output=1 --outputLocation=/home/user/calc/ --statFile=/home/user/calc/stat.log -t=80
PARAM, 0, DistBench, Threads, HOST_0, 80
PARAM, 0, DistBench, Hosts, HOST_0, 1
PARAM, 0, DistBench, Runs, HOST_0, 1
PARAM, 0, DistBench, Run_UUID, HOST_0, 1adb9b97-8c0a-4175-9e19-b24e6a54f3e7
PARAM, 0, DistBench, Input, HOST_0, twitter-2010.gr
PARAM, 0, DistBench, PartitionScheme, HOST_0, oec
PARAM, 0, DistBench, Hostname, HOST_0, mrc-ailab-ub48
PARAM, 0, PageRank, Max Iterations, HOST_0, 20
PARAM, 0, PageRank, Tolerance, HOST_0, 1e-06
PARAM, 0, dGraph, GenericPartitioner, HOST_0, 0

All 80 available cores are loaded up to 100%. If I understood correctly, STAT, 0, PageRank, TimerTotal, HMAX, 27895 means it was done in 27.895 seconds.

But if I use MPI:

mpirun -n 2 ~/Galois/Galois/lonestar/analytics/distributed/pagerank/pagerank-pull-dist twitter-2010.gr -graphTranspose=twitter-2010.tgr --exec=Sync --runs=1 --maxIterations=20 --output=1 --outputLocation=/home/user/calc/ --statFile=/home/user/calc/stat.log -t=40

I got:

STAT_TYPE, HOST_ID, REGION, CATEGORY, TOTAL_TYPE, TOTAL
STAT, 0, dGraph_Generic, EdgeLoading, HMAX, 29520
STAT, 0, dGraph_Generic, TIMER_GRAPH_TRANSPOSE, HMAX, 217796
STAT, 0, dGraph_Generic, CuSPStateRounds, HOST_0, 100
STAT, 0, dGraph_Generic, EdgeInspection, HMAX, 5004
STAT, 0, dGraph_Generic, GraphReading, HMAX, 3515
STAT, 0, TRANSPOSE_EDGEINTDATA_COPY, Time, HMAX, 290
STAT, 0, TRANSPOSE_EDGEINTDATA_INC, Time, HMAX, 40795
STAT, 0, TRANSPOSE_EDGEINTDATA_SET, Time, HMAX, 135
STAT, 0, TRANSPOSE_EDGEINTDATA_TEMP, Time, HMAX, 82
STAT, 0, TRANSPOSE_EDGEDST, Time, HMAX, 176390
STAT, 0, DistBench, GraphConstructTime, HMAX, 262721
STAT, 0, ResetGraph_0, Time, HMAX, 879
STAT, 0, InitializeGraph_0, Time, HMAX, 14886
STAT, 0, Gluon, BroadcastSendBytes_InitializeGraph_0, HSUM, 103854808
STAT, 0, Gluon, BroadcastNumMessages_InitializeGraph_0, HSUM, 2
STAT, 0, Gluon, Sync_InitializeGraph_0, HMAX, 3804
STAT, 0, Gluon, ReduceSendBytes_PageRank_0, HSUM, 2095417680
STAT, 0, Gluon, ReduceNumMessages_PageRank_0, HSUM, 40
STAT, 0, Gluon, Sync_PageRank_0, HMAX, 20654
STAT, 0, PageRank_delta_0, Time, HMAX, 4322
STAT, 0, RESET:MIRRORS_0, Time, HMAX, 634
STAT, 0, PageRank_0, Time, HMAX, 145668
STAT, 0, PageRank, NumWorkItems_0, HSUM, 29367303640
STAT, 0, PageRank, NumIterations_0, HMAX, 20
STAT, 0, PageRank, Timer_0, HMAX, 161024
STAT, 0, PageRank, TimerTotal, HMAX, 442084
STAT, 0, PageRankSanity, Time, HMAX, 260
STAT, 0, Gluon, ReplicationFactor, HOST_0, 1.62032
PARAM, 0, DistBench, CommandLine, HOST_0, /home/user/Galois/Galois/lonestar/analytics/distributed/pagerank/pagerank-pull-dist twitter-2010.gr -graphTranspose=twitter-2010.tgr --exec=Sync --runs=1 --maxIterations=20 --output=1 --outputLocation=/home/user/calc/ --statFile=/home/user/calc/stat.log -t=40
PARAM, 0, DistBench, Threads, HOST_0, 2
PARAM, 0, DistBench, Hosts, HOST_0, 2
PARAM, 0, DistBench, Runs, HOST_0, 1
PARAM, 0, DistBench, Run_UUID, HOST_0, c81e25db-80d2-4138-a170-175421fad038
PARAM, 0, DistBench, Input, HOST_0, twitter-2010.gr
PARAM, 0, DistBench, PartitionScheme, HOST_0, oec
PARAM, 0, DistBench, Hostname, HOST_0, mrc-ailab-ub48
PARAM, 0, PageRank, Max Iterations, HOST_0, 20
PARAM, 0, PageRank, Tolerance, HOST_0, 1e-06
PARAM, 0, dGraph, GenericPartitioner, HOST_0, 0

Only 2-3 of the 80 available cores are loaded up to 100%, and STAT, 0, PageRank, TimerTotal, HMAX, 442084, i.e. 442.084 seconds.

And I have some questions about this situation:

1) Is this normal behavior for a distributed task, or did I just make a mistake in the launch or build parameters?

2) Does the performance degradation occur due to the high cost of communication between processes?

System configuration: 80x Intel(R) Xeon(R) Gold 6230 CPU @ 2.10GHz. Compiler: gcc (Ubuntu 9.4.0-1ubuntu1~20.04.1) 9.4.0.

Best regards.

roshandathathri commented 2 years ago

Which MPI runtime are you using? The key is to NOT bind threads in MPI or in Galois.

Something like mpirun --bind-to none -n ... may be sufficient.
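Applied to the run above, that would look something like this (a sketch assuming Open MPI, where the flag is spelled --bind-to none; other launchers, e.g. MPICH's Hydra, spell it -bind-to none):

# Keep mpirun from pinning each rank to a core/socket, so each rank's 40
# Galois threads can spread over the whole machine:
mpirun --bind-to none -n 2 ~/Galois/Galois/lonestar/analytics/distributed/pagerank/pagerank-pull-dist twitter-2010.gr -graphTranspose=twitter-2010.tgr --exec=Sync --runs=1 --maxIterations=20 -t=40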

OneMoreProblem commented 2 years ago

> Which MPI runtime are you using? The key is to NOT bind threads in MPI or in Galois.
>
> Something like mpirun --bind-to none -n ... may be sufficient.

GALOIS_DO_NOT_BIND_THREADS=1? Yes, this variable is set to 1 in my environment. But mpirun -n 2 ... -t 40 still loads only 4 processors and requires much more time to execute.

roshandathathri commented 2 years ago

It's not just about Galois. A correction to my previous statement.

The key is to NOT bind threads in MPI AND in Galois.
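Concretely, that means disabling binding on both sides at once; a minimal sketch, again assuming Open MPI (GALOIS_DO_NOT_BIND_THREADS=1 covers the Galois side, --bind-to none the MPI side):

# Disable Galois's internal thread binding AND mpirun's process binding:
GALOIS_DO_NOT_BIND_THREADS=1 mpirun --bind-to none -n 2 ./pagerank-pull-dist twitter-2010.gr -graphTranspose=twitter-2010.tgr --exec=Sync -t=40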

OneMoreProblem commented 2 years ago

> It's not just about Galois. A correction to my previous statement.
>
> The key is to NOT bind threads in MPI AND in Galois.

I tried mpirun --bind-to none -n 2 ... -t 40

Yes, it's working, thank you.

roshandathathri commented 2 years ago

You're welcome

Sanzo00 commented 2 years ago

I also encountered a similar problem when I ran the distributed version before (mpirun -n 2). I solved it with GALOIS_DO_NOT_BIND_THREADS=1 && mpirun -n 2 -t 72, but I don't know why. Doesn't binding threads to cores reduce the overhead of thread switching? Why does binding the cores reduce performance?

roshandathathri commented 2 years ago

Regarding MPI: MPI should NOT bind threads/processes, because Galois uses multi-threaded processes and manages threads internally. Binding in MPI kills performance.

Regarding GALOIS_DO_NOT_BIND_THREADS: Galois uses a dedicated communication thread, so binding all the compute threads hurts the performance of the communication thread. Binding all compute threads in Galois impacts performance, but not as severely as binding in MPI.
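To check what the launcher is actually doing, Open MPI's --report-bindings flag prints each rank's CPU binding at startup (a sketch; with --bind-to none each rank should report that it is not bound):

# Launch unbound and ask mpirun to report the resulting (non-)bindings:
GALOIS_DO_NOT_BIND_THREADS=1 mpirun --bind-to none --report-bindings -n 2 ./pagerank-pull-dist twitter-2010.gr -graphTranspose=twitter-2010.tgr --exec=Sync -t=40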

Sanzo00 commented 2 years ago

I see, thanks for your reply.