Closed OneMoreProblem closed 2 years ago
Which MPI runtime are you using? The key is to NOT bind threads in MPI or in Galois. Something like
mpirun --bind-to none -n ...
may be sufficient.
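Putting the two pieces of advice in this thread together, a full launch might look like the sketch below. This assumes Open MPI (where `--bind-to none` is the flag to disable process binding); the binary name and graph argument are placeholders, and `-t` is the Galois thread-count option used elsewhere in this thread.

```shell
# Sketch, assuming Open MPI. Disable binding on both sides:
# 1) Galois: tell it not to pin its compute threads
export GALOIS_DO_NOT_BIND_THREADS=1
# 2) MPI: launch without binding ranks to cores
#    (binary name and input graph below are placeholders)
mpirun --bind-to none -n 2 ./pagerank-dist input.gr -t 40
```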
GALOIS_DO_NOT_BIND_THREADS=1? Yes, this variable is set to 1 in my environment, but mpirun -n 2 ... -t 40
still loads only 4 processors and takes much more time to execute.
It's not just about Galois. A correction to my previous statement.
The key is to NOT bind threads in MPI AND in Galois.
I tried mpirun --bind-to none -n 2 ... -t 40
Yes, it's working, thank you.
You're welcome
I also encountered a similar problem when I ran the distributed version before (mpirun -n 2). I solved it with GALOIS_DO_NOT_BIND_THREADS=1 && mpirun -n 2 -t 72
, but I don't understand why. Doesn't binding threads to cores reduce the overhead of thread switching? Why does binding reduce performance here?
Regarding MPI, MPI should NOT be binding threads/processes because Galois uses multi-threaded processes and manages threads internally. Binding in MPI kills performance.
Regarding GALOIS_DO_NOT_BIND_THREADS, Galois uses a dedicated communication thread, so binding all the compute threads hurts the performance of the communication thread. Binding all compute threads in Galois impacts performance but not as severely as binding in MPI.
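One way to check which of the two binding layers is in effect: Open MPI's `mpirun` has a `--report-bindings` flag that prints each rank's CPU binding at launch. A sketch (Open MPI specific; `./app` is a placeholder binary):

```shell
# Sketch, Open MPI specific. --report-bindings prints each rank's
# CPU binding so you can confirm ranks are NOT pinned to a few cores.
# With --bind-to none, each rank should report that it is unbound.
mpirun --report-bindings --bind-to none -n 2 ./app -t 40
```

If a rank reports being bound to a single core or socket, all of its Galois compute threads are confined to that core/socket, which matches the "only a few cores loaded" symptom described below.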
I see, thanks for your reply.
Good day, could you please explain some details about distributed execution in Galois? I am facing performance degradation when using mpirun; maybe you can clarify it with my example. I process the twitter-2010 graph with the PageRank algorithm implemented in Galois. I ran PageRank with 80 threads without MPI:
It works fast with stats:
All 80 available cores are loaded up to 100%. If I understood correctly,
STAT, 0, PageRank, TimerTotal, HMAX, 27895
means it finished in 27.895 seconds. But if I use MPI:
I got:
Only 2-3 of the 80 available cores are loaded up to 100%, and
STAT, 0, PageRank, TimerTotal, HMAX, 442084
i.e., 442.084 seconds. And I have some questions about this situation:
1) Is this normal behavior for a distributed task, or did I make a mistake in the launch or build parameters?
2) Does the performance degradation occur due to the high cost of communication between processes?
System configuration: 80x Intel(R) Xeon(R) Gold 6230 CPU @ 2.10GHz. Compiler: gcc (Ubuntu 9.4.0-1ubuntu1~20.04.1) 9.4.0
Best regards.