IntelligentSoftwareSystems / Galois

Galois: C++ library for multi-core and multi-node parallelization
http://iss.ices.utexas.edu/?p=projects/galois

Output of distributed Galois is confusing #389

Open YuxinxinChen opened 2 years ago

YuxinxinChen commented 2 years ago

Hi Galois Team,

I am trying to run pagerank-push-dist with 2 nodes: mpirun -n 2 $ROOT/lonestar/analytics/distributed/pagerank/pagerank-push-dist mygraph.gr --num_nodes=2 --partition=oec --pset=g

The output is confusing. It looks to me like only process 0 is running. Also, for the input, if I would like to use the ginger-o partition, how can I get the transposed .tgr file?

Thanks!

l-hoang commented 2 years ago

Please post your output here.

ginger-o can be specified using the --partition option (use -h to see the correct argument). For the transposed graph, use the graph-convert tool under the tools directory at the root level.

You can use mpirun --tag-output to confirm whether multiple processes are running, though I see nothing wrong with the way you are running it there (that command indicates 2 machines, each with 1 GPU; if you want 1 machine with 2 GPUs on it, use mpirun -n 1 with --pset=gg).

nicelhc13 commented 2 years ago

I suspect that you may not be running your application on 2 nodes. Please check that your cluster scheduler (e.g. SLURM) is being used correctly. If you want to run this on 1 node with 2 GPUs, you should specify --pset=gg as Loc pointed out.

YuxinxinChen commented 2 years ago

Here is my output

D-Galois Benchmark Suite v6.0.0 (unknown)
Copyright (C) 2018 The University of Texas at Austin
http://iss.ices.utexas.edu/galois/

application: PageRank - Compiler Generated Distributed Heterogeneous
Residual PageRank on Distributed Galois.

[0] Master distribution time : 0.046137 seconds to read 216 bytes in 27 seeks (0.00468171 MBPS)
[0] Starting graph reading.
[0] Reading graph complete.
[0] Edge inspection time: 28.4214 seconds to read 4335433632 bytes (152.541 MBPS)
Loading edge-data while creating edges
[0] Edge loading time: 107.842 seconds to read 4335433632 bytes (40.2018 MBPS)
[0] Graph construction complete.
[0] Using GPU 0: Tesla V100-SXM2-16GB
[0] Host memory for communication context: 1338 MB
[0] Host memory for graph: 7850 MB
[0] InitializeGraph::go called
[0] PageRank::go run 0 called
Max rank is 76137.8
Min rank is 0.15
Rank sum is 3.4924e+07
Residual sum is 0
# nodes with residual over 0.01 (tolerance) is 0
Max residual is 1.17549e-38
Min residual is 0
[0] PageRank::go run 1 called
Max rank is 76139.8
Min rank is 0.15
Rank sum is 3.49271e+07
Residual sum is 0
# nodes with residual over 0.01 (tolerance) is 0
Max residual is 1.17549e-38
Min residual is 0
[0] PageRank::go run 2 called
Max rank is 76110.6
Min rank is 0.15
Rank sum is 3.49043e+07
Residual sum is 0
# nodes with residual over 0.01 (tolerance) is 0
Max residual is 1.17549e-38
Min residual is 0
[0] PageRank::go run 3 called
Max rank is 76122
Min rank is 0.15
Rank sum is 3.49152e+07
Residual sum is 0
# nodes with residual over 0.01 (tolerance) is 0
Max residual is 1.17549e-38
Min residual is 0
[0] PageRank::go run 4 called
Max rank is 76127.7
Min rank is 0.15
Rank sum is 3.49148e+07
Residual sum is 0
# nodes with residual over 0.01 (tolerance) is 0
Max residual is 1.17549e-38
Min residual is 0
[0] PageRank::go run 5 called
Max rank is 76109.1
Min rank is 0.15
Rank sum is 3.49026e+07
Residual sum is 0
# nodes with residual over 0.01 (tolerance) is 0
Max residual is 1.17549e-38
Min residual is 0
[0] PageRank::go run 6 called
Max rank is 77099.7
Min rank is 0.15
Rank sum is 3.53274e+07
Residual sum is 0
# nodes with residual over 0.01 (tolerance) is 0
Max residual is 1.17549e-38
Min residual is 0
[0] PageRank::go run 7 called
Max rank is 76722.2
Min rank is 0.15
Rank sum is 3.51755e+07
Residual sum is 0
# nodes with residual over 0.01 (tolerance) is 0
Max residual is 1.17549e-38
Min residual is 0
[0] PageRank::go run 8 called
Max rank is 76370
Min rank is 0.15
Rank sum is 3.50499e+07
Residual sum is 0
# nodes with residual over 0.01 (tolerance) is 0
Max residual is 1.17549e-38
Min residual is 0
[0] PageRank::go run 9 called
Max rank is 76111.7
Min rank is 0.15
Rank sum is 3.49062e+07
Residual sum is 0
# nodes with residual over 0.01 (tolerance) is 0
Max residual is 1.17549e-38
Min residual is 0
[1] Master distribution time : 0.064927 seconds to read 416 bytes in 52 seeks (0.0064072 MBPS)
[1] Starting graph reading.
[1] Reading graph complete.
[1] Edge inspection time: 29.8542 seconds to read 4335991372 bytes (145.239 MBPS)
[1] Edge loading time: 104.426 seconds to read 4335991372 bytes (41.522 MBPS)
[1] Graph construction complete.
[1] Using GPU 0: Tesla V100-SXM2-16GB
[1] Host memory for communication context: 1338 MB
[1] Host memory for graph: 7850 MB
[1] InitializeGraph::go called
[1] PageRank::go run 0 called
[1] PageRank::go run 1 called
[1] PageRank::go run 2 called
[1] PageRank::go run 3 called
[1] PageRank::go run 4 called
[1] PageRank::go run 5 called
[1] PageRank::go run 6 called
[1] PageRank::go run 7 called
[1] PageRank::go run 8 called
[1] PageRank::go run 9 called

PE 1 seems to call PageRank::go without printing any detailed information. More specifically, if I would like to know the communication volume between the two processes, do you have some handy tools or stats available for that?

Thanks!

nicelhc13 commented 2 years ago

Thank you. This looks slightly weird to me. Let me try to reproduce your problem. Are you still using the same command? Could you please provide the results of the nvidia-smi command?

Regarding Gluon communication volumes, you can enable this flag through CMake: GALOIS_COMM_STATS. It provides reduce/broadcast communication volumes. (You can find the detailed information in libgluon/include/galois/graphs/GluonSubstrate.h.)

YuxinxinChen commented 2 years ago

Yes, same command. Also, I am using Oak Ridge Summit, but it should work like Slurm. I am not sure if that causes the problem, but Summit's MPI has been working well for my other applications.

Here is the result of nvidia-smi:

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 450.80.02    Driver Version: 450.80.02    CUDA Version: 11.4     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Tesla V100-SXM2...  On   | 00000004:04:00.0 Off |                    0 |
| N/A   29C    P0    37W / 300W |      0MiB / 16160MiB |      0%   E. Process |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+
Mon Feb 14 15:13:03 2022       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 450.80.02    Driver Version: 450.80.02    CUDA Version: 11.4     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Tesla V100-SXM2...  On   | 00000004:04:00.0 Off |                    0 |
| N/A   29C    P0    37W / 300W |      0MiB / 16160MiB |      0%   E. Process |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

l-hoang commented 2 years ago

Your output actually looks sane to me: both hosts ([0] and [1]) are picking up the GPU local to that host, and things are running. The sanity check output only prints on host 0, and the prints are out of order since you can't guarantee when things get flushed to disk.

Do you have the stats files? (You can save them to disk with -statFile=; otherwise they are output to stdout. Hochan also mentioned the flag you can use to get more communication info.)

YuxinxinChen commented 2 years ago

Here is the stats file, but there is no stats information for PE 1:

STAT_TYPE, HOST_ID, REGION, CATEGORY, TOTAL_TYPE, TOTAL
STAT, 0, dGraph_Generic, EdgeLoading, HMAX, 107841
STAT, 0, dGraph_Generic, CuSPStateRounds, HOST_0, 100
STAT, 0, dGraph_Generic, EdgeInspection, HMAX, 29854
STAT, 0, dGraph_Generic, GraphReading, HMAX, 2329
STAT, 0, DistBench, GraphConstructTime, HMAX, 146516
STAT, 0, DistBench, TIMER_GRAPH_MARSHAL, HMAX, 8280
STAT, 0, PageRank, ResetGraph_0, HMAX, 1
STAT, 0, PageRank, InitializeGraph_0, HMAX, 1
STAT, 0, PageRank, PageRank_0, HMAX, 10574
STAT, 0, PageRank, NumWorkItems_0, HSUM, 201823910
STAT, 0, PageRank, NumIterations_0, HOST_0, 1046
STAT, 0, PageRank, ResetGraph_1, HMAX, 1
STAT, 0, PageRank, InitializeGraph_1, HMAX, 1
STAT, 0, PageRank, Timer_0, HMAX, 14083
STAT, 0, PageRank, PageRank_1, HMAX, 10763
STAT, 0, PageRank, NumWorkItems_1, HSUM, 201764411
STAT, 0, PageRank, NumIterations_1, HOST_0, 1115
STAT, 0, PageRank, InitializeGraph_2, HMAX, 1
STAT, 0, PageRank, Timer_1, HMAX, 14097
STAT, 0, PageRank, PageRank_2, HMAX, 10932
STAT, 0, PageRank, NumWorkItems_2, HSUM, 202000910
STAT, 0, PageRank, NumIterations_2, HOST_0, 1024
STAT, 0, PageRank, InitializeGraph_3, HMAX, 1
STAT, 0, PageRank, Timer_2, HMAX, 13794
STAT, 0, PageRank, PageRank_3, HMAX, 10792
STAT, 0, PageRank, NumWorkItems_3, HSUM, 201907168
STAT, 0, PageRank, NumIterations_3, HOST_0, 966
STAT, 0, PageRank, ResetGraph_4, HMAX, 21
STAT, 0, PageRank, InitializeGraph_4, HMAX, 1
STAT, 0, PageRank, Timer_3, HMAX, 13477
STAT, 0, PageRank, PageRank_4, HMAX, 10980
STAT, 0, PageRank, NumWorkItems_4, HSUM, 201849735
STAT, 0, PageRank, NumIterations_4, HOST_0, 946
STAT, 0, PageRank, InitializeGraph_5, HMAX, 1
STAT, 0, PageRank, Timer_4, HMAX, 13791
STAT, 0, PageRank, PageRank_5, HMAX, 11154
STAT, 0, PageRank, NumWorkItems_5, HSUM, 201998277
STAT, 0, PageRank, NumIterations_5, HOST_0, 1142
STAT, 0, PageRank, InitializeGraph_6, HMAX, 1
STAT, 0, PageRank, Timer_5, HMAX, 14269
STAT, 0, PageRank, PageRank_6, HMAX, 10076
STAT, 0, PageRank, NumWorkItems_6, HSUM, 193896959
STAT, 0, PageRank, NumIterations_6, HOST_0, 1063
STAT, 0, PageRank, ResetGraph_7, HMAX, 20
STAT, 0, PageRank, InitializeGraph_7, HMAX, 1
STAT, 0, PageRank, Timer_6, HMAX, 13303
STAT, 0, PageRank, PageRank_7, HMAX, 10584
STAT, 0, PageRank, NumWorkItems_7, HSUM, 198799184
STAT, 0, PageRank, NumIterations_7, HOST_0, 932
STAT, 0, PageRank, ResetGraph_8, HMAX, 1
STAT, 0, PageRank, InitializeGraph_8, HMAX, 1
STAT, 0, PageRank, Timer_7, HMAX, 13503
STAT, 0, PageRank, PageRank_8, HMAX, 10444
STAT, 0, PageRank, NumWorkItems_8, HSUM, 200358039
STAT, 0, PageRank, NumIterations_8, HOST_0, 964
STAT, 0, PageRank, InitializeGraph_9, HMAX, 1
STAT, 0, PageRank, Timer_8, HMAX, 13283
STAT, 0, PageRank, PageRank_9, HMAX, 10713
STAT, 0, PageRank, NumWorkItems_9, HSUM, 202030884
STAT, 0, PageRank, NumIterations_9, HOST_0, 1022
STAT, 0, PageRank, Timer_9, HMAX, 13979
STAT, 0, PageRank, TimerTotal, HMAX, 293282
STAT, 0, PageRank, ResetGraph_2, HMAX, 1
STAT, 0, PageRank, ResetGraph_3, HMAX, 1
STAT, 0, PageRank, ResetGraph_5, HMAX, 1
STAT, 0, PageRank, ResetGraph_9, HMAX, 1
STAT, 0, Gluon, ReduceSendBytes_PageRank_0, HSUM, 3713563396
STAT, 0, Gluon, ReduceNumMessages_PageRank_0, HSUM, 348
STAT, 0, Gluon, Sync_PageRank_0, HMAX, 1618
STAT, 0, Gluon, ReduceSendBytes_PageRank_1, HSUM, 3716312728
STAT, 0, Gluon, ReduceNumMessages_PageRank_1, HSUM, 290
STAT, 0, Gluon, Sync_PageRank_1, HMAX, 1353
STAT, 0, Gluon, ReduceSendBytes_PageRank_2, HSUM, 3744828140
STAT, 0, Gluon, ReduceNumMessages_PageRank_2, HSUM, 288
STAT, 0, Gluon, Sync_PageRank_2, HMAX, 1266
STAT, 0, Gluon, ReduceSendBytes_PageRank_3, HSUM, 3713466332
STAT, 0, Gluon, ReduceNumMessages_PageRank_3, HSUM, 365
STAT, 0, Gluon, Sync_PageRank_3, HMAX, 1157
STAT, 0, Gluon, ReduceSendBytes_PageRank_4, HSUM, 3741462848
STAT, 0, Gluon, ReduceNumMessages_PageRank_4, HSUM, 364
STAT, 0, Gluon, Sync_PageRank_4, HMAX, 1143
STAT, 0, Gluon, ReduceSendBytes_PageRank_5, HSUM, 3751941512
STAT, 0, Gluon, ReduceNumMessages_PageRank_5, HSUM, 388
STAT, 0, Gluon, Sync_PageRank_5, HMAX, 1504
STAT, 0, Gluon, ReduceSendBytes_PageRank_6, HSUM, 3489017220
STAT, 0, Gluon, ReduceNumMessages_PageRank_6, HSUM, 395
STAT, 0, Gluon, Sync_PageRank_6, HMAX, 1393
STAT, 0, Gluon, ReduceSendBytes_PageRank_7, HSUM, 3628752704
STAT, 0, Gluon, ReduceNumMessages_PageRank_7, HSUM, 319
STAT, 0, Gluon, Sync_PageRank_7, HMAX, 1219
STAT, 0, Gluon, ReduceSendBytes_PageRank_8, HSUM, 3600264744
STAT, 0, Gluon, ReduceNumMessages_PageRank_8, HSUM, 359
STAT, 0, Gluon, Sync_PageRank_8, HMAX, 1110
STAT, 0, Gluon, ReduceSendBytes_PageRank_9, HSUM, 3723709080
STAT, 0, Gluon, ReduceNumMessages_PageRank_9, HSUM, 333
STAT, 0, Gluon, Sync_PageRank_9, HMAX, 1356
STAT, 0, Gluon, ReplicationFactor, HOST_0, 1.85977
PARAM, 0, DistBench, CommandLine, HOST_0, /ccs/home/yuxinc/Galois/build/gcc-11.1-nvcc-11.4/lonestar/analytics/distributed/pagerank/pagerank-push-dist /gpfs/alpine/bif115/scratch/yuxinc/graph_datasets/twitter/twitter/twitter-ICWSM10-component.gr --num_nodes=2 --partition=oec --pset=g --runs=10 --exec=Async --tolerance=0.01 --statFile=/gpfs/alpine/bif115/scratch/yuxinc/Galois/pagearank-push-dist/pr-o%j.output
PARAM, 0, DistBench, Threads, HOST_0, 1
PARAM, 0, DistBench, Hosts, HOST_0, 2
PARAM, 0, DistBench, Runs, HOST_0, 10
PARAM, 0, DistBench, Run_UUID, HOST_0, dd472ca4-8a16-4966-9930-573dc7646475
PARAM, 0, DistBench, Input, HOST_0, /gpfs/alpine/bif115/scratch/yuxinc/graph_datasets/twitter/twitter/twitter-ICWSM10-component.gr
PARAM, 0, DistBench, PartitionScheme, HOST_0, oec
PARAM, 0, DistBench, Hostname, HOST_0, h30n09
PARAM, 0, PageRank, Max Iterations, HOST_0, 1000
PARAM, 0, PageRank, Tolerance, HOST_0, 0.01
PARAM, 0, dGraph, GenericPartitioner, HOST_0, 0

In particular, I am interested in the load balance across processes and the communication volume between processes. From the stats results, I see no information for PE 1. Besides libgluon/include/galois/graphs/GluonSubstrate.h, are there any tools or stats available for the local workload?

l-hoang commented 2 years ago

If you want per-host timers in the stats file, set GALOIS_PRINT_PER_HOST_STATS=1 when you run the program.

The local workload is captured by the InitializeGraph, PageRank, etc. timers (one for each run); you can get the timer names by looking at the PageRank source and at the timers surrounding each compute phase.
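For example, here is a minimal sketch (not a Galois tool; the column layout is assumed from the stat file posted above) of pulling the per-run PageRank compute timers and Gluon reduce volumes out of a stat file:

# stat_summary.py -- hypothetical helper script, assuming the layout
# STAT_TYPE, HOST_ID, REGION, CATEGORY, TOTAL_TYPE, TOTAL
import csv
import sys

compute = {}  # run index -> PageRank compute time in ms (HMAX across hosts)
comm = {}     # run index -> Gluon ReduceSendBytes summed over hosts (HSUM)

with open(sys.argv[1]) as f:
    for row in csv.reader(f, skipinitialspace=True):
        if len(row) < 6 or row[0] != "STAT":
            continue
        _, _, region, category, total_type, total = row[:6]
        if region == "PageRank" and category.startswith("PageRank_") and total_type == "HMAX":
            compute[int(category.split("_")[1])] = int(total)
        if region == "Gluon" and category.startswith("ReduceSendBytes_") and total_type == "HSUM":
            comm[int(category.rsplit("_", 1)[1])] = int(total)

for run in sorted(compute):
    print(f"run {run}: compute {compute[run]} ms (HMAX), reduce volume {comm.get(run, 0)} bytes (HSUM)")

Run it as python stat_summary.py <statfile> to get one line per run.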

YuxinxinChen commented 2 years ago

I didn't find GALOIS_PRINT_PER_HOST_STATS either macro or CMake flag or environmental variable or variable in your master branch code. Could you explain more about setting GALOIS_PRINT_PER_HOST_STATS=1 when I run the program?

Thanks in advance!

nicelhc13 commented 2 years ago

Now this flag is PRINT_PER_HOST_STATS=1. Could you please use this flag? Below are my command and part of the result:

PRINT_PER_HOST_STATS=1 mpirun -np 2 ./pagerank-push-dist test.tgr --num_nodes=2 --partition=oec --pset=g

STAT, 0, PageRank, NumWorkItems_0, HostValues, 9735; 24846
STAT, 0, PageRank, PageRank_0, HMAX, 16
STAT, 0, PageRank, PageRank_0, HostValues, 16; 12
STAT, 0, PageRank, NumIterations_0, HOST_0, 185
STAT, 0, PageRank, NumIterations_0, HostValues, 185
STAT, 0, PageRank, Timer_0, HMAX, 28             
STAT, 0, PageRank, Timer_0, HostValues, 27; 28

For example, as you can see in STAT, 0, PageRank, Timer_0, HostValues, 27; 28, the first value (27) is the runtime in milliseconds of host 0, and the next (28) is the runtime in milliseconds of host 1.

l-hoang commented 2 years ago

Slight correction: the order of appearance of the HostValues does not correspond to the host; e.g. the first 27 isn't necessarily host 0.

Unfortunately, the stats have no way to distinguish which value belongs to which host at the moment.
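For example, a small sketch (a hypothetical helper, assuming the "CATEGORY, HostValues, v0; v1; ..." layout shown above) that splits such a line into one value per host, without assuming which host each value belongs to:

def parse_host_values(line: str):
    # e.g. "STAT, 0, PageRank, Timer_0, HostValues, 27; 28"
    fields = [f.strip() for f in line.split(",")]
    if len(fields) < 6 or fields[4] != "HostValues":
        return None
    # One value per host; note the order is NOT guaranteed to match host IDs.
    return [float(v) for v in fields[5].split(";")]

print(parse_host_values("STAT, 0, PageRank, Timer_0, HostValues, 27; 28"))  # -> [27.0, 28.0]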

YuxinxinChen commented 2 years ago

Thanks a lot for your help! I am able to print out the information, including workload and time per host. This is convenient, great work! I ran PageRank on a single GPU and on 2 GPUs across 2 nodes. I would expect close to half the runtime on 2 GPUs compared to a single GPU, but I got a similar runtime with 1 GPU and 2 GPUs. Is this normal, or did I do something wrong? (I tried all partition methods and the Async/Sync options, and tried several social network graphs such as twitter, soc-LiveJournal1, and hollywood.)

nicelhc13 commented 2 years ago

It is hard to answer with only this information, but generally it should be scalable. Please check the Gluon paper; it includes GPU scalability results. Please check and understand the time breakdowns in the stat file.

It is possible that the communication overhead outweighs the benefit of distributing the computation.

roshandathathri commented 2 years ago

Are you using GALOIS_DO_NOT_BIND_THREADS=1 mpirun --bind-to none?

YuxinxinChen commented 2 years ago

Are you using GALOIS_DO_NOT_BIND_THREADS=1 mpirun --bind-to none?

No, the command I use is: mpirun -n 2 $ROOT/lonestar/analytics/distributed/pagerank/pagerank-push-dist mygraph.gr --num_nodes=2 --partition=oec --pset=g --exec=Async. I tried both Async and Sync and all partition options.

I am converting twitter40 from this website: https://snap.stanford.edu/data/twitter-2010.html. I will try this dataset and see if it gives better strong-scaling performance. If not, I might be doing something wrong, and I hope I can get help from you.

Thanks!

nicelhc13 commented 2 years ago

Could you please run with the GALOIS_DO_NOT_BIND_THREADS=1 environment variable, as Roshan suggested? It can affect performance of distributed apps.

YuxinxinChen commented 2 years ago

I tried GALOIS_DO_NOT_BIND_THREADS=1 on 2 GPUs across 2 nodes:

STAT_TYPE, HOST_ID, REGION, CATEGORY, TOTAL_TYPE, TOTAL
STAT, 0, dGraph_Generic, EdgeLoading, HMAX, 563
STAT, 0, dGraph_Generic, CuSPStateRounds, HOST_0, 100
STAT, 0, dGraph_Generic, EdgeInspection, HMAX, 688
STAT, 0, dGraph_Generic, GraphReading, HMAX, 213
STAT, 0, DistBench, GraphConstructTime, HMAX, 2205
STAT, 0, DistBench, TIMER_GRAPH_MARSHAL, HMAX, 1083
STAT, 0, PageRank, ResetGraph_0, HMAX, 1
STAT, 0, PageRank, PageRank_0, HMAX, 995
STAT, 0, PageRank, NumWorkItems_0, HSUM, 28967334
STAT, 0, PageRank, NumIterations_0, HOST_0, 56
STAT, 0, PageRank, Timer_0, HMAX, 1013
STAT, 0, PageRank, PageRank_1, HMAX, 965
STAT, 0, PageRank, NumWorkItems_1, HSUM, 28967334
STAT, 0, PageRank, NumIterations_1, HOST_0, 55
STAT, 0, PageRank, Timer_1, HMAX, 985
STAT, 0, PageRank, PageRank_2, HMAX, 964
STAT, 0, PageRank, NumWorkItems_2, HSUM, 28967334
STAT, 0, PageRank, NumIterations_2, HOST_0, 55
STAT, 0, PageRank, Timer_2, HMAX, 985
STAT, 0, PageRank, PageRank_3, HMAX, 965
STAT, 0, PageRank, NumWorkItems_3, HSUM, 28967334
STAT, 0, PageRank, NumIterations_3, HOST_0, 55
STAT, 0, PageRank, Timer_3, HMAX, 985
STAT, 0, PageRank, PageRank_4, HMAX, 964
STAT, 0, PageRank, NumWorkItems_4, HSUM, 28967334
STAT, 0, PageRank, NumIterations_4, HOST_0, 55
STAT, 0, PageRank, Timer_4, HMAX, 985
STAT, 0, PageRank, PageRank_5, HMAX, 963
STAT, 0, PageRank, NumWorkItems_5, HSUM, 28967334
STAT, 0, PageRank, NumIterations_5, HOST_0, 55
STAT, 0, PageRank, Timer_5, HMAX, 985
STAT, 0, PageRank, PageRank_6, HMAX, 965
STAT, 0, PageRank, NumWorkItems_6, HSUM, 28967334
STAT, 0, PageRank, NumIterations_6, HOST_0, 55
STAT, 0, PageRank, Timer_6, HMAX, 985
STAT, 0, PageRank, PageRank_7, HMAX, 964
STAT, 0, PageRank, NumWorkItems_7, HSUM, 28967334
STAT, 0, PageRank, NumIterations_7, HOST_0, 55
STAT, 0, PageRank, Timer_7, HMAX, 985
STAT, 0, PageRank, PageRank_8, HMAX, 965
STAT, 0, PageRank, NumWorkItems_8, HSUM, 28967334
STAT, 0, PageRank, NumIterations_8, HOST_0, 55
STAT, 0, PageRank, Timer_8, HMAX, 985
STAT, 0, PageRank, PageRank_9, HMAX, 965
STAT, 0, PageRank, NumWorkItems_9, HSUM, 28967334
STAT, 0, PageRank, NumIterations_9, HOST_0, 55
STAT, 0, PageRank, Timer_9, HMAX, 985
STAT, 0, PageRank, TimerTotal, HMAX, 13182
STAT, 0, Gluon, ReduceNumMessages_PageRank_0, HSUM, 0
STAT, 0, Gluon, ReduceNumMessages_PageRank_1, HSUM, 0
STAT, 0, Gluon, ReduceNumMessages_PageRank_2, HSUM, 0
STAT, 0, Gluon, ReduceNumMessages_PageRank_3, HSUM, 0
STAT, 0, Gluon, ReduceNumMessages_PageRank_4, HSUM, 0
STAT, 0, Gluon, ReduceNumMessages_PageRank_5, HSUM, 0
STAT, 0, Gluon, ReduceNumMessages_PageRank_6, HSUM, 0
STAT, 0, Gluon, ReduceNumMessages_PageRank_7, HSUM, 0
STAT, 0, Gluon, ReduceNumMessages_PageRank_8, HSUM, 0
STAT, 0, Gluon, ReduceNumMessages_PageRank_9, HSUM, 0
STAT, 0, Gluon, ReplicationFactor, HOST_0, 1
PARAM, 0, DistBench, CommandLine, HOST_0, /ccs/home/yuxinc/Galois/build/gcc-11.1-nvcc-11.4/lonestar/analytics/distributed/pagerank/pagerank-push-dist /gpfs/alpine/bif115/scratch/yuxinc/graph_datasets/soc-LiveJournal1/soc-LiveJournal1.gr --partition=oec --pset=g --runs=10 --exec=Async --tolerance=0.01 --graphTranspose=/gpfs/alpine/bif115/scratch/yuxinc/graph_datasets/soc-LiveJournal1/soc-LiveJournal1.tgr
PARAM, 0, DistBench, Threads, HOST_0, 1
PARAM, 0, DistBench, Hosts, HOST_0, 2
PARAM, 0, DistBench, Runs, HOST_0, 10
PARAM, 0, DistBench, Run_UUID, HOST_0, d9dd8ff2-ad8f-4a5b-b6b1-1810266f41e9
PARAM, 0, DistBench, Input, HOST_0, /gpfs/alpine/bif115/scratch/yuxinc/graph_datasets/soc-LiveJournal1/soc-LiveJournal1.gr
PARAM, 0, DistBench, PartitionScheme, HOST_0, oec
PARAM, 0, DistBench, Hostname, HOST_0, g35n12
PARAM, 0, PageRank, Max Iterations, HOST_0, 1000
PARAM, 0, PageRank, Tolerance, HOST_0, 0.01
PARAM, 0, dGraph, GenericPartitioner, HOST_0, 0

I think the relevant time is the STAT, 0, PageRank, Timer_N, HMAX entries; averaging over the 10 runs gives 987.8 ms.

Here is the single GPU run:

STAT_TYPE, HOST_ID, REGION, CATEGORY, TOTAL_TYPE, TOTAL
STAT, 0, dGraph_Generic, EdgeLoading, HMAX, 820
STAT, 0, dGraph_Generic, CuSPStateRounds, HOST_0, 100
STAT, 0, dGraph_Generic, EdgeInspection, HMAX, 812
STAT, 0, dGraph_Generic, GraphReading, HMAX, 199
STAT, 0, DistBench, GraphConstructTime, HMAX, 2580
STAT, 0, DistBench, TIMER_GRAPH_MARSHAL, HMAX, 1229
STAT, 0, PageRank, PageRank_0, HMAX, 1083
STAT, 0, PageRank, NumWorkItems_0, HSUM, 28967334
STAT, 0, PageRank, NumIterations_0, HOST_0, 55
STAT, 0, PageRank, Timer_0, HMAX, 1137
STAT, 0, PageRank, PageRank_1, HMAX, 1032
STAT, 0, PageRank, NumWorkItems_1, HSUM, 28967334
STAT, 0, PageRank, NumIterations_1, HOST_0, 55
STAT, 0, PageRank, Timer_1, HMAX, 1051
STAT, 0, PageRank, PageRank_2, HMAX, 1044
STAT, 0, PageRank, NumWorkItems_2, HSUM, 28967334
STAT, 0, PageRank, NumIterations_2, HOST_0, 55
STAT, 0, PageRank, Timer_2, HMAX, 1087
STAT, 0, PageRank, PageRank_3, HMAX, 1000
STAT, 0, PageRank, NumWorkItems_3, HSUM, 28967334
STAT, 0, PageRank, NumIterations_3, HOST_0, 55
STAT, 0, PageRank, Timer_3, HMAX, 1039
STAT, 0, PageRank, PageRank_4, HMAX, 1037
STAT, 0, PageRank, NumWorkItems_4, HSUM, 28967334
STAT, 0, PageRank, NumIterations_4, HOST_0, 55
STAT, 0, PageRank, Timer_4, HMAX, 1057
STAT, 0, PageRank, PageRank_5, HMAX, 1027
STAT, 0, PageRank, NumWorkItems_5, HSUM, 28967334
STAT, 0, PageRank, NumIterations_5, HOST_0, 55
STAT, 0, PageRank, Timer_5, HMAX, 1046
STAT, 0, PageRank, PageRank_6, HMAX, 1012
STAT, 0, PageRank, NumWorkItems_6, HSUM, 28967334
STAT, 0, PageRank, NumIterations_6, HOST_0, 55
STAT, 0, PageRank, Timer_6, HMAX, 1032
STAT, 0, PageRank, PageRank_7, HMAX, 1035
STAT, 0, PageRank, NumWorkItems_7, HSUM, 28967334
STAT, 0, PageRank, NumIterations_7, HOST_0, 55
STAT, 0, PageRank, Timer_7, HMAX, 1091
STAT, 0, PageRank, PageRank_8, HMAX, 1032
STAT, 0, PageRank, NumWorkItems_8, HSUM, 28967334
STAT, 0, PageRank, NumIterations_8, HOST_0, 55
STAT, 0, PageRank, Timer_8, HMAX, 1051
STAT, 0, PageRank, PageRank_9, HMAX, 1052
STAT, 0, PageRank, NumWorkItems_9, HSUM, 28967334
STAT, 0, PageRank, NumIterations_9, HOST_0, 55
STAT, 0, PageRank, Timer_9, HMAX, 1072
STAT, 0, PageRank, TimerTotal, HMAX, 14484
STAT, 0, Gluon, ReduceNumMessages_PageRank_0, HSUM, 0
STAT, 0, Gluon, ReduceNumMessages_PageRank_1, HSUM, 0
STAT, 0, Gluon, ReduceNumMessages_PageRank_2, HSUM, 0
STAT, 0, Gluon, ReduceNumMessages_PageRank_3, HSUM, 0
STAT, 0, Gluon, ReduceNumMessages_PageRank_4, HSUM, 0
STAT, 0, Gluon, ReduceNumMessages_PageRank_5, HSUM, 0
STAT, 0, Gluon, ReduceNumMessages_PageRank_6, HSUM, 0
STAT, 0, Gluon, ReduceNumMessages_PageRank_7, HSUM, 0
STAT, 0, Gluon, ReduceNumMessages_PageRank_8, HSUM, 0
STAT, 0, Gluon, ReduceNumMessages_PageRank_9, HSUM, 0
STAT, 0, Gluon, ReplicationFactor, HOST_0, 1
PARAM, 0, DistBench, CommandLine, HOST_0, /ccs/home/yuxinc/Galois/build/gcc-11.1-nvcc-11.4/lonestar/analytics/distributed/pagerank/pagerank-push-dist /gpfs/alpine/bif115/scratch/yuxinc/graph_datasets/soc-LiveJournal1/soc-LiveJournal1.gr --pset=g --runs=10 --tolerance=0.01 --graphTranspose=/gpfs/alpine/bif115/scratch/yuxinc/graph_datasets/soc-LiveJournal1/soc-LiveJournal1.tgr
PARAM, 0, DistBench, Threads, HOST_0, 1
PARAM, 0, DistBench, Hosts, HOST_0, 1
PARAM, 0, DistBench, Runs, HOST_0, 10
PARAM, 0, DistBench, Run_UUID, HOST_0, 1fa420a5-fd01-4ca3-919a-2ceca2c72b0b
PARAM, 0, DistBench, Input, HOST_0, /gpfs/alpine/bif115/scratch/yuxinc/graph_datasets/soc-LiveJournal1/soc-LiveJournal1.gr
PARAM, 0, DistBench, PartitionScheme, HOST_0, oec
PARAM, 0, DistBench, Hostname, HOST_0, b17n13
PARAM, 0, PageRank, Max Iterations, HOST_0, 1000
PARAM, 0, PageRank, Tolerance, HOST_0, 0.01
PARAM, 0, dGraph, GenericPartitioner, HOST_0, 0

The average time is 1066.3 ms, and the strong scaling number is 0.92. These runs are on V100 GPUs.
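As a quick sanity check, here is a rough sketch of the same arithmetic using the Timer_N (HMAX) values copied from the two stat files above (in milliseconds):

# Average end-to-end Timer_N (HMAX) per configuration, then compare.
two_gpu = [1013, 985, 985, 985, 985, 985, 985, 985, 985, 985]            # 2-GPU stat file above
one_gpu = [1137, 1051, 1087, 1039, 1057, 1046, 1032, 1091, 1051, 1072]   # single-GPU stat file above

avg2 = sum(two_gpu) / len(two_gpu)   # ~987.8 ms
avg1 = sum(one_gpu) / len(one_gpu)   # ~1066.3 ms
print(avg1, avg2, avg1 / avg2)       # only about a 1.08x speedup on 2 GPUs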

YuxinxinChen commented 2 years ago

I ran twitter40: on a single GPU the runtime is 11238.8 ms, and on 2 GPUs it is 7248.0 ms with the oec partition. This scaling makes more sense to me. Do you get similar performance?
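For reference, that corresponds to a speedup of roughly 11238.8 / 7248.0 ≈ 1.55x on 2 GPUs, versus the ~1.08x observed on soc-LiveJournal1 above.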