Closed Onotate closed 3 months ago
memset
and memcpy
instead of loops for array initialization and copyThis change creates no effect other than shorter code. The original thought to this change is that using a library function instead of writing a loop could lead to a performance gain; however, this may not apply to C or the compiler did some optimization leg work.
Here is the original page rank update loop:
for(int i = 0; i < GRAPH_ORDER; i++)
{
for(int j = 0; j < GRAPH_ORDER; j++)
{
if (adjacency_matrix[j][i] == 1.0)
{
int outdegree = 0;
for(int k = 0; k < GRAPH_ORDER; k++)
{
if (adjacency_matrix[j][k] == 1.0)
{
outdegree++;
}
}
new_pagerank[i] += pagerank[j] / (double)outdegree;
}
}
}
This code here uses a triply nested for loop and also accessing the graph is a column order. The new code change aims to improve that:
for (int i = 0; i < GRAPH_ORDER; i++) {
int outdegree = 0;
for (int k = 0; k < GRAPH_ORDER; k++)
if (adjacency_matrix[i][k] == 1) outdegree++;
for (int j = 0; j < GRAPH_ORDER; j++)
if (adjacency_matrix[i][j] == 1)
new_pagerank[j] += pagerank[i] / (double)outdegree;
}
Here, there is only 2 levels of nested loops and the array access is in a row order.
This changes result in the biggest performance improvement to the original code. Here are the iterations completed when using the nice graph:
double
: 1195 iterations.short int
over double
for the adjacency matrixSince the graph is unweighted, the adjacency matrix does not need to store values other than 1's and 0's, so the use of double
type is unnecessary and can slow down comparisons.
The earliest commit changes the type of the adjacency matrix within the original code from double
to boolean
from stdbool.h
, but this results in a performance decrease. These are number of iterations completed by the original code on the nice graph:
double
: 21 iterations.boolean
: 12 iterations.The latest commit to this branch chose short int
as the data type of choice after comparing performance of double
, int
, char
, boolean
, and short int
. Here are the number of iterations completed using the nice graph:
Data type | Iterations completed |
---|---|
boolean |
1155 |
double |
1195 |
int |
1233 |
short int |
1239 |
Using generate_nice_graph
:
0.00 seconds to generate the graph.
1239 iterations achieved in 10.00 seconds
PageRank of vertex 0: 0.001000
PageRank of vertex 100: 0.001000
PageRank of vertex 200: 0.001000
PageRank of vertex 300: 0.001000
PageRank of vertex 400: 0.001000
PageRank of vertex 500: 0.001000
PageRank of vertex 600: 0.001000
PageRank of vertex 700: 0.001000
PageRank of vertex 800: 0.001000
PageRank of vertex 900: 0.001000
Sum of all pageranks = 1.000000000000, total diff = 0.000000000000, max diff = 0.000000000000 and min diff = 0.000000000000.
Total time taken: 9.9995 seconds.
Using generate_sneaky_graph
:
0.00 seconds to generate the graph.
2223 iterations achieved in 10.00 seconds
PageRank of vertex 0: 0.002541
PageRank of vertex 100: 0.001726
PageRank of vertex 200: 0.001494
PageRank of vertex 300: 0.001302
PageRank of vertex 400: 0.001127
PageRank of vertex 500: 0.000962
PageRank of vertex 600: 0.000800
PageRank of vertex 700: 0.000641
PageRank of vertex 800: 0.000483
PageRank of vertex 900: 0.000322
Sum of all pageranks = 1.000000000000, total diff = 1.168591825342, max diff = 0.624703182089 and min diff = 0.000000000000.
Total time taken: 9.9972 seconds.
The previous update loop goes through each source and update the page ranks of its outgoing destination. This is not friendly for parallelization, since multiple sources can have the same destination, potentially making the page rank of the destination a critical region.
The new update loop goes through each destination and update its page rank using incoming sources. This enables multiple destinations to update their page rank in parallel. The outgoing degrees of all nodes are able to be pre-calculated ahead since the graph does not changed. Here is the new update loop:
for (int i = 0; i < GRAPH_ORDER; i++)
for (int j = 0; j < GRAPH_ORDER; j++)
if (adjacency_matrix[j][i] == 1)
new_pagerank[i] += pagerank[j] / (double)outdegrees[j];
This new update loop does access the array in a column major order, which is not cache friendly. We can mitigate this by working on a transposed adjacency_matrix
(either initialized the matrix in the transposed form, or transpose before calculation), but the performance improvement may be minimal.
The performance impact of this new update loop is minimal:
generate_nice_graph
:
0.00 seconds to generate the graph.
1267 iterations achieved in 10.00 seconds
PageRank of vertex 0: 0.001000
PageRank of vertex 100: 0.001000
PageRank of vertex 200: 0.001000
PageRank of vertex 300: 0.001000
PageRank of vertex 400: 0.001000
PageRank of vertex 500: 0.001000
PageRank of vertex 600: 0.001000
PageRank of vertex 700: 0.001000
PageRank of vertex 800: 0.001000
PageRank of vertex 900: 0.001000
Sum of all pageranks = 1.000000000000, total diff = 0.000000000000, max diff = 0.000000000000 and min diff = 0.000000000000.
Total time taken: 9.9992 seconds.
generate_sneaky_graph
:
0.00 seconds to generate the graph.
2221 iterations achieved in 10.00 seconds
PageRank of vertex 0: 0.002541
PageRank of vertex 100: 0.001726
PageRank of vertex 200: 0.001494
PageRank of vertex 300: 0.001302
PageRank of vertex 400: 0.001127
PageRank of vertex 500: 0.000962
PageRank of vertex 600: 0.000800
PageRank of vertex 700: 0.000641
PageRank of vertex 800: 0.000483
PageRank of vertex 900: 0.000322
Sum of all pageranks = 1.000000000000, total diff = 1.168591825342, max diff = 0.624703182089 and min diff = 0.000000000000.
Total time taken: 9.9991 seconds.
LGTM 🔥
Proposal Issue: #2 (This issue also contains the performance metrics of the original code)
Code changes:
adjacency_matrix
fromdouble
toshort int
.memset
andmemcpy
to replace array initialization and copy loops.Further explanation of the changes and performance metrics will be posted in later comments.