Onotate commented 3 months ago

Proposal Issue: #2 (This issue also contains the performance metrics of the original code)

Code changes:

Modify the type of adjacency_matrix from double to short int.
Uses memset and memcpy to replace array initialization and copy loops.
Modified the page rank update loop to doubly nested instead of triply nested.

Further explanation of the changes and performance metrics will be posted in later comments.

Onotate commented 3 months ago

Explanation and Metrics of the changes

Change 1: Using `memset` and `memcpy` instead of loops for array initialization and copy

This change creates no effect other than shorter code. The original thought to this change is that using a library function instead of writing a loop could lead to a performance gain; however, this may not apply to C or the compiler did some optimization leg work.

Change 2: Replaced triply nested update loop

Here is the original page rank update loop:

for(int i = 0; i < GRAPH_ORDER; i++)
{
    for(int j = 0; j < GRAPH_ORDER; j++)
    {
        if (adjacency_matrix[j][i] == 1.0)
        {
            int outdegree = 0;

            for(int k = 0; k < GRAPH_ORDER; k++)
            {
                if (adjacency_matrix[j][k] == 1.0)
                {
                    outdegree++;
                }
            }
            new_pagerank[i] += pagerank[j] / (double)outdegree;
        }
    }
}

This code here uses a triply nested for loop and also accessing the graph is a column order. The new code change aims to improve that:

for (int i = 0; i < GRAPH_ORDER; i++) {
      int outdegree = 0;

      for (int k = 0; k < GRAPH_ORDER; k++)
        if (adjacency_matrix[i][k] == 1) outdegree++;

      for (int j = 0; j < GRAPH_ORDER; j++)
        if (adjacency_matrix[i][j] == 1)
          new_pagerank[j] += pagerank[i] / (double)outdegree;
}

Here, there is only 2 levels of nested loops and the array access is in a row order.

This changes result in the biggest performance improvement to the original code. Here are the iterations completed when using the nice graph:

Original code: 21 iterations.
Code with new update loop, still using double: 1195 iterations.

Change 3: Using `short int` over `double` for the adjacency matrix

Since the graph is unweighted, the adjacency matrix does not need to store values other than 1's and 0's, so the use of double type is unnecessary and can slow down comparisons.

The earliest commit changes the type of the adjacency matrix within the original code from double to boolean from stdbool.h, but this results in a performance decrease. These are number of iterations completed by the original code on the nice graph:

Original code, using double: 21 iterations.
Original code, using boolean: 12 iterations.

The latest commit to this branch chose short int as the data type of choice after comparing performance of double, int, char, boolean, and short int. Here are the number of iterations completed using the nice graph:

Data type	Iterations completed
`boolean`	1155
`double`	1195
`int`	1233
`short int`	1239

Final Performance Report

Using generate_nice_graph:

0.00 seconds to generate the graph.
1239 iterations achieved in 10.00 seconds
PageRank of vertex 0: 0.001000
PageRank of vertex 100: 0.001000
PageRank of vertex 200: 0.001000
PageRank of vertex 300: 0.001000
PageRank of vertex 400: 0.001000
PageRank of vertex 500: 0.001000
PageRank of vertex 600: 0.001000
PageRank of vertex 700: 0.001000
PageRank of vertex 800: 0.001000
PageRank of vertex 900: 0.001000
Sum of all pageranks = 1.000000000000, total diff = 0.000000000000, max diff = 0.000000000000 and min diff = 0.000000000000.
Total time taken: 9.9995 seconds.

Using generate_sneaky_graph:

0.00 seconds to generate the graph.
2223 iterations achieved in 10.00 seconds
PageRank of vertex 0: 0.002541
PageRank of vertex 100: 0.001726
PageRank of vertex 200: 0.001494
PageRank of vertex 300: 0.001302
PageRank of vertex 400: 0.001127
PageRank of vertex 500: 0.000962
PageRank of vertex 600: 0.000800
PageRank of vertex 700: 0.000641
PageRank of vertex 800: 0.000483
PageRank of vertex 900: 0.000322
Sum of all pageranks = 1.000000000000, total diff = 1.168591825342, max diff = 0.624703182089 and min diff = 0.000000000000.
Total time taken: 9.9972 seconds.

Onotate commented 3 months ago

Update on the latest change to the page rank update loop

The previous update loop goes through each source and update the page ranks of its outgoing destination. This is not friendly for parallelization, since multiple sources can have the same destination, potentially making the page rank of the destination a critical region.

The new update loop goes through each destination and update its page rank using incoming sources. This enables multiple destinations to update their page rank in parallel. The outgoing degrees of all nodes are able to be pre-calculated ahead since the graph does not changed. Here is the new update loop:

for (int i = 0; i < GRAPH_ORDER; i++)
    for (int j = 0; j < GRAPH_ORDER; j++)
        if (adjacency_matrix[j][i] == 1)
            new_pagerank[i] += pagerank[j] / (double)outdegrees[j];

This new update loop does access the array in a column major order, which is not cache friendly. We can mitigate this by working on a transposed adjacency_matrix (either initialized the matrix in the transposed form, or transpose before calculation), but the performance improvement may be minimal.

The performance impact of this new update loop is minimal:

Using generate_nice_graph:

0.00 seconds to generate the graph.
1267 iterations achieved in 10.00 seconds
PageRank of vertex 0: 0.001000
PageRank of vertex 100: 0.001000
PageRank of vertex 200: 0.001000
PageRank of vertex 300: 0.001000
PageRank of vertex 400: 0.001000
PageRank of vertex 500: 0.001000
PageRank of vertex 600: 0.001000
PageRank of vertex 700: 0.001000
PageRank of vertex 800: 0.001000
PageRank of vertex 900: 0.001000
Sum of all pageranks = 1.000000000000, total diff = 0.000000000000, max diff = 0.000000000000 and min diff = 0.000000000000.
Total time taken: 9.9992 seconds.

Using generate_sneaky_graph:

0.00 seconds to generate the graph.
2221 iterations achieved in 10.00 seconds
PageRank of vertex 0: 0.002541
PageRank of vertex 100: 0.001726
PageRank of vertex 200: 0.001494
PageRank of vertex 300: 0.001302
PageRank of vertex 400: 0.001127
PageRank of vertex 500: 0.000962
PageRank of vertex 600: 0.000800
PageRank of vertex 700: 0.000641
PageRank of vertex 800: 0.000483
PageRank of vertex 900: 0.000322
Sum of all pageranks = 1.000000000000, total diff = 1.168591825342, max diff = 0.624703182089 and min diff = 0.000000000000.
Total time taken: 9.9991 seconds.

pedramhaqiqi commented 3 months ago

LGTM 🔥

Onotate / IHPCSS_challenges

Serial code memory improvement #3

Explanation and Metrics of the changes

Change 1: Using `memset` and `memcpy` instead of loops for array initialization and copy

Change 2: Replaced triply nested update loop

Change 3: Using `short int` over `double` for the adjacency matrix

Final Performance Report

Update on the latest change to the page rank update loop

Onotate / IHPCSS_challenges

Serial code memory improvement #3

Explanation and Metrics of the changes

Change 1: Using memset and memcpy instead of loops for array initialization and copy

Change 2: Replaced triply nested update loop

Change 3: Using short int over double for the adjacency matrix

Final Performance Report

Update on the latest change to the page rank update loop

Change 1: Using `memset` and `memcpy` instead of loops for array initialization and copy

Change 3: Using `short int` over `double` for the adjacency matrix