compdyn / partmc

Particle-resolved stochastic atmospheric aerosol model
http://lagrange.mechse.illinois.edu/partmc/
GNU General Public License v2.0

Multicells GPU linear solver #139

Open cguzman95 opened 4 years ago

cguzman95 commented 4 years ago

First of all, we should set the context of this issue. We will focus on the linear solver, but these concepts apply to the whole ODE solver.

Our objective in implementing a GPU linear solver is to avoid all synchronization with the CPU (in other words, all data movement between CPU and GPU during the solving process).

This requires us to change the current implementation, where a CPU linear solver function calls multiple kernels on the GPU (each of these kernels calls a basic GPU function defined in our GPU library as a “global” function). In that implementation, the iterative loop runs on the CPU. While launching multiple kernels from the CPU side is not a problem in itself, the loop needs to check the local error on every iteration to decide whether to repeat the loop or whether the method has converged. This error variable should therefore stay on the GPU to avoid communication with the CPU, which means the loop itself should run on the GPU. In conclusion, we need to execute the whole iterative solver function on the GPU (as a “global” or “device” function).

So, we declare the iterative solver function as a “global” function. We also need to change the declaration of the library functions, since they are declared as “global” functions and a “global” function can only be called from the host. We therefore duplicate the library functions, changing the definition from “global” to “device” so they can be called from the device.
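As a rough sketch of what this looks like (the names `cudaDeviceaxpy` and `solveBCG` are illustrative, not the actual CAMP/PartMC code):

```cuda
// Library operation duplicated as a "device" function so it can be called
// from inside a kernel (the original "global" version is kept for host launches).
__device__ void cudaDeviceaxpy(double *y, const double *x, double a, int n)
{
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n) y[i] += a * x[i];
}

// Iterative solver declared as a "global" function: the convergence loop and
// the error variable stay on the GPU, so no copy back to the CPU is needed
// to decide whether to keep iterating.
__global__ void solveBCG(double *x, double *p, int n, double tol, int max_iter)
{
  for (int it = 0; it < max_iter; it++) {
    cudaDeviceaxpy(x, p, 1.0, n);  // device-to-device call, no extra kernel launch
    // ... compute the residual norm into a variable kept in GPU memory ...
    // if (residual_norm < tol) break;  // convergence check done on the device
  }
}
```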

Now we have to deal with a new problem. The iterative function includes multiple functions of type “reduce”, where an array is reduced to a single value. This reduction requires communication between GPU blocks, which is costly and can consume more than 50% of the overall execution time. Some studies have found alternatives to the simplest form of block communication (an atomicAdd). These alternatives can be summarized in the following techniques:

  1. Using the atomicAdd function as the barrier (basic algorithm and GPU tree-based algorithm); a minimal sketch of this approach follows the list. The percentage of synchronization time is reduced by between 5% and 10% for 3 different algorithm types.
  2. Using global memory as the barrier. The algorithm stalls block execution until the global variable is updated with the value expected from the other blocks. The percentage of synchronization time is reduced by up to 20%. The polling on the global variable can be replaced with semaphore layers, achieving up to a 9.3x speedup over an atomicAdd implementation, but at the risk of write errors.
  3. Synchronization with the CPU. This method uses the kernel launch as a global synchronization point. It is the approach recommended by Nvidia in its reduction tutorial, and it also gives up to a 6x speedup in other studies.
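For reference, a minimal sketch of technique 1 (a shared-memory tree reduction per block, finished with one atomicAdd per block; this is the textbook pattern, not the CAMP code):

```cuda
// Each block reduces its slice in shared memory; thread 0 then adds the
// partial sum into a single global accumulator. The atomicAdd is the only
// point of inter-block communication.
// Assumes a power-of-two block size; atomicAdd on double needs sm_60 or newer.
__global__ void reduceSum(const double *in, double *out, int n)
{
  extern __shared__ double sdata[];
  unsigned int tid = threadIdx.x;
  unsigned int i = blockIdx.x * blockDim.x + threadIdx.x;

  sdata[tid] = (i < n) ? in[i] : 0.0;
  __syncthreads();

  // Tree reduction inside the block (no inter-block traffic yet).
  for (unsigned int s = blockDim.x / 2; s > 0; s >>= 1) {
    if (tid < s) sdata[tid] += sdata[tid + s];
    __syncthreads();
  }

  if (tid == 0) atomicAdd(out, sdata[0]);
}

// Launch example:
// reduceSum<<<blocks, threads, threads * sizeof(double)>>>(in, out, n);
```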

Option 3 is the fastest, but it is out of our scope since our objective is to avoid synchronization with the CPU. Option 2 seems the most viable, but the extra time spent in synchronization (up to 30%) would still be significant on the GPU. This encourages us to develop an alternative specific to our type of problem: the multi-cells implementation for chemistry systems.

Our parallel implementation parallelizes over the number of species, assigning one GPU thread to each species. A chemical mechanism is typically composed of far fewer species than the GPU threads available per block (fewer than a hundred species versus 1024 threads per block in every recent CUDA compute capability). Also, each cell computed in the multi-cells case is independent of the others. So, each cell fits within a single block and does not need to interact with the values of other cells. Thus, in theory we do not need to perform any communication between blocks.
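A minimal sketch of that mapping (one block per cell, one thread per species; names are illustrative, not the actual CAMP code):

```cuda
// One block per cell, one thread per species. The state for a cell occupies a
// contiguous slice of the array, so each thread only touches its own entry and
// any per-cell reduction only needs __syncthreads(), never inter-block traffic.
__global__ void cellKernel(double *state, int n_species)
{
  int cell = blockIdx.x;    // each block owns exactly one cell
  int spec = threadIdx.x;   // each thread owns exactly one species
  if (spec < n_species) {
    int i = cell * n_species + spec;
    // ... update state[i] for this cell/species pair ...
  }
}

// Launch: one block per cell, block size = number of species (<= 1024):
// cellKernel<<<n_cells, n_species>>>(state, n_species);
```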

cguzman95 commented 4 years ago

Working on this issue in develop_129_branch, also now studying how to implement the multi-cells idea properly

cguzman95 commented 4 years ago

Multi-cells idea implemented

Taking into account these two factors:

1- Cells are independent from each other.
2- Each cell is composed of fewer species than the maximum thread capacity per block on the GPU.

We can assign each cell to a different GPU block and run the linear solving algorithm within each block. In this way, we avoid communication with other GPU blocks (one of the most costly GPU operations). This concept is equivalent to launching multiple linear solver instances, each with only one cell, with the detail that these instances live in independent blocks on the GPU.

Now that the basic idea is explained, we should add that instead of computing only one cell per block, we will compute a group of cells in each block in order to use the maximum number of GPU threads available.
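A sketch of the index arithmetic when several cells share one block (assuming, for simplicity, that cells are never split across blocks; names are illustrative):

```cuda
// Pack several cells into one block so the block is well occupied even when
// the mechanism has few species. Each cell still lives entirely inside one
// block, so no inter-block communication is introduced.
__global__ void blockCellsKernel(double *state, int n_species, int n_cells)
{
  int cells_per_block = blockDim.x / n_species;            // e.g. 1024 / 76 = 13
  int cell = blockIdx.x * cells_per_block + threadIdx.x / n_species;
  int spec = threadIdx.x % n_species;

  // Threads beyond the last whole cell in the block stay idle.
  if (cell < n_cells && threadIdx.x < cells_per_block * n_species) {
    int i = cell * n_species + spec;
    // ... per-cell work on state[i]; reductions involve only the threads
    //     belonging to this cell (a slice of the block) ...
  }
}
```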

Notice that the linear solving now avoids interactions with cells from other blocks, but the rest of the ODE solving still computes all the cells together. So, at the moment, the current implementation can produce more iterations of the ODE algorithm, since our linear solver no longer takes into account the contributions of all cells. However, the idea for the future is to translate the whole ODE algorithm to this notion of cell independence. This will improve the execution time and will bring the results closer to the original one-cell concept of solving the cells one by one.

cguzman95 commented 4 years ago

Testing the new block-cells implementation

A test was developed with the objective of checking the accuracy error between the original multi-cells version and the block-cells implementation:

How do we define the accuracy error?

The sum of the internal linear solver iterations divided by the number of ODE iterations. It represents how many iterations, on average, the linear solver needs to converge over the total number of calls to the linear solver.
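Written explicitly, the reported value is:

$$\text{average iterations} = \frac{\sum_{\text{calls}} (\text{linear solver iterations})}{\text{number of ODE iterations}}$$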

I configured multiple experiments to check different input cases. In these experiments, apart from testing the multi-cells and block-cells implementations, we also compare two instances of the multi-cells method, in order to:

1- Check that our test is correct.
2- Check whether there are divergence errors by default. The method has no element of randomness (the initial values are set to the same value and no random variables are involved), but some divergence can come from a bad configuration that produces thread overlap (a race condition), where one thread sometimes reaches a variable before another and changes the results. This case can be difficult to detect, and this test will help us with it (a sketch of the comparison itself is shown below).
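As an illustration of what the comparison does (not the actual test code), the check boils down to an element-wise difference between the state vectors produced by two runs:

```cuda
#include <math.h>

// Host-side helper: largest absolute difference between the outputs of two
// runs; a nonzero value indicates divergence between the implementations.
double max_abs_diff(const double *out_a, const double *out_b, int n)
{
  double max_diff = 0.0;
  for (int i = 0; i < n; i++) {
    double d = fabs(out_a[i] - out_b[i]);
    if (d > max_diff) max_diff = d;
  }
  return max_diff;
}
```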

Common test configuration:

Configuration 1:

Results:

Comparing Multi-cells vs Multi-cells:

The accuracy error was zero (results were exactly the same). So our test is correct and there are no divergence errors.

Comparing Multi-cells vs Block-cells:

There seems to be some divergence between the two implementations (a maximum error of e-12 in the output values across all the ODE iterations). This divergence should be produced by the change of definition towards the cell-independence idea (we now communicate only among cells in the same GPU block, instead of across all blocks, so the contributions from other blocks are lost). However, the error is smaller than the convergence error defined in the linear solver (e-10). Also, this error is reduced when the convergence requirement of the linear solver is tightened (e.g., tightening it to e-13 brings the maximum error down to e-15).

Configuration 2:

Results:

Comparing Multi-cells vs Multi-cells:

A small divergence error appears (e-20). Taking into account that from configuration 1 we deduced that our test is correct, there is probably some thread overlap that modifies the results slightly. Notice that the input in this test contains notably more species per cell than mock_monarch_1 (76 in cb05cl_ae_big versus 3 in monarch1), so it is more sensitive to thread overlap.

Comparing Multi-cells vs Block-cells:

The average number of linear solver iterations increases by a factor of ~16 (from 6 to 100). It seems that more complex systems result in more iterations, so the communication with other cells affects the values more in this case.

On the other hand, the error is bigger than in configuration 1 (e-12), but it is still under the configured linear solver error (e-10). So the conclusion is the same: the error is produced by the reduced communication between blocks, but it is under the configured limit.

Configuration 3:

Results:

Comparing Multi-cells vs Multi-cells:

Now the error is increased. This means that the system is more sensitive to thread overlap when there is more divergence in the initial conditions.

Comparing Multi-cells vs Block-cells:

Notice that the first iteration reaches a large error (e+0), but this error is corrected in subsequent iterations, reaching e-11 at the last iteration. So, when the divergence between cells is large, the block-cells method differs a lot from multi-cells, but the rest of the ODE algorithm corrects this error in later iterations, ending in similar results.

We also want to remark that the number of ODE solving iterations increases in this test. So, when the divergence is large, the block-cells method differs notably, especially in the first iteration. But the ODE algorithm does not tolerate this difference, resulting in more ODE iterations that correct this divergence.

Final conclusions

1- There is some thread overlap, which we can classify as a “bug”. We will report this bug to Guillermo (who shared the GPU library with us) in order to find the source of the problem.
2- The block-cells method does not give an error above the configured limit, except when there is a big difference between cells. In any case, this only leads to more ODE solving iterations and ends with an accuracy error under our accepted level, so it can be applied until the rest of the ODE solving is adapted to the cell-independence idea. Moreover, the number of these extra ODE iterations can be reduced with a proper preconditioner for these high cell-divergence cases.

cguzman95 commented 4 years ago

Profiling

Following the configuration mentioned before, I profiled the new and old linear solver methods. I also changed the naming convention a bit:

Results: image

We can see that the speedup is positive for all numbers of cells, but it decreases when using a large number of cells (from 10,000 to 100,000). This happens because for these large numbers of cells the bottleneck is memory rather than communication. Also, in these measurements there is still some synchronization between blocks (more info in #141). All these factors make the achieved speedup less remarkable in these cases.

The next image shows the execution time percentage for the most relevant functions during ODE solving. Description of functions:

image

As we can see, the Deriv function takes a large share of the execution time (34.22% when called during NewtonIt and 10.47% when called from LinSolve). The optimization of the derivative function will be done after merging my branch with the latest changes in the chem_mod branch, since those changes modify the calculation of the Derivative and the Jac.

For the linear solving part, after all our optimizations it takes only 1.88% of the total execution time, so it is no longer a big issue.

Thus, leaving aside the Derivative, Jac, and BiConjGrad, the most interesting part to optimize is the "Others" functions. For the moment we will not consider the "Others" part of the cvStep function, since it breaks down into many functions that are less suitable for GPU computation (little parallelism available). In contrast, the "Others" parts of the remaining functions already use some GPU functions and are suitable for the "Block-cells" strategy. They take 16% of the total execution time, so they are a reasonable part to improve if we can achieve a speedup similar to the linear solving (~7x).

As an extra, I leave here a comparison with the CPU solving method to evaluate the changes:

image

We can see that the time percentage of the linear solver has been reduced from 14.71% on the CPU with the KLU Sparse method to 1.88%. Also, the “NewtonIt” function (which encompasses most of the GPU functions) takes only 18.75% on the GPU, compared with 30.39% on the CPU.

The next image shows the speedup of the GPU Block-cells biconjugate gradient versus the CPU KLU Sparse linear solving:

image

We can see that the speedup increases linearly with the number of cells, ending in a ~27x speedup for 100,000 cells.

To finish, we want to show the overall speedup of running the whole CAMP module with the current GPU implementation versus running it on the CPU.

image

We can see that the overall speedup is not very high (2x, versus the 27x on the linear solving), since the linear solving only takes 14.71% of the total execution time.

cguzman95 commented 4 years ago

Some other graphs showing the speedup of the GPU biconjugate gradient over the original CPU KLU Sparse. In this case, the graphs are more general and show the number of equations instead of the number of cells (calculated as number of species * number of cells).

image

image

image

cguzman95 commented 3 years ago

Some functions like the reduce operation can be slightly optimized by using CUDA samples (check reduce6 and reduce7)
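For reference, a sketch of the warp-shuffle pattern those samples are based on (reduce6/reduce7 in the CUDA samples use cooperative groups; this minimal version assumes full warps and a block size that is a multiple of the warp size):

```cuda
// Warp-level sum with shuffle intrinsics: no shared-memory traffic inside a warp.
__inline__ __device__ double warpReduceSum(double val)
{
  for (int offset = warpSize / 2; offset > 0; offset >>= 1)
    val += __shfl_down_sync(0xffffffff, val, offset);
  return val;
}

// Block-level sum: reduce each warp, stash one partial per warp in shared
// memory, then let the first warp reduce the partials.
__inline__ __device__ double blockReduceSum(double val)
{
  __shared__ double partial[32];               // one slot per warp (max 32 warps)
  int lane = threadIdx.x % warpSize;
  int wid  = threadIdx.x / warpSize;

  val = warpReduceSum(val);
  if (lane == 0) partial[wid] = val;
  __syncthreads();

  val = (threadIdx.x < blockDim.x / warpSize) ? partial[lane] : 0.0;
  if (wid == 0) val = warpReduceSum(val);
  return val;                                   // result valid in thread 0
}
```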