compdyn / partmc

Particle-resolved stochastic atmospheric aerosol model
http://lagrange.mechse.illinois.edu/partmc/
GNU General Public License v2.0

Multicells GPU linear solver #139

Open cguzman95 opened 4 years ago

cguzman95 commented 4 years ago

First of all, we should set the context of this issue. We will focus on the linear solver, but these concepts apply to the whole ODE solver.

Our objective in implementing a GPU linear solver is to avoid all synchronization with the CPU (in other words, all data movement between CPU and GPU during the solving process).

This requires us to change the current implementation, where a CPU linear solver function calls multiple kernels on the GPU (each of these kernels calls a basic GPU function defined in our GPU library as a “global” function). In that implementation, the iterative loop runs on the CPU. While launching multiple kernels from the CPU side is not a problem in itself, the loop needs to check the local error on every iteration to decide whether to repeat the loop or whether the method has converged. This error variable should therefore stay on the GPU to avoid communication with the CPU, which means the loop itself should run on the GPU. In conclusion, we need to execute the whole iterative solver function on the GPU (as a “global” or “device” function).

So, we declare the iterative solver function as a “global” function. We also need to change the declaration of the library functions, since they are declared as “global” functions and a “global” function can only be called from the host. We therefore duplicate the library functions, changing the definition from “global” to “device” so they can be called from the device.
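As a rough sketch of what this looks like (the names `cudaDeviceaxpy` and `solveBCG` are illustrative, not the actual CAMP/PartMC code):

```cuda
// Library operation duplicated as a "device" function so it can be called
// from inside a kernel (the original "global" version is kept for host launches).
__device__ void cudaDeviceaxpy(double *y, const double *x, double a, int n)
{
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n) y[i] += a * x[i];
}

// Iterative solver declared as a "global" function: the convergence loop and
// the error variable stay on the GPU, so no copy back to the CPU is needed
// to decide whether to keep iterating.
__global__ void solveBCG(double *x, double *p, int n, double tol, int max_iter)
{
  for (int it = 0; it < max_iter; it++) {
    cudaDeviceaxpy(x, p, 1.0, n);  // device-to-device call, no extra kernel launch
    // ... compute the residual norm into a variable kept in GPU memory ...
    // if (residual_norm < tol) break;  // convergence check done on the device
  }
}
```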

Now we have to deal with a new problem. The iterative function includes multiple functions of type “reduce”, where an array is reduced to a single value. This reduction requires communication between GPU blocks, which is costly and can consume more than 50% of the overall execution time. Some studies have found alternatives to the simplest form of block communication (an atomicAdd). These alternatives can be summarized in the following techniques:

  1. Using the atomicAdd function as the barrier (basic algorithm and GPU tree-based algorithm); a minimal sketch of this approach follows the list. The percentage of synchronization time is reduced by between 5% and 10% for 3 different algorithm types.
  2. Using global memory as the barrier. The algorithm stalls block execution until the global variable is updated with the value expected from the other blocks. The percentage of synchronization time is reduced by up to 20%. The polling on the global variable can be replaced with semaphore layers, achieving up to a 9.3x speedup over an atomicAdd implementation, but at the risk of write errors.
  3. Synchronization with the CPU. This method uses the kernel launch as a global synchronization point. It is the approach recommended by Nvidia in its reduction tutorial, and it also gives up to a 6x speedup in other studies.
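For reference, a minimal sketch of technique 1 (a shared-memory tree reduction per block, finished with one atomicAdd per block; this is the textbook pattern, not the CAMP code):

```cuda
// Each block reduces its slice in shared memory; thread 0 then adds the
// partial sum into a single global accumulator. The atomicAdd is the only
// point of inter-block communication.
// Assumes a power-of-two block size; atomicAdd on double needs sm_60 or newer.
__global__ void reduceSum(const double *in, double *out, int n)
{
  extern __shared__ double sdata[];
  unsigned int tid = threadIdx.x;
  unsigned int i = blockIdx.x * blockDim.x + threadIdx.x;

  sdata[tid] = (i < n) ? in[i] : 0.0;
  __syncthreads();

  // Tree reduction inside the block (no inter-block traffic yet).
  for (unsigned int s = blockDim.x / 2; s > 0; s >>= 1) {
    if (tid < s) sdata[tid] += sdata[tid + s];
    __syncthreads();
  }

  if (tid == 0) atomicAdd(out, sdata[0]);
}

// Launch example:
// reduceSum<<<blocks, threads, threads * sizeof(double)>>>(in, out, n);
```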

Option 3 is the fastest, but it is out of our scope since our objective is to avoid synchronization with the CPU. Option 2 seems the most viable, but the extra time spent in synchronization (up to 30%) would still be significant on the GPU. This encourages us to develop an alternative specific to our type of problem: the multi-cells implementation for chemistry systems.

Our parallel implementation parallelizes over the number of species, assigning one GPU thread to each species. A chemical mechanism is typically composed of far fewer species than the GPU threads available per block (fewer than a hundred species versus 1024 threads per block in every recent CUDA compute capability). Also, each cell computed in the multi-cells case is independent of the others. So, each cell fits within a single block and does not need to interact with the values of other cells. Thus, in theory we do not need to perform any communication between blocks.
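A minimal sketch of that mapping (one block per cell, one thread per species; names are illustrative, not the actual CAMP code):

```cuda
// One block per cell, one thread per species. The state for a cell occupies a
// contiguous slice of the array, so each thread only touches its own entry and
// any per-cell reduction only needs __syncthreads(), never inter-block traffic.
__global__ void cellKernel(double *state, int n_species)
{
  int cell = blockIdx.x;    // each block owns exactly one cell
  int spec = threadIdx.x;   // each thread owns exactly one species
  if (spec < n_species) {
    int i = cell * n_species + spec;
    // ... update state[i] for this cell/species pair ...
  }
}

// Launch: one block per cell, block size = number of species (<= 1024):
// cellKernel<<<n_cells, n_species>>>(state, n_species);
```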

cguzman95 commented 4 years ago

Working on this issue in develop_129_branch, also now studying how to implement the multi-cells idea properly

cguzman95 commented 4 years ago

Multi-cells idea implemented

Taking into account these two factors:

1- Cells are independent from each other.
2- Each cell is composed of fewer species than the maximum thread capacity per block on the GPU.

We can assign each cell to a different GPU block and run the linear solving algorithm within each block. In this way, we avoid communication with other GPU blocks (one of the most costly GPU operations). This concept is equivalent to launching multiple linear solver instances, each with only one cell, with the detail that these instances live in independent blocks on the GPU.

Now that the basic idea is explained, we should add that instead of computing only one cell per block, we will compute a group of cells in each block in order to use the maximum number of GPU threads available.
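A sketch of the index arithmetic when several cells share one block (assuming, for simplicity, that cells are never split across blocks; names are illustrative):

```cuda
// Pack several cells into one block so the block is well occupied even when
// the mechanism has few species. Each cell still lives entirely inside one
// block, so no inter-block communication is introduced.
__global__ void blockCellsKernel(double *state, int n_species, int n_cells)
{
  int cells_per_block = blockDim.x / n_species;            // e.g. 1024 / 76 = 13
  int cell = blockIdx.x * cells_per_block + threadIdx.x / n_species;
  int spec = threadIdx.x % n_species;

  // Threads beyond the last whole cell in the block stay idle.
  if (cell < n_cells && threadIdx.x < cells_per_block * n_species) {
    int i = cell * n_species + spec;
    // ... per-cell work on state[i]; reductions involve only the threads
    //     belonging to this cell (a slice of the block) ...
  }
}
```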

Notice that the linear solving now avoids interactions with cells from other blocks, but the rest of the ODE solving still computes all the cells together. So, at the moment, the current implementation can produce more iterations of the ODE algorithm, since our linear solver no longer takes into account the contributions of all cells. However, the idea for the future is to translate the whole ODE algorithm to this notion of cell independence. This will improve the execution time and will bring the results closer to the original one-cell concept of solving the cells one by one.

cguzman95 commented 4 years ago

Testing the new block-cells implementation

A test was developed with the objective of checking the accuracy error between the original multi-cells version and the block-cells implementation:

How do we define the accuracy error?

The sum of the internal linear solver iterations divided by the number of ODE iterations. It represents how many iterations, on average, the linear solver needs to converge over the total number of calls to the linear solver.
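Written explicitly, the reported value is:

$$\text{average iterations} = \frac{\sum_{\text{calls}} (\text{linear solver iterations})}{\text{number of ODE iterations}}$$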

I configured multiple experiments to check different input cases. In these experiments, apart from testing the multi-cells and block-cells implementations, we also compare two instances of the multi-cells method, in order to:

1- Check that our test is correct.
2- Check whether there are divergence errors by default. The method has no element of randomness (the initial values are set to the same value and no random variables are involved), but some divergence can come from a bad configuration that produces thread overlap (a race condition), where one thread sometimes reaches a variable before another and changes the results. This case can be difficult to detect, and this test will help us with it (a sketch of the comparison itself is shown below).
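As an illustration of what the comparison does (not the actual test code), the check boils down to an element-wise difference between the state vectors produced by two runs:

```cuda
#include <math.h>

// Host-side helper: largest absolute difference between the outputs of two
// runs; a nonzero value indicates divergence between the implementations.
double max_abs_diff(const double *out_a, const double *out_b, int n)
{
  double max_diff = 0.0;
  for (int i = 0; i < n; i++) {
    double d = fabs(out_a[i] - out_b[i]);
    if (d > max_diff) max_diff = d;
  }
  return max_diff;
}
```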

Common test configuration:

Configuration 1:

Results:

Comparing Multi-cells vs Multi-cells:

The accuracy error was zero (results were exactly the same). So our test is correct and there are no divergence errors.

Comparing Multi-cells vs Block-cells:

There seems to be some divergence between the two implementations (a maximum error of e-12 in the output values across all the ODE iterations). This divergence should be produced by the change of definition towards the cell-independence idea (we now communicate only among cells in the same GPU block, instead of across all blocks, so the contributions from other blocks are lost). However, the error is smaller than the convergence error defined in the linear solver (e-10). Also, this error is reduced when the convergence requirement of the linear solver is tightened (e.g., tightening it to e-13 brings the maximum error down to e-15).

Configuration 2:

Results:

Comparing Multi-cells vs Multi-cells:

A small divergence error appears (e-20). Taking into account that from configuration 1 we deduced that our test is correct, there is probably some thread overlap that modifies the results slightly. Notice that the input in this test contains notably more species per cell than mock_monarch_1 (76 in cb05cl_ae_big versus 3 in monarch1), so it is more sensitive to thread overlap.

Comparing Multi-cells vs Block-cells:

The average number of linear solver iterations increases by a factor of ~16 (from 6 to 100). It seems that more complex systems result in more iterations, so the communication with other cells affects the values more in this case.

On the other hand, the error is bigger than in configuration 1 (e-12), but it is still under the configured linear solver error (e-10). So the conclusion is the same: the error is produced by the reduced communication between blocks, but it is under the configured limit.

Configuration 3:

Results:

Comparing Multi-cells vs Multi-cells:

Now the error is increased. This means that the system is more sensitive to thread overlap when there is more divergence in the initial conditions.

Comparing Multi-cells vs Block-cells:

Notice that the first iteration reaches a large error (e+0), but this error is corrected in subsequent iterations, reaching e-11 at the last iteration. So, when the divergence between cells is large, the block-cells method differs a lot from multi-cells, but the rest of the ODE algorithm corrects this error in later iterations, ending in similar results.

We also want to remark that the number of ODE solving iterations increases in this test. So, when the divergence is large, the block-cells method differs notably, especially in the first iteration. But the ODE algorithm does not tolerate this difference, resulting in more ODE iterations that correct this divergence.

Final conclusions

1- There is some thread overlap, which we can classify as a “bug”. We will report this bug to Guillermo (who shared the GPU library with us) in order to find the source of the problem.
2- The block-cells method does not give an error above the configured limit, except when there is a big difference between cells. In any case, this only leads to more ODE solving iterations and ends with an accuracy error under our accepted level, so it can be applied until the rest of the ODE solving is adapted to the cell-independence idea. Moreover, the number of these extra ODE iterations can be reduced with a proper preconditioner for these high cell-divergence cases.

cguzman95 commented 4 years ago

Profiling

Following the configuration mentioned before, I profiled the new and old linear solver methods. I also changed the naming convention a bit:

Results: image

We can see that the speedup is positive for all numbers of cells, but it decreases when using a large number of cells (from 10,000 to 100,000). This happens because for these large numbers of cells the bottleneck is memory rather than communication. Also, in these measurements there is still some synchronization between blocks (more info in #141). All these factors make the achieved speedup less remarkable in these cases.

The next image shows the execution time percentage for the most relevant functions during ODE solving. Description of functions:

image

As we can see, the Deriv function takes a large share of the execution time (34.22% when called during NewtonIt and 10.47% when called from LinSolve). The optimization of the derivative function will be done after merging my branch with the latest changes in the chem_mod branch, since those changes modify the calculation of the Derivative and the Jac.

For the linear solving part, after all our optimizations it takes only 1.88% of the total execution time, so it is no longer a big issue.

Thus, leaving aside the Derivative, Jac, and BiConjGrad, the most interesting part to optimize is the "Others" functions. For the moment we will not consider the "Others" part of the cvStep function, since it breaks down into many functions that are less suitable for GPU computation (little parallelism available). In contrast, the "Others" parts of the remaining functions already use some GPU functions and are suitable for the "Block-cells" strategy. They take 16% of the total execution time, so they are a reasonable part to improve if we can achieve a speedup similar to the linear solving (~7x).

As an extra, I leave here a comparison with the CPU solving method to evaluate the changes:

image

We can see that the time percentage of the linear solver has been reduced from 14.71% on the CPU with the KLU Sparse method to 1.88%. Also, the “NewtonIt” function (which encompasses most of the GPU functions) takes only 18.75% on the GPU, compared with 30.39% on the CPU.

The next image shows the speedup of the GPU Block-cells biconjugate gradient versus the CPU KLU Sparse linear solving:

image

We can see that the speedup increases linearly with the number of cells, ending in a ~27x speedup for 100,000 cells.

To finish, we want to show the overall speedup of running the whole CAMP module with the current GPU implementation versus running it on the CPU.

image

We can see that the overall speedup is not very high (2x, versus the 27x on the linear solving), since the linear solving only takes 14.71% of the total execution time.

cguzman95 commented 4 years ago

Some other graphs showing the speedup of the GPU biconjugate gradient over the original CPU KLU Sparse. In this case, the graphs are more general and show the number of equations instead of the number of cells (calculated as number of species * number of cells).

image

image

image

cguzman95 commented 3 years ago

Some functions like the reduce operation can be slightly optimized by using CUDA samples (check reduce6 and reduce7)
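For reference, a sketch of the warp-shuffle pattern those samples are based on (reduce6/reduce7 in the CUDA samples use cooperative groups; this minimal version assumes full warps and a block size that is a multiple of the warp size):

```cuda
// Warp-level sum with shuffle intrinsics: no shared-memory traffic inside a warp.
__inline__ __device__ double warpReduceSum(double val)
{
  for (int offset = warpSize / 2; offset > 0; offset >>= 1)
    val += __shfl_down_sync(0xffffffff, val, offset);
  return val;
}

// Block-level sum: reduce each warp, stash one partial per warp in shared
// memory, then let the first warp reduce the partials.
__inline__ __device__ double blockReduceSum(double val)
{
  __shared__ double partial[32];               // one slot per warp (max 32 warps)
  int lane = threadIdx.x % warpSize;
  int wid  = threadIdx.x / warpSize;

  val = warpReduceSum(val);
  if (lane == 0) partial[wid] = val;
  __syncthreads();

  val = (threadIdx.x < blockDim.x / warpSize) ? partial[lane] : 0.0;
  if (wid == 0) val = warpReduceSum(val);
  return val;                                   // result valid in thread 0
}
```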