compdyn / partmc

Particle-resolved stochastic atmospheric aerosol model
http://lagrange.mechse.illinois.edu/partmc/
GNU General Public License v2.0

Discussion: merge gpu and cpu code? #129

Open cguzman95 opened 4 years ago

cguzman95 commented 4 years ago

Hi @mattldawson

I was wondering about the extra work of modifying the GPU code each time we modify something in the reactions code. The CUDA code for the reactions is mostly a copy of the CPU code (except for the function declaration and the atomicAdd used to save the results in deriv/Jac). So each time we modify the CPU code we must copy the changes into the CUDA code, taking care not to delete the GPU implementations.

So, I found a possible solution. We can take advantage of the PMC_USE_GPU flag to choose whether to compile the CPU code or the GPU code, since we will only ever use one of the two. To illustrate the idea, the code would look something like this:

#ifdef PMC_USE_GPU
__device__
#endif
void rxn_aqueous_equilibrium_calc_deriv_contrib(...

...

#ifdef PMC_USE_GPU
  atomicAdd(deriv, ...)
#else
  deriv += ...
#endif
...
}

I already tested it and it works, so the only thing left is to discuss it. I like the idea if we will be making frequent changes to the reactions. It is also a good way to explain to users how to add a reaction, since it won't be as "strange" as telling a user that if he wants to compute his new reaction on a GPU, he needs to copy/paste the code.

However, I understand that this can be a bit "ugly", may clutter the code a bit, and I also like having all the CUDA code in its own folder. So, what do you think?

mattldawson commented 4 years ago

I think this is a great idea. Reducing duplicate code is always good, and I think this isn't "ugly" at all really - seems like the best possible implementation. How/when would you like to implement this? I'm still working on the MONARCH implementation (which is moving pretty well). After that, I can help if needed.

cguzman95 commented 4 years ago

I can start on Monday. I will first push the pull request for #126 and then continue with this, since it sounds good to you too.

cguzman95 commented 4 years ago

Note: conflict between aero_rep_get_number_conc and aero_phase_get_volume in the HL & SIMPOL reaction types: both reactions call the function aero_rep_get_number_conc in calc_deriv, which in turn calls aero_phase_get_volume. That function is also called from CPU functions other than calc_deriv, like aero_rep_modal_binned_mass_update_state, so to compile the calc_deriv of HL & SIMPOL on the GPU we first need to break this dependence (or wait until we discuss reordering aero_rep and the other reactions and restructuring the data).

cguzman95 commented 4 years ago

Note: realtype is not detected as a type during GPU compilation, raising an error. Is it okay if I replace realtype with double, at least for the moment?

mattldawson commented 4 years ago

Hi @cguzman95 - I think it would be better if we found solutions to these problems now. I don't want to start introducing a lot of patches and workarounds into the code just to get something to run, even if it takes a little longer in the design phase. For aero_phase_get_volume, is the problem that you need to add the __device__ flag to access the function on the GPU, but that somehow makes it unavailable for calls from the CPU?

mattldawson commented 4 years ago

For realtype, we should discuss a plan to handle precision that works with both CVODE and the GPUs. Do you know what the specific problem is with using realtype on the GPUs?

cguzman95 commented 4 years ago

About the first question: yes, if a function is declared with __device__ it can't be called from the CPU (and there is a case where a function is called from both the CPU and the GPU). I can propose a solution better than simply not computing these reactions on the GPU (I first need to check the code more carefully, but it shouldn't be a problem).

About the second question: I will investigate that further and look for solutions. If I remember well from when I was looking into this, the main problem was the atomicAdd function (it needs to be passed a known type like double or float, but I think there is a more "general" approach than atomicAdd that works for all types... not sure now).

mattldawson commented 4 years ago

Hi @cguzman95 - that sounds great. For realtype, I'm fine with using double, but let's come up with a good plan for where in the code to switch from realtype to double. CVODE uses realtype and can be configured to use float instead of double, and in the future we may use something other than CVODE for the solver, so I'm thinking the switch to double could be somewhere close to the interface with CVODE. What do you think? We could also think about whether and how we want to be able to use different precision in our code (I don't know what the best way to implement this would be).

mattldawson commented 4 years ago

For the __device__ function, that's great if you can propose something. Worst case, we can make two functions, but maybe there's something we could do at compile time to use the same code for both? Or maybe you'll find a better solution.

cguzman95 commented 4 years ago

For the realtype - yeah, that could be nice; however, it could wait until the distant future, so I don't mind doing everything as double and refactoring to realtype later. But if you want to deal with this kind of problem first, I won't disagree.

For the second: now that I've googled a bit, it seems there are nice solutions: https://codeyarns.com/2011/03/14/cuda-common-function-for-both-host-and-device-code/

mattldawson commented 4 years ago

That looks like a perfect solution. For the realtype that's fine to use double for now, but let's figure out an overall plan before we merge this branch in.

cguzman95 commented 4 years ago

Yep, I'm first making the changes to the reactions in the CUDA folder, and after everything works (including taking a look at realtype), I will translate the changes to the main reactions (the rxns folder).

Related to this, can you help me with compiling the main reactions to accept CUDA code like __host__ __device__? In principle, they should have a .cpp or .cu extension to accept CUDA code (if .cpp, then it seems we should add -x cu to the CUDA flags).

mattldawson commented 4 years ago

Sure, this seems like a CMake configuration question, right? I'm not an expert, but I can see if I can find something that will work. I'm OK with changing the extension, as long as it doesn't affect the non-GPU build process. Is it OK to work on your branch today? (I have a little time because I'm waiting on MONARCH runs, and the queue times are really long.)

cguzman95 commented 4 years ago

My last branch on GitHub was for #126, which is finished with a pull request. It would be better to create another branch. It should only require changing the extension to .cpp and making some function like __device__ hello_world() work (or at least making __device__ functions compile correctly in .cpp files).

cguzman95 commented 4 years ago

Another thing: while doing this, I've noticed that in the future we should break model_data up into small structs. Explanation:

In the CPU code of, for example, rxn_arrhenius_calc_deriv_contrib, we access the current pointer to the state with:

double *state = model_data->grid_cell_state;

What happens if we want to translate this to the GPU? Since each thread must have a different pointer to grid_cell_state, each must also have a different model_data object (otherwise they will all point to the same place). So, basically, we are duplicating all of model_data for each thread even though we only use grid_cell_state, grid_cell_env, and n_rxn (the last one is for a GPU optimization).

It shouldn't be a problem at the moment, but it will be when we want to squeeze the GPU. It would be nice to discuss the design of these small structs when the time comes.

mattldawson commented 4 years ago

Yeah, sounds like a good plan - seems like this could be related to #99. When would you need this by? Could this be part of the move to c++, or sooner?

cguzman95 commented 4 years ago

I first need you to check pull request #126, or for us to fix #125, to test multi-cells and the GPU correctly. If, when testing the Arrhenius reaction, I see that this implementation with model_data takes significantly more time than the one without it, we can discuss it soon. But I believe that won't be the case, so this will probably be part of the C++ development.

cguzman95 commented 4 years ago

Pending: move the dependencies between the Fortran and rxn.c files (e.g. the call to rxn_update_data is made from Fortran instead of from camp_solver.c). These dependencies prevent translating the code from .c to .cpp files (since Fortran would then call a .cpp file instead of a .c file, which requires different treatment).

This issue will also wait for the finalization of #128, since both issues conflict in the Jacobian code.