cguzman95 opened 4 years ago
I think this is a great idea. Reducing duplicate code is always good, and I think this isn't "ugly" at all really - seems like the best possible implementation. How/when would you like to implement this? I'm still working on the MONARCH implementation (which is moving pretty well). After that, I can help if needed.
I can start on Monday. I will first push the pull request for #126, and then continue with this, since it sounds good to you too.
Note: there is a conflict between `aero_rep_get_number_conc` and `aero_phase_get_volume` for the HL & SIMPOL reaction types. Both reaction types call `aero_rep_get_number_conc` in `calc_deriv`, which in turn calls `aero_phase_get_volume`. This last function is also called from CPU functions other than `calc_deriv`, like `aero_rep_modal_binned_mass_update_state`, so to compile the HL & SIMPOL `calc_deriv` on the GPU we first need to break this dependency (or wait until we discuss reordering `aero_rep` and the other reactions and restructuring the data).
Note: `realtype` is not detected as a type during GPU compilation, raising an error. Is it okay if I replace `realtype` with `double`, at least for the moment?
Hi @cguzman95 - I think it would be better if we found solutions to these problems now. I don't want to start introducing a lot of patches and work-arounds into the code just to get something to run, even if it takes a little longer in the design phase. For `aero_phase_get_volume`, is the problem that you need to add the `__device__` qualifier to access the function on the GPUs, but that somehow makes it unavailable for calls from the CPU?

For `realtype`, we should discuss a plan for handling precision that works with both CVODE and the GPUs. Do you know what the specific problem is with using `realtype` on the GPUs?
About the first question: yes, if a function is declared with `__device__` it can't be called from the CPU (and there is a case where a function is called from both the CPU and the GPU). I can propose a solution better than just not computing these reactions on the GPU (I first need to check the code more carefully, but it shouldn't be a problem).

About the second question: I will investigate that further and look for solutions. If I remember correctly from when I was looking into this, the main problem was the `atomicAdd` function (it needs a known type like `double` or `float`, but I think there exists a more "general" function than `atomicAdd` that allows all types... not sure now).
Hi @cguzman95 - that sounds great. For `realtype`, I'm fine with using `double`, but let's come up with a good plan for where in the code to switch from `realtype` to `double`. CVODE uses `realtype` and can be configured to use `float` instead of `double`, and in the future we may use something other than CVODE for the solver, so I'm thinking the switch to `double` could happen somewhere close to the interface with CVODE. What do you think? We could also think about if and how we want to support different precisions in our own code (I don't know what the best way to implement this would be).
For the `__device__` function, that's great if you can propose something. Worst case, we can make two functions, but maybe there's something we could do at compile time to use the same code for both? Or maybe you'll find a better solution.
For `realtype` - yeah, that could be nice; however, it would be for the distant future, so I don't mind doing everything as `double` and later refactoring to `realtype`. But if you want to deal with this kind of problem first, I won't disagree.
For the second: now that I've googled a bit, it seems there are nice solutions: https://codeyarns.com/2011/03/14/cuda-common-function-for-both-host-and-device-code/
That looks like a perfect solution. For `realtype`, it's fine to use `double` for now, but let's figure out an overall plan before we merge this branch in.
Yep, I'm making the changes to the reactions in the CUDA folder first, and after everything is working (including taking a look at `realtype`), I will translate the changes to the main reactions (the rxns folder).
Related to this, can you help me with the compilation of the main reactions so they accept CUDA code like `__host__ __device__`? In principle, they should have a .cpp or .cu extension to accept CUDA code (if .cpp, it seems we should add `-x cu` to the CUDA flags).
sure, this seems like a CMake configuration question, right? I'm not an expert, but I can see if I can find something that will work. I'm ok with changing the extension, as long as it doesn't affect the non-GPU build process. Is it ok to work on your branch today? (I have a little time because I'm waiting on MONARCH runs, and the queue times are really long).
My last branch on GitHub was #126, which is finished, with a pull request open. It would be better to create another branch. It should only involve changing the extension to .cpp and making some function like `__device__ hello_world()` work (or at least compile correctly in a .cpp file with `__device__`).
Another thing: while doing this I've noticed that in the future we should split `model_data` into small structs. Explanation:
In the CPU code of, for example, `rxn_arrhenius_calc_deriv_contrib`, we access the current pointer to `state` with:

```c
double *state = model_data->grid_cell_state;
```

What happens if we want to translate this to the GPU? Since each thread must have a different pointer to `grid_cell_state`, each thread must also have a different `model_data` object (otherwise they would all point to the same place). So we are basically duplicating all of `model_data` for each thread, even though we only use `grid_cell_state`, `grid_cell_env`, and `n_rxn` (the last one is for a GPU optimization).
It shouldn't be a problem at the moment, but it will be when we want to squeeze the GPU. It would be nice to discuss the design of these small structs when the time comes.
Yeah, sounds like a good plan - seems like this could be related to #99. When would you need this by? Could this be part of the move to c++, or sooner?
First I need you to review pull request #126, or for us to fix #125, so I can test multi-cells and GPU correctly. If, when testing Arrhenius, I see that this implementation with `model_data` takes significantly more time than the one without it, we can discuss it soon. But I believe it won't turn out that way, so this will probably be part of the C++ development.
Pending: move the dependencies between Fortran and the rxn.c file (e.g., the call to `rxn_update_data` is done in Fortran instead of from `camp_solver.c`). These dependencies prevent translating the code from .c to .cpp files (since Fortran would then call a .cpp file instead of a .c file, which requires different treatment).
This issue will also wait for the finalization of #128, since both issues conflict in the Jacobian code.
Hi @mattldawson
I was wondering about the extra work of modifying the GPU code each time we modify something in the reactions code. The CUDA code for the reactions is mostly a copy of the CPU code (except for the function declaration and the `atomicAdd` used to save the results in the deriv/Jac). So each time we modify the CPU code we must copy the changes over to CUDA, taking care not to delete the GPU-specific parts.
So I found a possible solution. We can take advantage of the `PMC_USE_GPU` flag to choose whether to compile the CPU code or the GPU code, since we will only ever use one of the two. To illustrate the idea, the code would be something like this:
I already tested it and it works, so it only remains to discuss this. I like the idea if we will be changing the reactions frequently. It's also better for explaining to users how to add a reaction, since it won't be as "strange" as telling a user that if he wants to compute his new reaction on a GPU, he needs to copy/paste the code.
However, I understand that this can be a bit "ugly", it may mess up the code a bit, and I also like having all the CUDA code in its own folder. So, what do you think?