SLACKHA / pyJac

Creates C and CUDA analytical Jacobians for chemical kinetics ODE systems
http://slackha.github.io/pyJac/
MIT License

Question regarding OpenMP parallelization of Jacobian #20

Closed BarrySmith closed 5 years ago

BarrySmith commented 7 years ago

I ran pyJac on a mechanism with 52 species: http://combustion.berkeley.edu/gri_mech/version30/files30/grimech30.dat

and looked at the generated code. The only mention of OpenMP in the generated code is in header.h:

    // OpenMP
    #ifdef _OPENMP
     #include <omp.h>
    #else
     #define omp_get_max_threads() 1
     #define omp_get_num_threads() 1
    #endif

so it seems the generated code won't use OpenMP. There are no OpenMP pragmas in the code, nor any OpenMP function calls.

Based on its documentation, I thought pyJac would generate OpenMP-parallel code. Am I missing something? Is the mechanism too small?

Thanks

skyreflectedinmirrors commented 7 years ago

Hi Barry, The code in header.h is there simply to define dummy macros to make life easier during compilation if the user chooses not to use OpenMP, or it isn't present on the system, etc.
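For instance, a minimal sketch (not from the generated code) that compiles and runs the same way with or without -fopenmp:

    #include <stdio.h>
    #include "header.h"  // real omp_get_max_threads() with -fopenmp, dummy macro without

    int main(void)
    {
        // Reports the actual thread count when built with -fopenmp;
        // otherwise the dummy macro in header.h makes this 1.
        printf("max threads: %d\n", omp_get_max_threads());
        return 0;
    }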

OpenMP enters into pyJac in two places:

  1. During compilation: passing the flag '-fopenmp' to the compiler during compilation/linking pulls the OpenMP headers and runtime library into the resulting executable.
  2. OpenMP can be used to compute the values of Jacobians for multiple sets of thermochemical conditions in parallel; for an example, see pyjac/performance_tester/tester.c.in:
        #pragma omp parallel for
        for(int tid = 0; tid < num_odes; ++tid)
        {
            double jac[NSP * NSP] = {0};
            eval_jacob(0, var_host[tid], &y_host[tid * NN], jac);
            ...
        }

This file also has some functionality for setting the number of threads to use, etc.
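Setting the thread count itself is just standard OpenMP, along these lines (a sketch, not the exact tester code):

    #include <omp.h>

    void use_n_threads(int n)
    {
        // Standard OpenMP call: subsequent parallel regions
        // (like the parallel for above) will use n threads.
        omp_set_num_threads(n);
    }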

pyJac currently doesn't support (but will in the next version) parallel evaluation of a Jacobian for a single initial condition, that is, using multiple threads to evaluate a single Jacobian in parallel. Additionally, the Python interface is currently a simple wrapper around the bare C code and doesn't support OpenMP out of the box (it evaluates a single Jacobian for a single set of thermochemical conditions), although it would be relatively easy to extend it to do so.

BarrySmith commented 7 years ago

Thank you for the quick response and information.

BarrySmith commented 7 years ago

Reopening the issue to ask more questions. We are interested in running pretty big reaction networks, and when we do, about 50% of the total runtime is spent computing the Jacobian. This is why we asked about OpenMP.

How difficult do you think it would be to generate Jacobian code with places to "drop in" OpenMP pragmas? The problem with the currently generated code is that it jumps all over the place in terms of computing Jacobian entries, so there seems to be no easy way to use it with OpenMP. If, for example, you generated the Jacobian entries a row (or several rows) at a time, it might be possible to use it with OpenMP. We realize that this might result in some redundant computations, but if the redundancy is not too high, one might still get a good speedup from the parallelism.
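To illustrate, something along these lines is what we have in mind (eval_jacob_row is hypothetical; pyJac does not generate such a function, and NSP is the species count from the generated headers):

    // Hypothetical sketch: a per-row evaluator would let the rows
    // be farmed out to threads, at the cost of some redundant
    // evaluation of shared subexpressions in each row.
    void eval_jacob_by_rows(double t, double pres,
                            const double *y, double *jac)
    {
        #pragma omp parallel for
        for (int row = 0; row < NSP; ++row)
            eval_jacob_row(row, t, pres, y, &jac[row * NSP]);
    }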

We'd be eager testers and users of such a thing but unfortunately we don't have the expertise or time to directly help you with implementing a new generator.

skyreflectedinmirrors commented 7 years ago

Hi Barry, If I understand what you're suggesting -- that is, parallel evaluation of entries inside a Jacobian for the same set of thermochemical conditions -- this isn't currently possible in pyJac; we only expose parallelism in evaluating the Jacobian across many different sets of thermochemical conditions, either via OpenMP or on the GPU via CUDA. The reason is that the code pyJac spits out is fully unrolled (no loops, and all thermo parameters, etc. are explicitly in the code). We would have to figure out some way to undo all that, which would be tough.
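To give a flavor of the unrolled output (made-up numbers, not an actual mechanism's entries):

    // Illustrative only: no loops; every Jacobian entry is its own
    // statement with the thermo coefficients baked in as literals.
    void eval_jacob_unrolled_sketch(const double *y, double *jac)
    {
        jac[0] = -1.23e+04 * y[1] + 5.67e+02 * y[2];
        jac[1] =  9.87e+03 * y[0] * y[3];
        // ...thousands more explicit statements...
    }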

However, it's not all bad news. We are currently working on a new version of pyJac which does support this form of parallelization (as well as adding vectorized evaluation via OpenCL, a much sparser Jacobian via a change of variables in the system, and other nice features). We are close to our first alpha release of this new version. If you want to join our user group, I'll be sure to ping you when it's available, and work with you if you encounter any problems, have any feature requests, etc.

kyleniemeyer commented 7 years ago

Hi @BarrySmith, thanks for your interest—your use case is definitely something we're interested in. I'll second @arghdos's suggestion of joining the User's Group for the project: https://groups.io/g/slackha-users

The email frequency is nearly zero (I don't think we've yet sent a single email).

BarrySmith commented 7 years ago

Thanks for the response. We'll definitely join the mailing list and eagerly await trying out the new features.

Barry

Shashi, the list is at https://groups.io/g/slackha-users


kahilah commented 7 years ago

Hey,

I was also wondering about this OpenMP possibility in the pyJac context some time ago. I hand-tuned the reaction rate computations to include "sections" pragmas, but at least for rather small mechanisms (300 reactions or so) I was not able to get any real speed-up. On the other hand, it was my first quick learning session with OpenMP, so I kept things simple and probably went off track somewhere along the way.

I'm really, really interested to see the new pyJac version and how things work in there.

-Heikki

skyreflectedinmirrors commented 7 years ago

@kahilah interesting, I didn't know about the #sections pragma. That would definitely be the easiest way to implement parallelism inside the Jacobian evaluation for the current version.

I think you could fairly easily drop a #pragma omp parallel sections around the species derivative evaluations (in eval_jacob_X, before the //Finish dYk / Yj's), and another around the //And dT/dYj's methods. You would probably have to put #pragma omp atomic in front of the jac updates, though, to ensure correctness (and maybe a #pragma omp flush before the //Complete dT wrt T calculations to synchronize the memory).
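Roughly like this (a structural sketch only; the indices and terms are placeholders, not the generated code):

    // Structural sketch of the above suggestion.
    void eval_jacob_sections_sketch(double *jac, double a, double b)
    {
        #pragma omp parallel sections
        {
            #pragma omp section
            {
                // one group of species-derivative entries
                #pragma omp atomic
                jac[5] += a;  // placeholder update
            }
            #pragma omp section
            {
                // another independent group
                #pragma omp atomic
                jac[9] += b;  // placeholder update
            }
        }
        // implicit barrier (and flush) at the end of the parallel
        // region, so the later dT terms see all the updates
    }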

Interesting that you didn't see any speedup. In my opinion, it's likely that the reaction rate computations are fairly cheap compared to the overall Jacobian evaluation, so if you were timing that, any speedup may have been hidden.

kahilah commented 7 years ago

The way I made my test was that I added #pragma omp parallel sections to rxn_rates.c and enveloped all the reactions within two "sections", which I forced to run with two threads. The result was that N evaluations of dydt() were around 0-5% faster with OpenMP. If I increased to four threads with four sections, the runtime was the same or slightly longer than without OpenMP. I concluded that the overhead of thread creation per rxn_rates call is too large to see any better speed-up. However, like I said in my previous post, this is maybe not the whole truth and should be verified. I could look at my test codes again at some point.
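The structure was roughly this (a sketch; the bodies are placeholders for the generated per-reaction statements):

    #include <math.h>

    // Sketch of the hand modification: the generated per-reaction
    // statements in rxn_rates.c were split across two sections.
    void eval_rxn_rates_sketch(double T, double *fwd_rates)
    {
        #pragma omp parallel sections num_threads(2)
        {
            #pragma omp section
            {
                // first half of the reactions (placeholder)
                fwd_rates[0] = 1.0e13 * exp(-1.0e4 / T);
            }
            #pragma omp section
            {
                // second half of the reactions (placeholder)
                fwd_rates[1] = 2.0e13 * exp(-2.0e4 / T);
            }
        }
    }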

I think this sections pragma would work much better in the Jacobian computations, where the relative overhead is smaller. At that time I was (and still am) just interested in speeding up the dydt calls.

-Heikki