alashworth / test-issue-import


#pragma support (OpenMP specifically) #24

Open alashworth opened 5 years ago

alashworth commented 5 years ago

Issue by bgoodri Monday Dec 09, 2013 at 05:28 GMT Originally opened as https://github.com/stan-dev/stan/issues/439


  1. Amend the parser to keep #pragma and everything to the right of it, instead of discarding it as a comment.
  2. There is no step 2.
alashworth commented 5 years ago

Comment by bgoodri Tuesday Dec 10, 2013 at 16:56 GMT


For some reason, this issue is not showing up in the issues list on GitHub, so mentioning @bob-carpenter to make sure it does not get lost

alashworth commented 5 years ago

Comment by bob-carpenter Tuesday Dec 10, 2013 at 19:35 GMT


I'm responding through GitHub to let you know it is showing up. At:

https://github.com/stan-dev/stan/issues/439

alashworth commented 5 years ago

Comment by bgoodri Sunday Apr 20, 2014 at 23:51 GMT


With g++-4.9, which supports OpenMP 4.0, scheduled for release on Tuesday, it would be nice if Stan could support pragmas. As far as I can tell, all that is needed is

  1. For the parser to interpret #pragma literally rather than as an R-style comment
  2. For the parser to include such lines in the .cpp file
  3. A Makefile option that adds -include omp.h -fopenmp to the compile command

I know how to do 3, so that gets us halfway there.

alashworth commented 5 years ago

Comment by mbrubake Sunday Apr 20, 2014 at 23:56 GMT


Wouldn't the autodiff stack need to be made more threadsafe and efficient before this was a good idea?

alashworth commented 5 years ago

Comment by bgoodri Monday Apr 21, 2014 at 00:10 GMT


You may be right about loops that include sampling statements (that people should vectorize anyway) like

#pragma omp parallel for

for(i in 1:N) y[i] ~ normal(0,1);

but the current generated .cpp files accumulate the statements that change the log-posterior until the end of the model block and evaluate them (serially) with

return lp_accum__.sum();

So, I don't think parallelizing the loop is thread unsafe even in that case. But if it is, then we can say to only parallelize pure assignment loops.

Ben


alashworth commented 5 years ago

Comment by bob-carpenter Monday Apr 21, 2014 at 11:28 GMT


Michael --- you should check this out because it's related to the new auto-diff.

More inline.

On Apr 21, 2014, at 2:10 AM, bgoodri notifications@github.com wrote:

You may be right about loops that include sampling statements (that people should vectorize anyway) like

#pragma omp parallel for

for(i in 1:N) y[i] ~ normal(0,1);

Correct --- these can't be executed in parallel because they modify the shared auto-diff stack.

but the current generated .cpp files accumulate the statements that change the log-posterior until the end of the model block and evaluate them (serially) with

return lp_accum__.sum();

What gets accumulated is the statements' values --- the statements themselves are evaluated eagerly as they are found.

So, I don't think parallelizing the loop is thread unsafe even in that case. But if it is, then we can say to only parallelize pure assignment loops.

When you get to .sum(), it's not changing the auto-diff stack, just doing a bunch of operations on doubles. I'm not sure they'll be easy to parallelize given the indirection through vari, but it's safe.
I'd think the new approach Michael's working on should have better vectorization properties, but it's always hard to tell --- we were wrong about first order being faster for reasons we still don't quite understand.

Also, when we start doing things like Jacobian calculations and Hessians, there are embarrassingly parallel optimization opportunities in the way we're doing things now that won't be so easy the way Michael's reformulating things with a reusable autodiff stack.

Ben


alashworth commented 5 years ago

Comment by betanalpha Monday Apr 21, 2014 at 12:52 GMT


Also, when we start doing things like Jacobian calculations and Hessians, there are embarrassingly parallel optimization opportunities in the way we're doing things now that won't be so easy the way Michael's reformulating things with a reusable autodiff stack.

Actually this won’t be so bad given that much of the stack is read-only so we can still do some embarrassing parallelization (oh, how I hate that name by the way).

alashworth commented 5 years ago

Comment by bob-carpenter Monday Apr 21, 2014 at 13:21 GMT


On Apr 21, 2014, at 2:52 PM, Michael Betancourt notifications@github.com wrote:

Also, when we start doing things like Jacobian calculations and Hessians, there are embarrassingly parallel optimization opportunities in the way we're doing things now that won't be so easy the way Michael's reformulating things with a reusable autodiff stack.

Actually this won’t be so bad given that much of the stack is read-only so we can still do some embarrassing parallelization

That's a good point. Rather than a single shared stack, a new stack could be created in its own thread for each row of the Jacobian. So you'd have one pass to build up the partials, then multiple passes in parallel to do the calculations, each with its own workspace.

This is kind of like what Adept did in creating the adjoint stack only at the end when needed.

(oh, how I hate that name by the way).

Agreed on the name. It's wrong because it's a good thing in practice --- it's just wrong for a paper on parallel processing.

alashworth commented 5 years ago

Comment by syclik Wednesday Nov 30, 2016 at 14:42 GMT


@bgoodri, is there a concrete suggestion you have? Is this something we need to code generate? Is this something that should go down in the math library? Maybe in the distributions?

alashworth commented 5 years ago

Comment by bgoodri Wednesday Nov 30, 2016 at 14:58 GMT


The first was "don't drop comments that start with a #", in which case users could do their own #pragmas. But users would still have to add -include omp.h -fopenmp to get it to work, and could crash their program (which I don't care much about) if they try to do something in parallel that has to be serial. The second was to introduce a pfor (j in 1:J) block that would code generate to

#pragma omp for
for (j in 1:J) {
  // do stuff that refers to j
}

in which case we would at least know to generate a line with #include <omp.h> and possibly we could do the for loop in a safer way. Sebastian was playing around with different #pragmas that would preclude some problems in Stan.

I doubt there is any distribution in Stan Math with accumulated analytical gradients where OpenMP would be worth the overhead. And Bob is suggesting that distributions without accumulated analytical gradients would be problematic for OpenMP. Eigen is doing some stuff like this for operations on giant matrices.

Also, none of the OpenMP stuff works on a Mac with Xcode. You have to install clang or g++ from upstream.

alashworth commented 5 years ago

Comment by syclik Wednesday Nov 30, 2016 at 15:13 GMT


Let's start with the first thing you mentioned: "don't drop comments that start with a #". Can you tell me if I'm right in my interpretation?

Regarding OpenMP, @rtrangucci and I actually got a version of clang++ installed (I forget how now, but it's not Xcode) and showed a little speed boost on a super-large matrix multiplication, but that's neither here nor there.

Should we close this issue and create two issues:

  1. something to do with # as directives and not comments
  2. creating a pfor function in the Stan language?
alashworth commented 5 years ago

Comment by bgoodri Wednesday Nov 30, 2016 at 16:05 GMT


If so, can you create a new issue that clearly states that?

That is literally the OP on this issue, but it is an open question as to whether that is a good idea. The pfor thing could be a separate issue, but we should probably do one or the other rather than both.

alashworth commented 5 years ago

Comment by syclik Wednesday Nov 30, 2016 at 16:27 GMT


The OP wasn't clear enough for me.

If you think we should do one and not both, please help us make that decision. What are the pros and cons of doing one or the other? What's your preference? Who's actually going to implement it and what's that person's preference?

alashworth commented 5 years ago

Comment by bob-carpenter Wednesday Nov 30, 2016 at 18:56 GMT


I suggest we also move this to longer-term issues until there's more of a design as to what we should do.

We don't need to decide who's going to implement something when we create an issue---it just has to be precise enough to be implemented.

For the OpenMP pragma, I don't think there's any way we could parallelize Stan code naively without synchronizing on the expression graph for autodiff. That would presumably require the pieces that are being executed in parallel to not touch the expression graph, which means analytical derivatives.

The case Sebastian was demoing ran multiple ODE solvers in parallel. And I think he had a way to sync the results (maybe another comment-like thing?).