OpenACC / openacc-best-practices-guide

The sources for the OpenACC Programming and Best Practices Guide.
Apache License 2.0

ordered schedule static in OpenACC #23

Open look4pritam opened 3 months ago

look4pritam commented 3 months ago

I have Fortran code that is optimised for OpenMP. I want to find the OpenACC equivalent of the following OpenMP directive: !$omp do ordered schedule(static)

When I searched the internet for "ordered schedule static in OpenACC", I could not find an equivalent of the ordered clause.

I just want to confirm this: if there is no ordered clause, what is the alternative in OpenACC?

dphow commented 3 months ago

Thanks for reporting this. Let me connect with a colleague who works on the spec more to respond to this.

jefflarkin commented 3 months ago

@look4pritam There is no equivalent in OpenACC; to the best of my memory this is the first time it's been requested. Can you please share more details on why you need ordered execution, so that we can either advise you on your code or discuss in the committee whether we should add something like ordered? If you have code we can look at, that'd really help.

look4pritam commented 3 months ago

I have Fortran code for Monte Carlo particle transport. It was initially developed as a single-CPU version, then optimized for multiple CPUs using OpenMP. Now I want to optimize the code for GPUs using OpenACC, since I can use OpenACC with Fortran code.

The OpenMP code uses directives such as '!$omp do ordered schedule(static)'. This is done so that results are predictable and identical between a single CPU and any number of CPUs; the 'ordered' clause is used to get predictable randomness.
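A minimal sketch of the pattern described above (the loop body and the `track_particle` routine are hypothetical stand-ins, not code from the actual application): with `schedule(static)` the iteration-to-thread mapping is fixed, and the `ordered` region forces the random stream to be consumed in iteration order, so the results match the serial run regardless of thread count.

```fortran
! Hypothetical sketch: ordered consumption of a random stream in OpenMP.
!$omp parallel do ordered schedule(static)
do i = 1, nparticles
   call track_particle(i)        ! this part runs in parallel
   !$omp ordered
   call random_number(r)         ! serialized: always drawn in iteration order
   seeds(i) = r
   !$omp end ordered
end do
!$omp end parallel do
```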

I am thinking of replacing the OpenMP directives with OpenACC directives. If you can suggest a better approach, that would help me.

jefflarkin commented 3 months ago

I see. Here's a little background on why ordered is a bit harder for OpenACC to implement. When we designed the programming model, we wanted to be able to support traditional shared-memory parallelism, like legacy OpenMP, but also support other types of parallel devices, the most obvious being GPUs. Code built for GPUs needs to be very dynamic and flexible to really take advantage of the available parallelism. We wanted the same code to be able to scale from a small laptop GPU to a large datacenter GPU, so the programming model forces you to write scalable parallelism.

Historically you couldn't guarantee the order in which threadblocks would run on a GPU, or even whether any two threadblocks ran at the same time or in a particular order. There are now ways to do this, at least on NVIDIA GPUs, so what you're asking for is theoretically something that could be implemented on a GPU, but with tradeoffs that are likely to affect performance. For instance, if you limit the number of threadblocks enough, you can feel fairly confident that they will all run in a predictable order, but you'll probably be running with far less parallelism than the GPU is capable of using.

What I've seen done in other MC codes is to pre-compute random seeds into a buffer and then have the walkers consume those numbers in a predictable way. This may require additional memory to hold this randomness, but probably not a ton. I'll ask around my network and see if I can find anyone with direct experience doing this who can share their experience here.
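A sketch of that two-step idea (assuming hypothetical names — `rands`, `results`, and a pure `track_particle` function — since no real code was shared): the random stream is generated sequentially up front, so it is identical no matter how the parallel loop is later scheduled, and each walker then reads its own pre-computed value by index, so the GPU loop needs no ordering at all.

```fortran
! Hypothetical sketch: pre-computed randomness replacing "ordered".
! Step 1: fill the buffer sequentially, fixing the stream once and for all.
do i = 1, nparticles
   call random_number(rands(i))
end do

! Step 2: consume by index; every iteration is independent, so the loop
! can run with full GPU parallelism and still reproduce the serial result.
!$acc parallel loop copyin(rands) copyout(results)
do i = 1, nparticles
   results(i) = track_particle(i, rands(i))
end do
```

The extra memory is one value (or one seed) per particle, which is usually small next to the particle state itself.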

I do want to tell you this. Early in the lifetime of OpenACC I encountered a lot of people who were disappointed by some missing feature they're used to having from OpenMP (usually something like a critical region, which is inherently parallelism-limiting). They didn't like that they needed to restructure the code to use OpenACC, but when they were done they had a code that was not only able to run well on a GPU, but it actually ran faster than the original on the CPU too. I'm hoping I can find someone with the right expertise to help make this a reality for you too.

look4pritam commented 3 months ago

> What I've seen done in other MC codes is to pre-compute random seeds into a buffer and then have the walkers consume those numbers in a predictable way. This may require additional memory to hold this randomness, but probably not a ton. I'll ask around my network and see if I can find anyone with direct experience doing this who can share their experience here.

@jefflarkin You are right. This is exactly what is done in the OpenMP version: random seeds are pre-computed, and then they are consumed inside the ordered clause, so the job is done in two steps.

If you can give references for doing the same thing in OpenACC, that would help me. @jefflarkin Thank you very much for your effort and time.