ascot4fusion / ascot5

ASCOT5 is a high-performance orbit-following code for fusion plasma physics and engineering
https://ascot4fusion.github.io/ascot5/
GNU Lesser General Public License v3.0
31 stars 9 forks

Feature/15 gpu event based staged for pr #124

Closed miekkasarki closed 1 month ago

miekkasarki commented 2 months ago

Merging the GPU-ported code. The physics that were ported are orbit-following and Coulomb collisions in GO mode, plus the distribution diagnostics.

@peyberne I made a separate branch from feature/15-gpu-event-based which is now under review. I'll make my modifications here where necessary, and then we can go through those together before merging. I'll let you know once I'm finished.

miekkasarki commented 2 months ago

@peyberne A question about these two pragmas that declare the functions on target:

#pragma omp declare target
...
#pragma omp end declare target

These have been (mostly?) removed but in some cases the first one is replaced with the custom pragma DECLARE_TARGET. There's also the corresponding end pragma DECLARE_TARGET_END but I don't see it used anywhere.

Can you specify whether the original pragmas should be removed completely or should they be replaced with the custom pragmas? The other pragma, DECLARE_TARGET_SIMD, is being used consistently.

There is also some older code, I think, where #ifndef GPU is used instead of the custom pragmas. I'll update those.

peyberne commented 2 months ago

I replaced some #pragma omp declare target by DECLARE_TARGET when the corresponding routines are used in GPU regions; the macro expands to either OpenMP or OpenACC pragmas. For those routines I also put DECLARE_TARGET_END, for instance in src/plasma.h.

peyberne commented 2 months ago

#ifndef GPU
DECLARE_TARGET_SIMD
#else
DECLARE_TARGET
#endif

In this case we can replace these lines with a single DECLARE_TARGET_SIMD and modify the corresponding OpenACC pragma in offload_acc_omp.h: #define DECLARE_TARGET_SIMD MY_PRAGMA(acc routine seq)

peyberne commented 2 months ago

I can modify the offload_acc_omp.h file in this branch if you agree?

miekkasarki commented 2 months ago

Yes, this is a clean solution. Please go ahead.

miekkasarki commented 2 months ago

A few more changes that I plan to make unless there's a reason not to:

  1. Moving n_queue_size into the particle_simd_fo struct, since the two are always used together, so there is no need to pass it as an additional parameter to functions.
  2. Removing #include <immintrin.h>, or is it needed by nvc? I think that even on the main CPU branch it was legacy code, and I actually thought I had already removed it at some point.
  3. Moving all "not implemented for GPUs yet -> abort" checks to a single spot in simulate.c, since we can catch those immediately after the options and all inputs are read.
  4. Adding documentation for the custom pragmas in offload_acc_omp.h.
  5. Enabling the monitor thread in simulate.c when GPU=0.
  6. Reverting the changes in src/simulate/mccc/mccc_wiener.c and src/simulate/mccc/mccc_wiener.h, unless doing so prevents the code from compiling. That module is used only in guiding-center mode, which hasn't been ported yet, so it's best not to mix it into this PR.
  7. Combining the two Makefiles.

From what I've seen so far, there's no need for any major changes. The only thing I haven't properly looked into yet is the actual offloading. I'll go through everything else first and then start thinking about what to do with it. It could make sense to merge as is, and then have another PR for removing the old infrastructure.

(Btw I really like the color coded error messages in the log. That hasn't even crossed my mind but I will steal that idea 😉)

peyberne commented 2 months ago

OK, with the new offload_acc_omp.h we can proceed as follows:

peyberne commented 2 months ago

* For routines where `#pragma omp declare simd (uniform)` is used but not declared for GPU, no changes.

* For routines where `#pragma omp declare simd (uniform)` is used and declared for GPU, typically:
#ifndef GPU
DECLARE_TARGET_SIMD_UNIFORM(pls_data)
#else
DECLARE_TARGET
#endif

we can replace it with:

GPU_DECLARE_TARGET_SIMD[_UNIFORM(x)]

I can do it if you agree?

Actually, I already started the modifications :-), tell me when I can push them!

miekkasarki commented 2 months ago

Sure, please go ahead.

How about the "bare" acc pragmas which are not behind ACC=1 guards? For example, in src/simulate/step/step_fo_vpa.c we have #pragma acc data present(h[0:n_queue_size]) where #pragma omp simd used to be. Does that still work with OpenMP nonetheless?

peyberne commented 2 months ago

  1. #include <immintrin.h> is not needed by nvc.

  2. For this point, note that omp nested is deprecated (it is used in ascot5_main.c).

peyberne commented 2 months ago

In endcond.c file you removed the ifndef GPU in (lines 220-229) :

#ifndef GPU
            /* Check if the time spent simulating this marker exceeds the
             * given limit*/
            if(active_cpumax) {
                if(p_f->cputime[i] > sim->endcond_max_cputime) {
                    p_f->endcond[i] |= endcond_cpumax;
                    p_f->running[i] = 0;
                }
            }
#endif

I put this #ifndef GPU there because the main loop in simulate_fo_fixed.c takes a lot of time on the GPU, since it computes all the markers on the GPU (instead of NSIMD markers at a time on the CPU). If you remove this condition, the GPU version stops before the end of the simulation. Maybe there is another way to avoid this?

miekkasarki commented 2 months ago

I guess we gave you a test case which has this end condition enabled... It can be disabled in the simulation options. If you install the Python package "a5py" that comes with the source code, you can disable this end condition with the command-line tool like this: a5editoptions ascot.h5 -> set ENDCOND_CPUTIMELIM=0 -> Save and exit (type anything when prompted for a description of the data). Let me know if this works, or I can disable this end condition for you if that's more convenient.

In actual simulations, this end condition is only used to ensure that the simulation finishes in a time that was allocated by SLURM.

miekkasarki commented 2 months ago

@peyberne I'm now content with how the code looks. As we discussed in the meeting, I'd still like to distribute the contents of copytogpu.c to the specific input modules (making it easier to maintain when e.g. the structs are modified). I was able to separate marker offloading from the rest, but I wasn't able to, for example, move this:

GPU_MAP_TO_DEVICE(
    sim->wall_data.w2d.wall_r[0:sim->wall_data.w2d.n],sim->wall_data.w2d.wall_z[0:sim->wall_data.w2d.n] )

to the corresponding init in wall_2d.c:

void wall_2d_init(wall_2d_data* w, wall_2d_offload_data* offload_data,
                  real* offload_array) {
    w->n = offload_data->n;
    w->wall_r = &offload_array[0];
    w->wall_z = &offload_array[offload_data->n];

    GPU_MAP_TO_DEVICE(w->wall_r[0:w->n],w->wall_z[0:w->n])
}

as is my plan. However, as I said, I'm content with how the code is right now, and I'm postponing this relocation until I've first removed the redundant old offloading infrastructure (in a separate PR), which might otherwise interfere.

So for the next steps I propose that:

  1. Please check whether you agree with my commits.
  2. You'll check that I didn't break anything 😅 and the performance is still good. At least my tests are showing that the code still runs on GPUs with ACC=1, but I didn't check OpenMP (GPU/CPU).
  3. If there were no issues, I think it'd be a good idea to rebase the feature/15-gpu-event-based branch to this one.
  4. Meanwhile I'll make release 5.5 that is long due and which also fixes the testing pipeline.
  5. Once we are both done, I'll rebase this to 5.5.
  6. Again we test that the performance and the physics are still ok.
  7. I'll make a release 5.6 from this branch(!)
  8. Rebranch or rebase feature/15-gpu-event-based to the main to continue GPU development

For point 0, it's fine with me. For point 1, I checked with OpenACC and the results are OK. We still have to check with OpenMP offload (on Intel).

peyberne commented 2 months ago

Fine, it works!

miekkasarki commented 2 months ago

Release 5.5.5 is out and I've rebased this branch onto 5.5.5 in my local copy, so I'm ready to force-push whenever. Just let me know when you're ready.

peyberne commented 2 months ago

OK, as soon as we have access to the Intel GPU, we'll test OpenMP offload.

peyberne commented 1 month ago

The test on the Intel Ponte Vecchio GPU with OpenMP offload works well. This PR is OK for us, @miekkasarki!

miekkasarki commented 1 month ago

I've rebased feature/15-gpu-event-based onto feature/15-gpu-event-based-staged-for-pr (at the commit where all tests passed) and rebased feature/15-gpu-event-based-staged-for-pr onto current main.

This latest version compiles but the GPU utilization is very low so something broke during the rebase. I'll try to hunt it down.

miekkasarki commented 1 month ago

These are the offending lines in simulate_fo_fixed.c. Removing them brings the GPU utilization back to the expected level. Now what's wrong here...

random_normal_simd(&sim->random_data, 3*p.n_mrk, rnd);
mccc_fo_euler(p_ptr, hin, &sim->plasma_data, &sim->mccc_data, rnd);

miekkasarki commented 1 month ago

The problem was on my end (wrong compiler flags). Nothing wrong with the code.

I fixed the remaining issues. If the tests pass, I've got no objection to making the release. @peyberne, can you run the benchmarks for this branch? Then we are ready.

peyberne commented 1 month ago

I performed the tests; all is OK on my side!