When I tested the current code on an AMD machine I found that sw_dif_and_source is not vectorizing with either the ifort 2021.6.0 or the GNU Fortran 11.3 compiler.
Intel's optimization report says the following:
LOOP BEGIN at ../rte/kernels/mo_rte_solver_kernels.F90(1389,7)
remark #15344: loop was not vectorized: vector dependence prevents vectorization. First dependence is shown below. Use level 5 report for details
remark #15346: vector dependence: assumed ANTI dependence between DIR_FLUX_INC(i) (1465:11) and DIR_FLUX_TRANS(i) (1471:11)
LOOP END
(Ignore the line numbers, since I've added things, but this is for the version of sw_dif_and_source in the main branch.)
Timings confirm sw_dif_and_source is very slow due to this:
This can be fixed by adding !$OMP SIMD before the ncol loop.

This is with Intel. With gfortran, there is another serious issue: the mu0 conditional prevents vectorization. Removing it halved the total runtime of the RFMIP SW program (from 2 seconds to 1 second, now faster than ifort!). Perhaps the code could be changed so that mo_rte_sw checks that mu0 is positive in all the input columns, and otherwise calls the slower version with the mu0 conditional inside the kernels? Or you could even write the functionality into the same (but admittedly more bloated) SW kernel with an argument check_for_positive_mu0 (perhaps some host models have already removed nighttime columns before calling RTE-SW, so they can set it to false).
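To make the two suggestions concrete, here is a minimal sketch of how they could fit together. It is not the RTE source: the module name, the subroutine direct_flux_step, its argument list and the simple exp(-tau/mu0) arithmetic are all hypothetical stand-ins for the real kernel. Only the !$OMP SIMD directive and the check_for_positive_mu0 argument come from the text above. Note that gfortran honors the directive only when OpenMP SIMD support is enabled, for example with -fopenmp-simd.

```fortran
module sw_mu0_dispatch_sketch
  ! Hypothetical stand-in for the real kernel: every name and the arithmetic
  ! below are illustrative assumptions, not code from mo_rte_solver_kernels.
  implicit none
  private
  public :: direct_flux_step

  integer, parameter :: wp = kind(1.0d0)

contains

  ! Attenuate the direct beam across one layer for every column.
  ! Passing check_for_positive_mu0 = .false. is a promise from the caller
  ! that nighttime columns have already been removed.
  subroutine direct_flux_step(ncol, mu0, tau, flux_inc, flux_trans, &
                              check_for_positive_mu0)
    integer,  intent(in)  :: ncol
    real(wp), intent(in)  :: mu0(ncol), tau(ncol), flux_inc(ncol)
    real(wp), intent(out) :: flux_trans(ncol)
    logical,  intent(in)  :: check_for_positive_mu0

    logical :: all_sunlit
    integer :: i

    ! One pass over mu0 up front replaces a test inside the hot loop.
    if (check_for_positive_mu0) then
      all_sunlit = all(mu0 > 0._wp)
    else
      all_sunlit = .true.
    end if

    if (all_sunlit) then
      ! Fast path: branch-free loop body. The directive asserts that the
      ! iterations are independent and can be executed as SIMD lanes.
      !$OMP SIMD
      do i = 1, ncol
        flux_trans(i) = flux_inc(i) * exp(-tau(i) / mu0(i))
      end do
    else
      ! Slow path: keep the per-column mu0 conditional.
      do i = 1, ncol
        if (mu0(i) > 0._wp) then
          flux_trans(i) = flux_inc(i) * exp(-tau(i) / mu0(i))
        else
          flux_trans(i) = 0._wp
        end if
      end do
    end if
  end subroutine direct_flux_step

end module sw_mu0_dispatch_sketch
```

A host model that has already dropped nighttime columns would call direct_flux_step(ncol, mu0, tau, flux_inc, flux_trans, .false.) and always take the fast path. Everyone else passes .true. and pays for one all(mu0 > 0) scan per call. Whether this is preferable to two separate kernels is a design choice for the maintainers.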