RTE shortwave kernel not vectorizing

Hello,

When I tested the current code on an AMD machine I found that sw_dif_and_sourceis not vectorizing with either ifort 2021.6.0 or GNU Fortran 11.3 compilers.

Intel's optimization report says the following:

   LOOP BEGIN at ../rte/kernels/mo_rte_solver_kernels.F90(1389,7)
      remark #15344: loop was not vectorized: vector dependence prevents vectorization. First dependence is shown below. Use level 5 report for details
      remark #15346: vector dependence: assumed ANTI dependence between DIR_FLUX_INC(i) (1465:11) and DIR_FLUX_TRANS(i) (1471:11)
   LOOP END

(Ignore line numbers, I've added things - but this is for the version of sw_dif_and_source in the main branch).

Timings confirm sw_dif_and_source is very slow due to this:

                               Called  Recurse     Wall      max      min %_of_clea
  clear_sky_total (SW)           1     -       1.946    1.946    1.946   100.00
    gas_optics (SW)            100     -       0.773 8.83e-03 7.65e-03    39.70
    rte_sw                     100     -       1.173    0.017    0.012    60.26
      sw_2stream               100     -       1.170    0.017    0.012    60.12
        sw_dif_and_source    11200     -       1.109 2.79e-03 9.20e-05    56.97
        sw_adding            11200     -       0.043 8.00e-06 3.00e-06     2.19

This can be fixed by adding !$OMP SIMD before the ncol loop:

   LOOP BEGIN at ../rte/kernels/mo_rte_solver_kernels.F90(1389,7)
      remark #15301: OpenMP SIMD LOOP WAS VECTORIZED
      remark #15442: entire loop may be executed in remainder
      remark #15448: unmasked aligned unit stride loads: 1 
      remark #15449: unmasked aligned unit stride stores: 1 
      remark #15450: unmasked unaligned unit stride loads: 6 
      remark #15451: unmasked unaligned unit stride stores: 4 
      remark #15456: masked unaligned unit stride loads: 3 
      remark #15457: masked unaligned unit stride stores: 3 
      remark #15475: --- begin vector cost summary ---
      remark #15476: scalar cost: 470 
      remark #15477: vector cost: 249.250 
      remark #15478: estimated potential speedup: 1.860 
      remark #15482: vectorized math library calls: 2 
      remark #15486: divides: 3 
      remark #15488: --- end vector cost summary ---
   LOOP END

                            Called  Recurse     Wall      max      min %_of_clea
  clear_sky_total (SW)           1     -       1.362    1.362    1.362   100.00
    gas_optics (SW)            100     -       0.781    0.013 7.64e-03    57.34
    rte_sw                     100     -       0.580 6.01e-03 5.76e-03    42.60
      sw_2stream               100     -       0.577 5.98e-03 5.71e-03    42.38
        sw_dif_and_source    11200     -       0.470 6.30e-05 4.00e-05    34.53
        sw_adding            11200     -       0.043 1.00e-05 3.00e-06     3.19

This is with Intel. With gfortran, there is another serious issue: the mu0 conditional prevents vectorization. Removing it halved the total runtime of the RFMIP SW program (2 -> 1 second, now faster than ifort!). Perhaps the code could be changed so that the mo_rte_sw checks that mu0 is positive in all the input columns, otherwise calls the slower version with mu0 conditional inside kernels? Or you could even write the functionality into the same (but admittedly more bloated) SW kernel with an argument check_for_positive_mu0 (perhaps some host models have already removed nighttime columns before calling RTE-SW, so they can set it to false).

earth-system-radiation / rte-rrtmgp

RTE shortwave kernel not vectorizing #215