Detailed Changes in the Pull Request

This pull request includes the following major changes to the mod_micro_nogtom module:

1. Added Compiler Directives

The code was optimized by adding OpenMP directives that enable SIMD vectorization, specifically !$omp simd (with simdlen(8) on the innermost j loops). These directives let the compiler process several data elements per instruction. The compiler vectorization report guided their placement by showing which loops had not been vectorized.

The !dir$ ivdep directive was added to tell the compiler to ignore assumed (unproven) dependencies between loop iterations when vectorizing. It is an assertion by the programmer rather than a check: the compiler does not verify it, so the iterations must actually be independent for the generated code to be correct.

The !dir$ vector always directive was added above array initializations such as sumh1(:,:,:) = d_zero. It tells the compiler to vectorize the statement even when its cost heuristics would decide against it.

The !dir$ novector directive was added above loops that iterate from 1 to nqx to instruct the compiler not to vectorize them. Because nqx is small (found to be 5), vectorizing these loops could add more overhead than it saves and decrease performance.

We also added !$omp parallel do directives to check whether threading could bring any performance improvement, but the threading overhead outweighed the gain. We did not remove these directives. Instead, we run the application with OMP_NUM_THREADS=1, so the parallel regions execute on a single thread and the directives have no effect.
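The placement of the four directives described above can be illustrated with a minimal standalone sketch. It is not taken from mod_micro_nogtom: the program name, the arrays a, b, q and idx, and the sizes n and nq are hypothetical. The !dir$ lines are Intel-specific, and compilers that do not recognise them (or that are run without OpenMP enabled) treat them as ordinary comments.

```fortran
program directive_sketch
  implicit none
  integer, parameter :: wp = kind(1.0d0)
  integer, parameter :: n  = 1024   ! hypothetical long dimension
  integer, parameter :: nq = 5      ! short trip count, similar to nqx
  real(wp) :: a(n), b(n), q(nq)
  integer  :: idx(n)
  integer  :: i

  ! Whole-array initialisation: request vectorization even if the
  ! compiler's cost model would decline it.
  !dir$ vector always
  a(:) = 0.0_wp

  do i = 1, n
    b(i)   = real(i, wp)
    idx(i) = n + 1 - i     ! a permutation: no index appears twice
  end do

  ! Iterations are independent, so the loop is a natural SIMD target.
  !$omp simd
  do i = 1, n
    a(i) = 2.0_wp*b(i) + 1.0_wp
  end do

  ! Indirect store: the compiler cannot prove that idx has no repeats.
  ! ivdep asserts it; this is only safe because idx is a permutation.
  !dir$ ivdep
  do i = 1, n
    a(idx(i)) = a(idx(i)) + b(i)
  end do

  ! Very short loop: keep it scalar.
  !dir$ novector
  do i = 1, nq
    q(i) = 0.5_wp*a(i)
  end do

  print *, a(1), a(n), q(nq)
end program directive_sketch
```

With the Intel compiler, the vectorization report (for example via -qopt-report) can then be used to confirm which of these loops were actually vectorized.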

2. Performed Scalar Expansion

Scalar expansion was applied to several scalar temporaries so that the loops using them vectorize better. Each scalar was replaced by an array with one element per loop iteration. The following arrays were introduced:

tnew_expanded
dp_expanded
qe_expanded
tmpl_expanded
tmpi_expanded
zdelta_expanded
phases_expanded

This helped vectorize loops that could not otherwise have been vectorized, because every iteration overwrote the same scalar variable.

Consider the following loop in the original code:

do k = 1 , kz
    do i = ici1 , ici2
        do j = jci1 , jci2
            tnew = tx(j,i,k)
            dp = dpfs(j,i,k)
            qe = mo2mc%qdetr(j,i,k)

            if ( k > 1 ) then
                sumq0(j,i,k) = sumq0(j,i,k-1) ! total water
                sumh0(j,i,k) = sumh0(j,i,k-1) ! liquid water temperature
            end if

            tmpl = qx(iqql,j,i,k)+qx(iqqr,j,i,k)
            tmpi = qx(iqqi,j,i,k)+qx(iqqs,j,i,k)
            tnew = tnew - wlhvocp*tmpl - wlhsocp*tmpi
            sumq0(j,i,k) = sumq0(j,i,k)+(tmpl+tmpi+qx(iqqv,j,i,k))*dp*regrav

            ! Detrained water treated here
            if ( lmicro .and. abs(qe) > activqx ) then
                sumq0(j,i,k) = sumq0(j,i,k) + qe*dp*regrav
                alfaw = qliq(j,i,k)
                tnew = tnew-(wlhvocp*alfaw+wlhsocp*(d_one-alfaw))*qe
            end if
            sumh0(j,i,k) = sumh0(j,i,k) + dp*tnew
        end do
    end do
end do

All the scalars that were assigned inside the loop, i.e., tnew, dp, qe, tmpl and tmpi, were replaced with their array versions. The scalar alfaw was removed by using qliq(j,i,k) directly.

do k = 1 , kz
    do i = ici1 , ici2
        !$omp simd simdlen(8)
        do j = jci1 , jci2
            tnew_expanded(j,i,k) = tx(j,i,k)
            dp_expanded(j,i,k) = dpfs(j,i,k)
            qe_expanded(j,i,k) = mo2mc%qdetr(j,i,k)

            if ( k > 1 ) then
                sumq0(j,i,k) = sumq0(j,i,k-1) ! total water
                sumh0(j,i,k) = sumh0(j,i,k-1) ! liquid water temperature
            end if

            tmpl_expanded(j,i,k) = qx(iqql,j,i,k)+qx(iqqr,j,i,k)
            tmpi_expanded(j,i,k) = qx(iqqi,j,i,k)+qx(iqqs,j,i,k)
            tnew_expanded(j,i,k) = tnew_expanded(j,i,k) - &
                wlhvocp*tmpl_expanded(j,i,k) - wlhsocp*tmpi_expanded(j,i,k)
            sumq0(j,i,k) = sumq0(j,i,k) + &
                (tmpl_expanded(j,i,k)+tmpi_expanded(j,i,k)+qx(iqqv,j,i,k)) * &
                dp_expanded(j,i,k)*regrav

            ! Detrained water treated here
            if ( lmicro .and. abs(qe_expanded(j,i,k)) > activqx ) then
                sumq0(j,i,k) = sumq0(j,i,k) + qe_expanded(j,i,k)*dp_expanded(j,i,k)*regrav
                tnew_expanded(j,i,k) = tnew_expanded(j,i,k) - &
                    (wlhvocp*qliq(j,i,k)+wlhsocp*(d_one-qliq(j,i,k))) * &
                    qe_expanded(j,i,k)
            end if
            sumh0(j,i,k) = sumh0(j,i,k) + dp_expanded(j,i,k)*tnew_expanded(j,i,k)
        end do
    end do
end do

Similar changes have been performed for the variables zdelta and phases.
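The idea behind scalar expansion can also be seen in isolation. The following is a minimal sketch on a made-up one-dimensional loop, not code from the module: the names x, tmp, tmp_expanded, y_scalar and y_expanded and the size n are hypothetical. It runs the scalar and the expanded form side by side and stops with an error if they disagree.

```fortran
program scalar_expansion_sketch
  implicit none
  integer, parameter :: wp = kind(1.0d0)
  integer, parameter :: n  = 16        ! hypothetical loop length
  real(wp) :: x(n), y_scalar(n), y_expanded(n)
  real(wp) :: tmp                      ! scalar temporary, overwritten every iteration
  real(wp) :: tmp_expanded(n)          ! expanded form: one slot per iteration
  integer  :: j

  do j = 1, n
    x(j) = real(j, wp)
  end do

  ! Scalar form: every iteration writes and reads the same variable tmp.
  do j = 1, n
    tmp = 2.0_wp*x(j)
    tmp = tmp - 1.0_wp
    y_scalar(j) = tmp*tmp
  end do

  ! Expanded form: iteration j only touches element j of each array,
  ! so no storage is shared between iterations.
  !$omp simd
  do j = 1, n
    tmp_expanded(j) = 2.0_wp*x(j)
    tmp_expanded(j) = tmp_expanded(j) - 1.0_wp
    y_expanded(j) = tmp_expanded(j)*tmp_expanded(j)
  end do

  ! The inputs are small integers, so both forms are exact and must match.
  if (any(y_scalar /= y_expanded)) error stop 'scalar and expanded results differ'
  print *, 'scalar and expanded forms agree'
end program scalar_expansion_sketch
```

The trade-off is memory: the expanded temporaries occupy a full array each, in exchange for removing the shared scalar from the loop body.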

3. Restructured Loops for Efficiency

The structure of some loops has been modified to make the code more efficient. Consider the following loop in the original code:

do k = 2 , kz
    do i = ici1 , ici2
        do j = jci1 , jci2
            do kk = 2 , k
                if ( mc2mo%fcc(j,i,kk-1) > cldtopcf .and. &
                    mc2mo%fcc(j,i,kk)  <= cldtopcf ) then
                    cldtopdist(j,i,k) = cldtopdist(j,i,k) + mo2mc%delz(j,i,kk)
                end if
            end do
        end do
    end do
end do

This loop was restructured so that the cloud-top condition is evaluated once per level, instead of being re-evaluated in the inner kk loop for every k at each combination of (i, j). The modified code:

!dir$ vector always
cloud_sum_calc(:,:) = d_zero
!$omp parallel do
do k = 2 , kz
    do i = ici1 , ici2
        !$omp simd simdlen(8)
        do j = jci1 , jci2
            if ( mc2mo%fcc(j,i,k-1) > cldtopcf .and. &
                mc2mo%fcc(j,i,k)  <= cldtopcf ) then
                cloud_sum_calc(j,i) = cloud_sum_calc(j,i) + mo2mc%delz(j,i,k)
            end if
        end do
    end do
end do

!$omp parallel do
do k = 2 , kz
    do i = ici1 , ici2
        !$omp simd simdlen(8)
        do j = jci1 , jci2
            cldtopdist(j,i,k) = cloud_sum_calc(j, i)
        end do
    end do
end do

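The modified code first accumulates, over all levels k = 2 to kz, the delz values that satisfy the cloud-top condition into the temporary two-dimensional array cloud_sum_calc. It then copies that sum into every level k = 2 to kz of the cldtopdist array.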
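For comparison, the original nested loop gives each level k the sum of the contributions from levels 2 to k only. A minimal standalone sketch of a single-pass form that keeps this per-level running sum is shown below, under stated assumptions: the array sizes, the threshold value and the random input data are hypothetical, and fcc, delz and cldtopcf are plain local stand-ins for the model fields. The sketch checks the single-pass result against the nested reference.

```fortran
program cldtop_running_sum_sketch
  implicit none
  integer, parameter :: wp = kind(1.0d0)
  integer, parameter :: nj = 4, ni = 3, kz = 6   ! hypothetical sizes
  real(wp), parameter :: cldtopcf = 0.1_wp       ! hypothetical threshold
  real(wp) :: fcc(nj,ni,kz), delz(nj,ni,kz)
  real(wp) :: ref(nj,ni,kz), fast(nj,ni,kz)
  integer  :: i, j, k, kk

  ! Hypothetical input: cloud fractions in [0, 0.2), thicknesses in [0, 1).
  call random_number(fcc)
  fcc = 0.2_wp*fcc
  call random_number(delz)

  ! Reference: the nested form, re-summing levels 2..k for every k.
  ref = 0.0_wp
  do k = 2, kz
    do i = 1, ni
      do j = 1, nj
        do kk = 2, k
          if ( fcc(j,i,kk-1) > cldtopcf .and. &
               fcc(j,i,kk)  <= cldtopcf ) then
            ref(j,i,k) = ref(j,i,k) + delz(j,i,kk)
          end if
        end do
      end do
    end do
  end do

  ! Single pass: level k starts from the sum at level k-1 and adds the
  ! contribution of level k. The k loop carries a dependence and stays
  ! sequential; only the j loop is a SIMD candidate.
  fast = 0.0_wp
  do k = 2, kz
    do i = 1, ni
      !$omp simd
      do j = 1, nj
        fast(j,i,k) = fast(j,i,k-1)
        if ( fcc(j,i,k-1) > cldtopcf .and. &
             fcc(j,i,k)  <= cldtopcf ) then
          fast(j,i,k) = fast(j,i,k) + delz(j,i,k)
        end if
      end do
    end do
  end do

  if (maxval(abs(ref - fast)) > 1.0e-12_wp) error stop 'running sums differ'
  print *, 'single-pass running sum matches the nested loop'
end program cldtop_running_sum_sketch
```

This form needs no temporary array and does the same amount of work per level as the two-loop version, at the cost of keeping the k loop serial.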

Correctness Validation

The team verified the correctness of the changes by comparing the output file generated by the modified implementation with the output file generated by the original implementation. The experiments were conducted on the PARAMSANGANAK supercomputer at IIT Kanpur. lrcemip_perturb was set to false to disable any randomization, so that the two outputs could be compared directly.

Build Script

source $PROJECT/RegCM-setvars.sh
source $PROJECT/IPM-setvars.sh
./configure CC=icc FC=ifort CXX=icpc MPICC=mpiicc MPIFC=mpiifort MPIF90=mpiifort CFLAGS="-g -O3" FCFLAGS="-g -O3 -qopenmp -diag-disable=10448 -qopenmp-simd -march=core-avx2 -align array64byte -assume contiguous_assumed_shape -assume contiguous_pointer"
make version
make install

Run Script

#!/bin/bash
#SBATCH -N 4
#SBATCH --error=err.out
#SBATCH --output=out.out
#SBATCH --time=01:00:00
#SBATCH --partition=RM

source /jet/packages/oneapi/v2023.2.0/setvars.sh
source $PROJECT/RegCM-setvars.sh
source $PROJECT/IPM-setvars.sh
cp $REGCM_ROOT/Testing/isc24.in .
cp $REGCM_ROOT/Testing/rcemip.in profile.in
mkdir output
ln -sf $REGCM_ROOT/bin/regcmMPIRCEMIP .
export OMP_NUM_THREADS=1
mpirun -ppn 128 ./regcmMPIRCEMIP ./isc24.in

Performance Improvements

We checked the performance of the application, specifically the nogtom module, by profiling it with VTune on PARAMSANGANAK. Since the code in the module is serial, we ran 48 processes, all on one node, and measured the total compute time of the nogtom subroutine summed over all 48 processes. The input files were altered to simulate 1 day instead of the 10 days in the original input file.

For the smaller input file, isc24_small.in, we observed a speedup of about 1.12x, from about 300 seconds to 267 seconds (about 11% less time). As we had expected from vectorization of instructions, the improvement was larger on the larger input file, isc24.in: a speedup of about 1.23x, from 6331 seconds to 5143 seconds (about 19% less time).

Submission for the Bonus Task

This pull request is the submission for the bonus task of RegCM in the Student Cluster Competition (SCC) at ISC'24 from Team ExaDecimals, IIT Kanpur.

The changes described above aim to improve the performance and efficiency of the mod_micro_nogtom module, while maintaining the correctness of the implementation. The team has put significant effort into optimizing the code and is confident that these changes will contribute to the overall performance of the RegCM model.

ICTP / RegCM

ISC24 bonus task by IIT Kanpur #44