This pull request includes the following major changes to the mod_micro_nogtom module:
1. Added Compiler Directives
The code optimization involved adding OpenMP directives to leverage SIMD instructions, which significantly improved performance. The directives were placed strategically to enable vectorization, specifically the !$omp simd directive, which lets the compiler process multiple data elements in parallel. The compiler's vectorization report was a valuable resource during this process, highlighting candidate loops and guiding the placement of the directives.
The !dir$ ivdep directive was added to inform the compiler that the iterations of a loop carry no dependencies, so it is safe to vectorize them even where the compiler cannot prove independence on its own.
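To make the placement concrete, here is a minimal, self-contained sketch (not code from the module; the array and the runtime offset are invented for illustration):

```fortran
program ivdep_demo
  implicit none
  integer :: j, k
  real(8) :: a(16)
  a = 1.0d0
  ! k is only known at run time, so the compiler cannot prove that
  ! a(j) and a(j+k) never overlap; !dir$ ivdep asserts they do not.
  k = 8
  !dir$ ivdep
  do j = 1 , 8
    a(j) = a(j+k) + 1.0d0
  end do
  print *, a(1)
end program ivdep_demo
```

The promise holds here because k is always at least 8, so every read lands outside the written range; the directive shifts responsibility for that guarantee from the compiler to the programmer.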
The !dir$ vector always directive was added above array initializations such as sumh1(:,:,:) = d_zero to ensure that the compiler always vectorizes them.
The !dir$ novector directive was added above loops that iterate from 1 to nqx to instruct the compiler not to vectorize them. This decision was based on the observation that nqx is small (5 in our runs), so vectorizing these short loops would incur an overhead likely to outweigh any benefit.
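A minimal sketch of this pattern (the loop and variable names are hypothetical, not taken from the module):

```fortran
program novector_demo
  implicit none
  integer, parameter :: nqx = 5
  real(8) :: qx(nqx), qtot
  integer :: n
  qx = 1.0d0
  qtot = 0.0d0
  ! With a trip count of only nqx = 5, vector startup and remainder
  ! handling would likely cost more than they save, so the compiler
  ! is told to keep this reduction scalar.
  !dir$ novector
  do n = 1 , nqx
    qtot = qtot + qx(n)
  end do
  print *, qtot
end program novector_demo
```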
We had also added !$omp parallel do directives to check whether threading could bring any performance improvement, but it turned out that the threading overhead outweighed the gains. We did not remove these directives; instead, we run the application after exporting OMP_NUM_THREADS=1, which effectively makes them no-ops.
2. Performed Scalar Expansion
Scalar expansion has been performed on several arrays to allow for better vectorization of the loops. The following arrays have been expanded:
tnew_expanded
dp_expanded
qe_expanded
tmpl_expanded
tmpi_expanded
zdelta_expanded
phases_expanded
This optimization technique helped vectorize loops that could otherwise not have been vectorized, because successive iterations overwrite the same scalar variables.
Consider the following loop in the original code:
do k = 1 , kz
  do i = ici1 , ici2
    do j = jci1 , jci2
      tnew = tx(j,i,k)
      dp = dpfs(j,i,k)
      qe = mo2mc%qdetr(j,i,k)
      if ( k > 1 ) then
        sumq0(j,i,k) = sumq0(j,i,k-1) ! total water
        sumh0(j,i,k) = sumh0(j,i,k-1) ! liquid water temperature
      end if
      tmpl = qx(iqql,j,i,k)+qx(iqqr,j,i,k)
      tmpi = qx(iqqi,j,i,k)+qx(iqqs,j,i,k)
      tnew = tnew - wlhvocp*tmpl - wlhsocp*tmpi
      sumq0(j,i,k) = sumq0(j,i,k)+(tmpl+tmpi+qx(iqqv,j,i,k))*dp*regrav
      ! Detrained water treated here
      if ( lmicro .and. abs(qe) > activqx ) then
        sumq0(j,i,k) = sumq0(j,i,k) + qe*dp*regrav
        alfaw = qliq(j,i,k)
        tnew = tnew-(wlhvocp*alfaw+wlhsocp*(d_one-alfaw))*qe
      end if
      sumh0(j,i,k) = sumh0(j,i,k) + dp*tnew
    end do
  end do
end do
All the scalars being assigned to, i.e., tnew, dp, qe, tmpl and tmpi, were replaced with their expanded array versions:
do k = 1 , kz
  do i = ici1 , ici2
    !$omp simd simdlen(8)
    do j = jci1 , jci2
      tnew_expanded(j,i,k) = tx(j,i,k)
      dp_expanded(j,i,k) = dpfs(j,i,k)
      qe_expanded(j,i,k) = mo2mc%qdetr(j,i,k)
      if ( k > 1 ) then
        sumq0(j,i,k) = sumq0(j,i,k-1) ! total water
        sumh0(j,i,k) = sumh0(j,i,k-1) ! liquid water temperature
      end if
      tmpl_expanded(j,i,k) = qx(iqql,j,i,k)+qx(iqqr,j,i,k)
      tmpi_expanded(j,i,k) = qx(iqqi,j,i,k)+qx(iqqs,j,i,k)
      tnew_expanded(j,i,k) = tnew_expanded(j,i,k) - wlhvocp*tmpl_expanded(j,i,k) - wlhsocp*tmpi_expanded(j,i,k)
      sumq0(j,i,k) = sumq0(j,i,k)+(tmpl_expanded(j,i,k)+tmpi_expanded(j,i,k)+qx(iqqv,j,i,k))*dp_expanded(j,i,k)*regrav
      ! Detrained water treated here
      if ( lmicro .and. abs(qe_expanded(j,i,k)) > activqx ) then
        sumq0(j,i,k) = sumq0(j,i,k) + qe_expanded(j,i,k)*dp_expanded(j,i,k)*regrav
        tnew_expanded(j,i,k) = tnew_expanded(j,i,k)-(wlhvocp*qliq(j,i,k)+wlhsocp*(d_one-qliq(j,i,k)))*qe_expanded(j,i,k)
      end if
      sumh0(j,i,k) = sumh0(j,i,k) + dp_expanded(j,i,k)*tnew_expanded(j,i,k)
    end do
  end do
end do
Similar changes have been performed for the variables zdelta and phases.
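As a hedged sketch of what the same expansion pattern looks like for a variable such as zdelta (a self-contained toy; the array shapes, the condition, and the values are assumptions, not the module's actual logic):

```fortran
program scalar_expansion_demo
  implicit none
  integer, parameter :: nj = 8, nk = 4
  real(8) :: tx(nj,nk), zdelta_expanded(nj,nk)
  integer :: j, k
  tx = 1.0d0
  do k = 1 , nk
    ! Each j iteration now writes its own array element instead of
    ! overwriting a shared scalar, so the loop can be vectorized.
    !$omp simd simdlen(8)
    do j = 1 , nj
      zdelta_expanded(j,k) = merge(1.0d0, 0.0d0, tx(j,k) > 0.0d0)
    end do
  end do
  print *, zdelta_expanded(1,1)
end program scalar_expansion_demo
```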
3. Restructured Loops for Efficiency
The structure of some loops has been modified to make the code more efficient. Consider the following loop in the original code:
do k = 2 , kz
  do i = ici1 , ici2
    do j = jci1 , jci2
      do kk = 2 , k
        if ( mc2mo%fcc(j,i,kk-1) > cldtopcf .and. &
             mc2mo%fcc(j,i,kk) <= cldtopcf ) then
          cldtopdist(j,i,k) = cldtopdist(j,i,k) + mo2mc%delz(j,i,kk)
        end if
      end do
    end do
  end do
end do
which was restructured in the following manner to avoid repeating, for every level k and each (j, i) combination, the accumulation over the levels below it. The modified code:
!dir$ vector always
cloud_sum_calc(:,:) = d_zero
!$omp parallel do
do k = 2 , kz
  do i = ici1 , ici2
    !$omp simd simdlen(8)
    do j = jci1 , jci2
      if ( mc2mo%fcc(j,i,k-1) > cldtopcf .and. &
           mc2mo%fcc(j,i,k) <= cldtopcf ) then
        cloud_sum_calc(j,i) = cloud_sum_calc(j,i) + mo2mc%delz(j,i,k)
      end if
    end do
  end do
end do
!$omp parallel do
do k = 2 , kz
  do i = ici1 , ici2
    !$omp simd simdlen(8)
    do j = jci1 , jci2
      cldtopdist(j,i,k) = cloud_sum_calc(j,i)
    end do
  end do
end do
The modified code first stores the accumulated sums in a temporary array cloud_sum_calc, which is then used to fill the cldtopdist array.
Correctness Validation
The team has ensured the correctness of the changes by comparing the output file generated by the modified implementation against the output file generated by the original implementation. The experiments were conducted on the PARAMSANGANAK supercomputer at IIT Kanpur. lrcemip_perturb was set to false to disable any randomization, so that the two outputs could be compared deterministically.
Build Script
source $PROJECT/RegCM-setvars.sh
source $PROJECT/IPM-setvars.sh
./configure CC=icc FC=ifort CXX=icpc MPICC=mpiicc MPIFC=mpiifort MPIF90=mpiifort CFLAGS="-g -O3" FCFLAGS="-g -O3 -qopenmp -diag-disable=10448 -qopenmp-simd -march=core-avx2 -align array64byte -assume contiguous_assumed_shape -assume contiguous_pointer"
make version
make install
Performance Improvements
We checked the performance of the application, specifically the nogtom module, by profiling it with VTune on PARAMSANGANAK. Since the code in the module is serial, we ran 48 processes, all on one node, and measured the total compute time of the nogtom subroutine. The input files were altered to run for 1 day instead of the 10 days in the original input files.
For the smaller input file isc24_small.in, we observed a speedup of about 1.12x, from about 300 seconds to 267 seconds. The reported time is the overall compute time of the nogtom subroutine summed over all 48 processes.
As we had expected from the vectorization of instructions, the improvement was larger on the bigger input file isc24.in: a speedup of about 1.23x, from 6331 seconds to 5143 seconds.
Submission for the Bonus Task
This pull request is the submission for the bonus task of RegCM in the Student Cluster Competition (SCC) at ISC'24 from Team ExaDecimals, IIT Kanpur.
The changes described above aim to improve the performance and efficiency of the mod_micro_nogtom module, while maintaining the correctness of the implementation. The team has put significant effort into optimizing the code and is confident that these changes will contribute to the overall performance of the RegCM model.