For nested loops, we use a more simple format. For example:
do k = 1 , kz
do i = ici1 , ici2
do j = jci1 , jci2
do n = 1 , nqx
qxtendc(n,j,i,k) = mc2mo%qxten(j,i,k,n)
end do
end do
end do
end do
We use do concurrent to modify the aforementioned loop block into a simple and standardized expression.
do concurrent( n = 1 : nqx,j = jci1 : jci2,i = ici1 : ici2, k = 1 : kz)
qxtendc(n,j,i,k) = mc2mo%qxten(j,i,k,n)
end do
For elementwise operations with neat subscripts, such as assignment or calculation at the same position in right and left array, we can have a more concise method. For instance:
do k = 1 , kz
do i = ici1 , ici2
do j = jci1 , jci2
ttendc(j,i,k) = mc2mo%tten(j,i,k)
end do
end do
end do
We use the modern Fortran method of array slicing assignment to modify the aforementioned loop block into a concise and standardized expression, as follows:
Similarly, for more complex cases of array slicing assignment, such as
! Delta pressure
do k = 1 , kz
do i = ici1 , ici2
do j = jci1 , jci2
dpfs(j,i,k) = mo2mc%pfs(j,i,k+1)-mo2mc%pfs(j,i,k)
end do
end do
end do
We modify it to the following code:
! Delta pressure
do k = 1 , kz
dpfs(jci1:jci2,ici1:ici2,k) = mo2mc%pfs(jci1:jci2,ici1:ici2,k+1)-mo2mc%pfs(jci1:jci2,ici1:ici2,k)
end do
We have verified aforementioned usage of array slicing assignment in Test/Evaluation2, confirming the functional correctness of this method.
Remove Argsort
From line 1727 to 1790 in mod_micro_nogtom_old.F90 which is the target blocks, we find out that the argsort function is actually redundant by analysing the output of Test/Evaluation1, which is an experiment on verifing the correctness without argsort.dd
By analysing the output of Test/Evaluation1 , we can also prove that the output of argsort, i.e. iorder is useless among the target code blocks. We can also logically remove it step by step. For example:
sinksum(:) = d_zero
!----------------
! recalculate sum
!----------------
do n = 1 , nqx
do jn = 1 , nqx
jo = iorder(n)
lind2(jo,jn) = qsexp(jo,jn) < d_zero
sinksum(jo) = sinksum(jo) - qsexp(jo,jn)
end do
end do
!---------------------------
! recalculate scaling factor
!---------------------------
do n = 1 , nqx
jo = iorder(n)
ratio(jo) = max(qx0(jo),verylowqx) / &
max(sinksum(jo),max(qx0(jo),verylowqx))
end do
For the aforementioned code, the operation on sinksum recalculates sinksum in the order of iorder. Notice that the assignment of the loop:
do n = 1 , nqx
do jn = 1 , nqx
jo = iorder(n)
lind2(jo,jn) = qsexp(jo,jn) < d_zero
sinksum(jo) = sinksum(jo) - qsexp(jo,jn)
end do
end do
where lind2(jo,jn) and sinksum(jo) are independent from the order in which jo iterates. So we can directly remove jo and modify the block into following code:
do n = 1 , nqx
do jn = 1 , nqx
lind2(n,jn) = qsexp(n,jn) < d_zero
sinksum(n) = sinksum(n) - qsexp(n,jn)
end do
end do
Since the method of calculating sinksum has no change on its value, the calculation turns out to be redundant. The same applies to ratio. Therefore, the target blocks can be simplified to lind2=qsexp<d_zero, and the calculations for sinksum and ratio are removed.
For the loop:
!------
! scale
!------
do n = 1 , nqx
do jn = 1 , nqx
jo = iorder(n)
if ( lind2(jo,jn) ) then
qsexp(jo,jn) = qsexp(jo,jn)*ratio(jo)
qsexp(jn,jo) = qsexp(jn,jo)*ratio(jo)
end if
end do
end do
It can also be proven that this symmetric position assignment method of qsexp(jo,jn) and qsexp(jn,jo) is also independent from the order of row traversal jo = iorder(n). Therefore, jo can be directly simplified to n, and the code is modified to
!------
! scale
!------
do n = 1 , nqx
do jn = 1 , nqx
if ( lind2(n,jn) ) then
qsexp(n,jn) = qsexp(n,jn)*ratio(n)
qsexp(jn,n) = qsexp(jn,n)*ratio(n)
end if
end do
end do
Multiplication Optimization
As for following block:
do k = 1 , kz
do i = ici1 , ici2
do j = jci1 , jci2
dp = dpfs(j,i,k)
dtgdp = dt*egrav/dp
rain = d_zero
rainh = d_zero
do n = 1 , nqx
rain = rain + dt*pfplsx(n,j,i,k+1)
if ( iphase(n) == 1 ) then
rainh = rainh+wlhvocp*dtgdp*pfplsx(n,j,i,k+1)*dp
else if ( iphase(n) == 2 ) then
rainh = rainh+wlhsocp*dtgdp*pfplsx(n,j,i,k+1)*dp
end if
end do
sumq1(j,i,k) = sumq1(j,i,k) + rain
sumh1(j,i,k) = sumh1(j,i,k) - rainh
end do
end do
end do
For the aforementioned code, we note that in the accumulation processes of rainh, there is a expression dtgdp * pfplsx(n,j,i,k+1) * dp. Additionally, we find that dtgdp equals dt * egrav / dp. Therefore, by the commutative property of multiplication, dtgdp * dp equals dt * egrav / dp * dp, which simplifies to dt * egrav.
Thus, the calculation of rainh can be simplified to rainh = rainh + wlhvocp * dt * egrav * pfplsx(n,j,i,k+1) and rainh = rainh + wlhsocp * dt * egrav * pfplsx(n,j,i,k+1).
Furthermore, we can remove the temporary variables rain and rainh by directly using the finally assigned variables sumq1 and sumh1. By applying the do concurrent method, we optimized it into final version.
do concurrent(j=jci1:jci2,i=ici1:ici2,k=1:kz)
do n = 1 , nqx
sumq1(j,i,k) = sumq1(j,i,k) + dt*pfplsx(n,j,i,k+1)
if ( iphase(n) == 1 ) then
sumh1(j,i,k) = sumh1(j,i,k) -wlhvocp*dt*egrav*pfplsx(n,j,i,k+1)
else if ( iphase(n) == 2 ) then
sumh1(j,i,k) = sumh1(j,i,k) -wlhsocp*dt*egrav*pfplsx(n,j,i,k+1)
end if
end do
end do
In Test/Evaluation3, we measured the performance, and this version is 60% faster than the original.
Results
For given input file ISC24.in:
Time spent on mod_micro_nogtom.F90 speeds up from 384s to 339s with about12% performance improvement.
And percent of time spent on mod_micro_nogtom.F90 decreases from 30.7% to 28.1%, profiled by Vtune.
Evaluation and Experiment
We do a lot to ensure we get correct answer and better performance.
Evaluation1: In this optimization, we remove function argsortand code blocks using the result of argsortiorder. We show how it works in mod_micro_nogtom.F90 and for random inputs there exists the same situation which parently shows that function argsortcan be deleted with no impact on correctness but save 10% time to run this file.
Evaluation2: In this optimization, we verify the correctness like aamax = maxval(abs(qlhs(:,n))) and
do k = 1 , kz
sumh0(jci1:jci2,ici1:ici2,k) = sumh0(jci1:jci2,ici1:ici2,k)/mo2mc%pfs(jci1:jci2,ici1:ici2,k+1)
end do
Through the output, we show that the results are correct.
Evaluation3: In this experiment we speed up the block by logical analysis. We will show how it works in mod_micro_nogtom.F90 and test the speedup times. It turns out to be about 60% speedup along with logical optimization and removing local vars on the example blocks. We tried many ways to optimize including divide the independent outcomes to run seperately. Finally we apply a multiplication optimization and do concurren to speed up.
Experiment1: In this experiment we test on replacing function argsortwith Bubble Sort algorithm to Quick Sort algorithm. We show how it works better in mod_micro_nogtom.F90 only with nqx > 100. However, for current inputs with nqx = 5,7 or 10, there exists the same situation where argsortbecome much slower in Quick Sort for more complex computation. So function argsortcan't be replaced by QuickSort. Finally, we still show this test and hope that QuickSort can be a choice for future version when argsortis reused and nqx expanded to larger scale.
Experiment2: In this experiment we change the sequence of loop from k, i, j, n to n, k, i, j and use do concurrent properties to speed up. We switch n to the outside thus we decrease the time spent on if else statement. However it has no profit for the block along with do concurrent. We only replace it with do concurrent for better readability.
Experiment3: In this experiment, we use loop unrolling on the mysolve subroutine to speed up. We try to replace the nqx with its actual value, with input file ISC24.in nqx=5, this will help compiler to speed up. But it doesn't work well for small nqx scale.
To better contrast what we have done on the target file, you can choose mod_micro_nogtom.F90 and mod_micro_nogtom_old.F90 to scan for the same time
Bonus Task: Optimization of Nogtom Module
Changes
Brief Expression For Loops
For nested loops, we use a more simple format. For example:
We use do concurrent to modify the aforementioned loop block into a simple and standardized expression.
For elementwise operations with neat subscripts, such as assignment or calculation at the same position in right and left array, we can have a more concise method. For instance:
We use the modern Fortran method of array slicing assignment to modify the aforementioned loop block into a concise and standardized expression, as follows:
Similarly, for more complex cases of array slicing assignment, such as
We modify it to the following code:
We have verified aforementioned usage of array slicing assignment in
Test/Evaluation2
, confirming the functional correctness of this method.Remove Argsort
From line 1727 to 1790 in
mod_micro_nogtom_old.F90
which is the target blocks, we find out that theargsort
function is actually redundant by analysing the output ofTest/Evaluation1
, which is an experiment on verifing the correctness withoutargsort.dd
By analysing the output of
Test/Evaluation1
, we can also prove that the output ofargsort
, i.e.iorder
is useless among the target code blocks. We can also logically remove it step by step. For example:For the aforementioned code, the operation on
sinksum
recalculatessinksum
in the order ofiorder
. Notice that the assignment of the loop:where
lind2(jo,jn)
andsinksum(jo)
are independent from the order in whichjo
iterates. So we can directly removejo
and modify the block into following code:Since the method of calculating
sinksum
has no change on its value, the calculation turns out to be redundant. The same applies toratio
. Therefore, the target blocks can be simplified tolind2=qsexp<d_zero
, and the calculations forsinksum
andratio
are removed.For the loop:
It can also be proven that this symmetric position assignment method of
qsexp(jo,jn)
andqsexp(jn,jo)
is also independent from the order of row traversaljo = iorder(n)
. Therefore,jo
can be directly simplified ton
, and the code is modified toMultiplication Optimization
As for following block:
For the aforementioned code, we note that in the accumulation processes of
rainh
, there is a expressiondtgdp * pfplsx(n,j,i,k+1) * dp
. Additionally, we find thatdtgdp
equalsdt * egrav / dp
. Therefore, by the commutative property of multiplication,dtgdp * dp
equalsdt * egrav / dp * dp
, which simplifies todt * egrav
. Thus, the calculation ofrainh
can be simplified torainh = rainh + wlhvocp * dt * egrav * pfplsx(n,j,i,k+1)
andrainh = rainh + wlhsocp * dt * egrav * pfplsx(n,j,i,k+1)
. Furthermore, we can remove the temporary variablesrain
andrainh
by directly using the finally assigned variablessumq1
andsumh1
. By applying thedo concurrent
method, we optimized it into final version.In
Test/Evaluation3
, we measured the performance, and this version is 60% faster than the original.Results
For given input file ISC24.in:
Time spent on mod_micro_nogtom.F90 speeds up from 384s to 339s with about12% performance improvement.
And percent of time spent on mod_micro_nogtom.F90 decreases from 30.7% to 28.1%, profiled by Vtune.
Evaluation and Experiment
We do a lot to ensure we get correct answer and better performance.
Evaluation1: In this optimization, we remove function
argsort
and code blocks using the result ofargsort
iorder. We show how it works in mod_micro_nogtom.F90 and for random inputs there exists the same situation which parently shows that functionargsort
can be deleted with no impact on correctness but save 10% time to run this file.Evaluation2: In this optimization, we verify the correctness like
aamax = maxval(abs(qlhs(:,n)))
andThrough the output, we show that the results are correct.
Evaluation3: In this experiment we speed up the block by logical analysis. We will show how it works in mod_micro_nogtom.F90 and test the speedup times. It turns out to be about 60% speedup along with logical optimization and removing local vars on the example blocks. We tried many ways to optimize including divide the independent outcomes to run seperately. Finally we apply a multiplication optimization and do concurren to speed up.
Experiment1: In this experiment we test on replacing function
argsort
with Bubble Sort algorithm to Quick Sort algorithm. We show how it works better in mod_micro_nogtom.F90 only with nqx > 100. However, for current inputs with nqx = 5,7 or 10, there exists the same situation whereargsort
become much slower in Quick Sort for more complex computation. So functionargsort
can't be replaced by QuickSort. Finally, we still show this test and hope that QuickSort can be a choice for future version whenargsort
is reused and nqx expanded to larger scale.Experiment2: In this experiment we change the sequence of loop from k, i, j, n to n, k, i, j and use do concurrent properties to speed up. We switch n to the outside thus we decrease the time spent on if else statement. However it has no profit for the block along with do concurrent. We only replace it with do concurrent for better readability.
Experiment3: In this experiment, we use loop unrolling on the mysolve subroutine to speed up. We try to replace the nqx with its actual value, with input file ISC24.in nqx=5, this will help compiler to speed up. But it doesn't work well for small nqx scale.
To better contrast what we have done on the target file, you can choose mod_micro_nogtom.F90 and mod_micro_nogtom_old.F90 to scan for the same time