Bonus Task: Optimization of Nogtom Module

Changes

Brief Expression For Loops

For nested loops, we use a more simple format. For example:

        do k = 1 , kz
          do i = ici1 , ici2
            do j = jci1 , jci2
              do n = 1 , nqx
                qxtendc(n,j,i,k) = mc2mo%qxten(j,i,k,n)
              end do
            end do
          end do
        end do

We use do concurrent to modify the aforementioned loop block into a simple and standardized expression.

         do concurrent( n = 1 : nqx,j = jci1 : jci2,i = ici1 : ici2, k = 1 : kz)
            qxtendc(n,j,i,k) = mc2mo%qxten(j,i,k,n)
         end do

For elementwise operations with neat subscripts, such as assignment or calculation at the same position in right and left array, we can have a more concise method. For instance:

        do k = 1 , kz
          do i = ici1 , ici2
            do j = jci1 , jci2
              ttendc(j,i,k) = mc2mo%tten(j,i,k)
            end do
          end do
        end do

We use the modern Fortran method of array slicing assignment to modify the aforementioned loop block into a concise and standardized expression, as follows:

ttendc(jci1:jci2,ici1:ici2,1:kz) = mc2mo%tten(jci1:jci2,ici1:ici2,1:kz)

Similarly, for more complex cases of array slicing assignment, such as

      ! Delta pressure
      do k = 1 , kz
        do i = ici1 , ici2
          do j = jci1 , jci2
            dpfs(j,i,k) = mo2mc%pfs(j,i,k+1)-mo2mc%pfs(j,i,k)
          end do
        end do
      end do

We modify it to the following code:

      ! Delta pressure
      do k = 1 , kz
         dpfs(jci1:jci2,ici1:ici2,k) = mo2mc%pfs(jci1:jci2,ici1:ici2,k+1)-mo2mc%pfs(jci1:jci2,ici1:ici2,k)
      end do

We have verified aforementioned usage of array slicing assignment in Test/Evaluation2, confirming the functional correctness of this method.

Remove Argsort

From line 1727 to 1790 in mod_micro_nogtom_old.F90 which is the target blocks, we find out that the argsort function is actually redundant by analysing the output of Test/Evaluation1, which is an experiment on verifing the correctness without argsort.dd

By analysing the output of Test/Evaluation1 , we can also prove that the output of argsort, i.e. iorder is useless among the target code blocks. We can also logically remove it step by step. For example:

            sinksum(:) = d_zero
            !----------------
            ! recalculate sum
            !----------------
            do n = 1 , nqx
              do jn = 1 , nqx
                jo = iorder(n)
                lind2(jo,jn) = qsexp(jo,jn) < d_zero
                sinksum(jo) = sinksum(jo) - qsexp(jo,jn)
              end do
            end do
            !---------------------------
            ! recalculate scaling factor
            !---------------------------
            do n = 1 , nqx
              jo = iorder(n)
              ratio(jo) = max(qx0(jo),verylowqx) / &
                 max(sinksum(jo),max(qx0(jo),verylowqx))
            end do

For the aforementioned code, the operation on sinksum recalculates sinksum in the order of iorder. Notice that the assignment of the loop:

            do n = 1 , nqx
              do jn = 1 , nqx
                jo = iorder(n)
                lind2(jo,jn) = qsexp(jo,jn) < d_zero
                sinksum(jo) = sinksum(jo) - qsexp(jo,jn)
              end do
            end do

where lind2(jo,jn) and sinksum(jo) are independent from the order in which jo iterates. So we can directly remove jo and modify the block into following code:

            do n = 1 , nqx
              do jn = 1 , nqx
                lind2(n,jn) = qsexp(n,jn) < d_zero
                sinksum(n) = sinksum(n) - qsexp(n,jn)
              end do
            end do

Since the method of calculating sinksum has no change on its value, the calculation turns out to be redundant. The same applies to ratio. Therefore, the target blocks can be simplified to lind2=qsexp<d_zero, and the calculations for sinksum and ratio are removed.

For the loop:

            !------
            ! scale
            !------
            do n = 1 , nqx
              do jn = 1 , nqx
                jo = iorder(n)
                if ( lind2(jo,jn) ) then
                  qsexp(jo,jn) = qsexp(jo,jn)*ratio(jo)
                  qsexp(jn,jo) = qsexp(jn,jo)*ratio(jo)
                end if
              end do
            end do

It can also be proven that this symmetric position assignment method of qsexp(jo,jn) and qsexp(jn,jo) is also independent from the order of row traversal jo = iorder(n). Therefore, jo can be directly simplified to n, and the code is modified to

            !------
            ! scale
            !------
            do n = 1 , nqx
              do jn = 1 , nqx
                if ( lind2(n,jn) ) then
                  qsexp(n,jn) = qsexp(n,jn)*ratio(n)
                  qsexp(jn,n) = qsexp(jn,n)*ratio(n)
                end if
              end do
            end do

Multiplication Optimization

As for following block：

        do k = 1 , kz
          do i = ici1 , ici2
            do j = jci1 , jci2
              dp = dpfs(j,i,k)
              dtgdp = dt*egrav/dp
              rain = d_zero
              rainh = d_zero
              do n = 1 , nqx
                rain = rain + dt*pfplsx(n,j,i,k+1)
                if ( iphase(n) == 1 ) then
                  rainh = rainh+wlhvocp*dtgdp*pfplsx(n,j,i,k+1)*dp
                else if ( iphase(n) == 2 ) then
                  rainh = rainh+wlhsocp*dtgdp*pfplsx(n,j,i,k+1)*dp
                end if
              end do
              sumq1(j,i,k) = sumq1(j,i,k) + rain
              sumh1(j,i,k) = sumh1(j,i,k) - rainh
            end do
          end do
        end do

For the aforementioned code, we note that in the accumulation processes of rainh, there is a expression dtgdp * pfplsx(n,j,i,k+1) * dp. Additionally, we find that dtgdp equals dt * egrav / dp. Therefore, by the commutative property of multiplication, dtgdp * dp equals dt * egrav / dp * dp, which simplifies to dt * egrav. Thus, the calculation of rainh can be simplified to rainh = rainh + wlhvocp * dt * egrav * pfplsx(n,j,i,k+1) and rainh = rainh + wlhsocp * dt * egrav * pfplsx(n,j,i,k+1). Furthermore, we can remove the temporary variables rain and rainh by directly using the finally assigned variables sumq1 and sumh1. By applying the do concurrent method, we optimized it into final version.

         do concurrent(j=jci1:jci2,i=ici1:ici2,k=1:kz)
            do n = 1 , nqx
               sumq1(j,i,k) = sumq1(j,i,k) + dt*pfplsx(n,j,i,k+1)
               if ( iphase(n) == 1 ) then
                  sumh1(j,i,k) = sumh1(j,i,k) -wlhvocp*dt*egrav*pfplsx(n,j,i,k+1)
               else if ( iphase(n) == 2 ) then
                  sumh1(j,i,k) = sumh1(j,i,k) -wlhsocp*dt*egrav*pfplsx(n,j,i,k+1)
               end if
            end do
         end do

In Test/Evaluation3, we measured the performance, and this version is 60% faster than the original.

Results

For given input file ISC24.in:

Time spent on mod_micro_nogtom.F90 speeds up from 384s to 339s with about12% performance improvement.

And percent of time spent on mod_micro_nogtom.F90 decreases from 30.7% to 28.1%, profiled by Vtune.

Evaluation and Experiment

We do a lot to ensure we get correct answer and better performance.

Evaluation1: In this optimization, we remove function argsortand code blocks using the result of argsortiorder. We show how it works in mod_micro_nogtom.F90 and for random inputs there exists the same situation which parently shows that function argsortcan be deleted with no impact on correctness but save 10% time to run this file.

Evaluation2: In this optimization, we verify the correctness like aamax = maxval(abs(qlhs(:,n))) and

do k = 1 , kz 
sumh0(jci1:jci2,ici1:ici2,k) = sumh0(jci1:jci2,ici1:ici2,k)/mo2mc%pfs(jci1:jci2,ici1:ici2,k+1)
end do

Through the output, we show that the results are correct.

Evaluation3: In this experiment we speed up the block by logical analysis. We will show how it works in mod_micro_nogtom.F90 and test the speedup times. It turns out to be about 60% speedup along with logical optimization and removing local vars on the example blocks. We tried many ways to optimize including divide the independent outcomes to run seperately. Finally we apply a multiplication optimization and do concurren to speed up.

Experiment1: In this experiment we test on replacing function argsortwith Bubble Sort algorithm to Quick Sort algorithm. We show how it works better in mod_micro_nogtom.F90 only with nqx > 100. However, for current inputs with nqx = 5,7 or 10, there exists the same situation where argsortbecome much slower in Quick Sort for more complex computation. So function argsortcan't be replaced by QuickSort. Finally, we still show this test and hope that QuickSort can be a choice for future version when argsortis reused and nqx expanded to larger scale.

Experiment2: In this experiment we change the sequence of loop from k, i, j, n to n, k, i, j and use do concurrent properties to speed up. We switch n to the outside thus we decrease the time spent on if else statement. However it has no profit for the block along with do concurrent. We only replace it with do concurrent for better readability.

Experiment3: In this experiment, we use loop unrolling on the mysolve subroutine to speed up. We try to replace the nqx with its actual value, with input file ISC24.in nqx=5, this will help compiler to speed up. But it doesn't work well for small nqx scale.

To better contrast what we have done on the target file, you can choose mod_micro_nogtom.F90 and mod_micro_nogtom_old.F90 to scan for the same time

ICTP / RegCM

ISC24 Bouns task by SYSU #42