NOAA-GFDL / GFDL_atmos_cubed_sphere

The GFDL atmos_cubed_sphere dynamical core code
Other
56 stars 118 forks source link

Fix out-of-bounds access in module_diag_hailcast.F90 which crashes RRFS on WCOSS2. #308

Closed SamuelTrahanNOAA closed 9 months ago

SamuelTrahanNOAA commented 10 months ago

Description

Fixes a bug that can crash the RRFS ensembles. When KBAS=1, there's an out-of-bounds write in an array. That corrupts memory and occasionally crashes the model. Also updates the comments in that region of code and adds author information.

Fixes https://github.com/NOAA-GFDL/GFDL_atmos_cubed_sphere/issues/309

All changes are from @adams-selin

How Has This Been Tested?

Reran the failed case on Hera and WCOSS2 Dogwood. It passed. Also ran the regression tests on Hera.

Checklist:

Please check all whether they apply or not

SamuelTrahanNOAA commented 10 months ago

This is a draft because I discovered I used the wrong fix. I need to do this instead:

-    DO k=KBAS,nz
+    DO k=KBAS+1,nz
        RWA_new(k) = RWA_new(k) / (h1d(k)-h1d(k-1))
     ENDDO

I'm testing this now on WCOSS2 and Hera.

SamuelTrahanNOAA commented 10 months ago

The correct fix works too. This is ready for review.

SamuelTrahanNOAA commented 10 months ago

Indeed, I would like the original author to take a look at this.

ywangwof commented 10 months ago

@SamuelTrahanNOAA Your fix look good. Actually, I have a pull request (#258) to address this problem long time ago. I do not know how to push forward for its merging. It is because I do not have write permission and am also not familiar with the merging process. Anyway, if you can merge that pull request with this one, it will be the best. Note that the variable KBAS is not only used in this loop. That is why I insist the pull request #258 may be better. Thanks.

SamuelTrahanNOAA commented 10 months ago

SamuelTrahanNOAA Your fix look good. Actually, I have a pull request (#258) to address this problem long time ago. I do not know how to push forward for its merging. It is because I do not have write permission and am also not familiar with the merging process. Anyway, if you can merge that pull request with this one, it will be the best. Note that the variable KBAS is not only used in this loop. That is why I insist the pull request #258 may be better. Thanks.

@ywangwof I already have PRs for fv3atm and ufs-weather-model, so it is best if we merge any fixes into my PR. Otherwise, someone else will have to redo that extra work.

@lharris4 disagreed with your solution and wanted this instead:

https://github.com/NOAA-GFDL/GFDL_atmos_cubed_sphere/pull/258/files#r1160886041

I think the most elegant way would be to change the loop to be k=2,nz, and then once cwitot .ge. 1.e-12 is satisfied you can exit the loop. The way it is written now, once KBAS is set to k < nz, the condition k .lt. KBAS will never again be satisfied and you don't need to keep iterating.

Regardless, you also need my fix. Otherwise, the buggy loop will apply the correction to one extra index.

    DO k=1,nz
...
       ELSE IF ((k.ge.KBAS+1).AND.(RWA_adiabat(k).ge.1.E-12)) THEN
          RWA_new(k) = RWA_adiabat(k)*(h1d(k)-h1d(k-1)) - RWA_new(k-1)
          IF (RWA_new(k).LT.0) RWA_new(k) = 0.
       ENDIF
    ENDDO
...
    DO k=KBAS,nz
       RWA_new(k) = RWA_new(k) / (h1d(k)-h1d(k-1))
    ENDDO

Only indices KBAS+1 through nz are multiplied by h1d(k)-h1d(k-1) in the prior loop. The next loop will divide RWA_new(k)

ywangwof commented 10 months ago

SamuelTrahanNOAA Your fix look good. Actually, I have a pull request (#258) to address this problem long time ago. I do not know how to push forward for its merging. It is because I do not have write permission and am also not familiar with the merging process. Anyway, if you can merge that pull request with this one, it will be the best. Note that the variable KBAS is not only used in this loop. That is why I insist the pull request #258 may be better. Thanks.

@ywangwof I already have PRs for fv3atm and ufs-weather-model, so it is best if we merge any fixes into my PR. Otherwise, someone else will have to redo that extra work.

@lharris4 disagreed with your solution and wanted this instead:

https://github.com/NOAA-GFDL/GFDL_atmos_cubed_sphere/pull/258/files#r1160886041

I think the most elegant way would be to change the loop to be k=2,nz, and then once cwitot .ge. 1.e-12 is satisfied you can exit the loop. The way it is written now, once KBAS is set to k < nz, the condition k .lt. KBAS will never again be satisfied and you don't need to keep iterating.

Do you read my reply to @lharris4 's comments?

Regardless, you also need my fix. Otherwise, the buggy loop will apply the correction to one extra index.

    DO k=1,nz
...
       ELSE IF ((k.ge.KBAS+1).AND.(RWA_adiabat(k).ge.1.E-12)) THEN
          RWA_new(k) = RWA_adiabat(k)*(h1d(k)-h1d(k-1)) - RWA_new(k-1)
          IF (RWA_new(k).LT.0) RWA_new(k) = 0.
       ENDIF
    ENDDO
...
    DO k=KBAS,nz
       RWA_new(k) = RWA_new(k) / (h1d(k)-h1d(k-1))
    ENDDO

Only indices KBAS+1 through nz are multiplied by h1d(k)-h1d(k-1) in the prior loop. The next loop will divide RWA_new(k)

I did not get it. The original out-of-bound problem is caused by (k-1) when k = 1. If we limit that KBAS >=2, then k-1 will be guaranteed always >=1. What do I miss? Thanks.

SamuelTrahanNOAA commented 10 months ago

Your code fails unit analysis.

This comment says h1d has a unit. Specifically, meters. For the equations to make sense, the units must be consistent.

!!!!     h1d          height above sea level (m)

RWA_new indices from KBAS+1 onward are scaled by h1d(k)-h1d(k-1). AT k=KBAS, it is not scaled:

       IF (k.eq.KBAS) THEN
          RWA_new(k) = RWA_adiabat(k)
       ELSE IF ((k.ge.KBAS+1).AND.(RWA_adiabat(k).ge.1.E-12)) THEN
          RWA_new(k) = RWA_adiabat(k)*(h1d(k)-h1d(k-1)) - RWA_new(k-1)
       ENDIF

The next loop removes that scaling:

    DO k=KBAS,nz
       RWA_new(k) = RWA_new(k) / (h1d(k)-h1d(k-1))
    ENDDO

It does so for KBAS, which has not been scaled.

Actually, now that I look closer, your units are inconsistent earlier.

At RWA_new(KBAS+1), you're adding numbers of two different units. The RWA_new(KBAS) is in units of RWA_adiabat while RWA_new(KBAS+1) is RWA_adiabat * meters + RWA_adiabat. You're adding meters to unitless, which is not valid.

       IF (k.eq.KBAS) THEN
          RWA_new(k) = RWA_adiabat(k)
          ! k==KBAS has units of RWA_adiabat
       ELSE IF ((k.ge.KBAS+1).AND.(RWA_adiabat(k).ge.1.E-12)) THEN
          RWA_new(k) = RWA_adiabat(k)*(h1d(k)-h1d(k-1)) - RWA_new(k-1)
          ! Units at k=KBAS+1 are inconsistent
          ! RWA_new(k) = RWA_adiabat*meters + RWA_adiabat = RWA_adiabat * (meters + unitless)
       ENDIF
ywangwof commented 10 months ago

Thanks @SamuelTrahanNOAA. Now I see the problem.

So even with your fix, it still fails the unit analysis. I will propose a fix like the following together with my original fix to limit KBAS >= 2.

       IF (k.eq.KBAS) THEN
          RWA_new(k) = RWA_adiabat(k)*(h1d(k)-h1d(k-1))
       ELSE IF ((k.ge.KBAS+1).AND.(RWA_adiabat(k).ge.1.E-12)) THEN
          RWA_new(k) = RWA_adiabat(k)*(h1d(k)-h1d(k-1)) - RWA_new(k-1)
       ENDIF
    DO k=KBAS,nz
       RWA_new(k) = RWA_new(k) / (h1d(k)-h1d(k-1))
    ENDDO

BTW, the original author of this code is John Henderson. I just re-organized it for the UFS framework and made a few tests. I should NOT claim it is my code. Anyway, I do not know John's Github ID, but I will sent him an email to remind him to check the correctness of our fixes. Thanks again.

SamuelTrahanNOAA commented 10 months ago

@ywangwof - Can you describe the physical phenomena that can lead to KBAS=1?

We're debugging failures of the RRFS model which may be coming from bad initial conditions. Another scheme failed recently due to implausible near-surface conditions. If we know what to look for, it may pare down the possible causes.

SamuelTrahanNOAA commented 10 months ago

Also, that fix still isn't enough. The initial value of RWA_new is also in the wrong units:

RWA_new(k) = RWA(k)

It is only replaced later if RWA_adiabat(k) is not near zero:

ELSE IF ((k.ge.KBAS+1).AND.(RWA_adiabat(k).ge.1.E-12)) THEN
   RWA_new(k) = RWA_adiabat(k)*(h1d(k)-h1d(k-1)) - RWA_new(k-1)

If I knew what the code was trying to do with RWA_new and RWA_adiabat, I could fix this. Alas, there isn't sufficient commenting for me to know without reading through several research papers.

adams-selin commented 10 months ago

I'm the original author of the HAILCAST code. John Henderson did the work of implementing it within FV3, but since this issue is well within the HAILCAST subroutine itself, I can help here. The change should be

-    DO k=KBAS,nz
+    DO k=KBAS+1,nz
+        IF (RWA_adiabat(k).ge.1.E-12) THEN
              RWA_new(k) = RWA_new(k) / (h1d(k)-h1d(k-1))
+        ENDIF
     ENDDO

The k=1,nz do loop at line 809 needs to remain the same.

RWA, RWA_new, and RWA_adiabat are all cloud liquid water mixing ratio profiles. RWA comes directly from the model. RWA_adiabat is a mixing ratio profile assuming there is no entrainment. It assumes the cloud water available at cloud base (RBAS), is only reduced aloft by bringing the air to saturation (RSA), or glaciation (icefactor). (Lines 815-834).

RWA_new is, essentially, RWA below cloud base, and RWA_adiabat at cloud base and above it, with the caveat that any moisture expended at height levels below the current one can't be included in RWA_adiabat. (Lines 837-842). At cloud base (k=KBAS), no moisture has been expended yet, so RWA_adiabat can't be multiplied by height levels at that point. Above cloud base, (k>=KBAS+1), the amount of moisture expended over a layer will be linearly dependent on the depth of the layer.

In terms of units, after line 842:

Thus, the only points where the height scaling (and extra units of m) need to be removed are from k=KBAS+1:nz, where RWA_adiabat(k).ge.1.E-12. The code changes I've pasted above account for that. (The extra check for RWA_adiabat is going to make very little difference numerically, but is good to add for physical completeness.)

ywangwof commented 10 months ago

Thanks @adams-selin. Just being curious, in what condition KBAS=1? clear sky or bad model states?

SamuelTrahanNOAA commented 10 months ago

RWA_new(KBAS) has units (kg/kg) RWA_new(KBAS+1:nz) has units (kg/kg)*m where RWA_adiabat is .ge. 1.E-12. Where RWA_adiabat is less than 1.E-12, RWA_new has units of (kg/kg), because it has been set to RWA at the top of the k=1,nz loop.

This means you have a unit mismatch at RWA_new(KBAS+1)

   RWA_new(k) = RWA_adiabat(k)*(h1d(k)-h1d(k-1)) - RWA_new(k-1)
   RWA_new(KBAS+1) = RWA_adiabat(k)*(h1d(KBAS+1)-h1d(KBAS)) - RWA_new(KBAS)
   RWA_new(KBAS+1) units = kg/kg * m - kg/kg = m - 1
   RWA_new(KBAS+1) units = 1 * m - 1
   RWA_new(KBAS+1) units = m - 1

You are subtracting a unitless number from meters. That propagates onward through the rest of the k indexes up to nz.

adams-selin commented 10 months ago

Thanks @adams-selin. Just being curious, in what condition KBAS=1? clear sky or bad model states?

When you have a cloud essentially on the ground. Rare, but I could potentially see it being possible with a low LCL, depending on how the model layer spacing is near the surface.

adams-selin commented 10 months ago

@SamuelTrahanNOAA, yes, you are correct there is a units mismatch there, depending on how you view it. However, I'm viewing the RWA_new(KBAS) term as a constant in the integration, where the summation over height levels is the integration.

This makes sense physically to me, if you imagine trying to calculate the amount of water retained at each point in a leaky box, with input into one end of the box by a hose. RWA_new(KBAS) is the amount of water coming in the box from the hose. RWA_new(KBAS+1) is the amount of water retained at several points along the length of the box, which is dependent on the amount of water that has leaked out before reaching that point....which is dependent on the length of the box over which the water has traveled to that point.

I think the units might not match up because I should be, technically, calculating a moisture flux from RBAS over a set unit of time. But that would need an updraft speed and more, so I'm just assuming the moisture input is constant.

SamuelTrahanNOAA commented 10 months ago

The unit mismatch is still a unit mismatch no matter how you view it.

Your constant of integration isn't constant. If you want it to be constant, you must multiply it by dh at every level before dividing by dh at the end.

After your integration, if all RWA_adiabat(k).ge.1.E-12, using pseudocode,

RWA_new(k) = (sum(RWA_adiabat(m)*(h1d(m)-h1d(m-1))
                      - RWA_new(KBAS), m=KBAS+1..k))
              / (h1d(m)-h1d(m-1))

Reduces to

RWA_new(k) = (sum(RWA_adiabat(m)*(h1d(m)-h1d(m-1)), m=KBAS+1..k))
                  / (h1d(m)-h1d(m-1))
            - (k-KBAS) * RWA_new(KBAS) / (h1d(k)-h1d(m-1))

Which means:

  1. The "constant" term varies in magnitude at each level.
  2. Your final sum is also inconsistent in units (index/height + unitless)

I don't have the equation you're trying to solve, but I'm guessing it is this:

RWA_new(z) = RWA(z) - integrate(d(RWA_adiabat(z))/dz where RWA_adiabat(z) > 1e-12, z=zbas...zend)

The point being, to remove cloud liquid water output at lower levels. There's some safeguards against removing negative moisture or dealing with near-zero values.

If so, the correct code would be:

   DO k=1,nz
       RWA_lower_dh(k) = 0
       ...
       IF ((k.ge.KBAS+1).AND.(RWA_adiabat(k).ge.1.E-12)) THEN
          ! Integrate RWA_adiabat from KBAS...k into RWA_lower_dh
          RWA_lower_dh(k) = MAX(0, RWA_adiabat(k)*(h1d(k)-h1d(k-1))) + RWA_lower_dh(k-1)
       ENDIF
   ENDDO

   DO k=KBAS+1,nz
      ! Subtract the integration from RWA to get RWA_new
      RWA_new(k) = RWA(k) - max(0, RWA_lower_dh(k) / (h1d(k)-h1d(KBAS)) + RWA_adiabat(KBAS))
   ENDDO
adams-selin commented 10 months ago

It's not a unit mismatch because at the cloud base only, (KBAS), RBAS is technically an amount of moisture that has been fluxed in across cloud base. I'm assuming an constant updraft speed of 1 m/s, a constant source RBAS mixing ratio of units kg/kg, and that it is happening over 1 sec of time. Normally one would think of fluxes in 3D, but remember we are in a 1D column here. That means, at the cloud base only, RBAS has units (kg kg-1)(m s-1)(s), or (kg kg-1)(m). Since, at the cloud base, RWA_new(KBAS) = RWA_adiabat(KBAS) = RWA(KBAS) = RBAS, that means RWA_new(KBAS) also has units of (kg kg-1)(m). Once we get above KBAS, however, that no longer holds true, and RWA_adiabat is in units of (kg kg-1) and needs the height multiplier. Hence the height factor removal only needs to be done to those points at KBAS+1 and above.

I will note I included this exact fix in a pull request for WRF that was approved ~6 months ago (https://github.com/wrf-model/WRF/pull/1877).

SamuelTrahanNOAA commented 10 months ago

@adams-selin - Your explanation satisfies me, and I had already changed my PR to match your prior comment.

Can you confirm my PR contains the right fix?

Once you confirm, I'll take the ufs-weather-model PR out of draft status so they can proceed with testing and merging.

bensonr commented 10 months ago

@adams-selin @SamuelTrahanNOAA - because of the discussion surrounding this logic, I'd like to see the relevant information be documented within the source code. Can you also ensure proper provenance (ownership/history) for the file?

SamuelTrahanNOAA commented 10 months ago

because of the discussion surrounding this logic, I'd like to see the relevant information be documented within the source code. Can you also ensure proper provenance (ownership/history) for the file?

Quite sensible. I'd like @adams-selin's permission before putting their name in the file.

Also, @adams-selin, can you provide an updated version of that region with explanatory comments? Otherwise, I'll give it my best try.

adams-selin commented 10 months ago

@SamuelTrahanNOAA Sure. The code in the PR request is correct. Here are the commenting changes I'd make:

  1. I'd ask that the following be added to the top of the file, at current line 4:

    CAM-HAILCAST software designed and tested by Rebecca Adams-Selin in collaboration with Conrad Ziegler at the National Severe Storms Laboratory. Documentation provided in Adams-Selin and Ziegler 2016 MWR; Adams-Selin et al. 2019 WF; Adams-Selin 2023 MWR; and Pounds et al. 2024 JAS.
  2. At current line 808, add the following commented explanation:

    RWA_adiabat is a mixing ratio profile calculated by assuming there is no entrainment. It assumes the cloud water available at cloud base (RBAS) is only reduced aloft by bringing the air in the column to saturation (RSA), or glaciation (icefactor). 
    RWA_new is, essentially, RWA below cloud base, and RWA_adiabat at cloud base and above it, with the caveat that any moisture expended at height levels below the current one can't be included in RWA_adiabat. 
  3. At current line 829, change the current 1-line comment to:

    Calculate the potential amount of liquid water available at each vertical level, assuming no entrainment.  The cloud water available at cloud base (RBAS), is only reduced aloft by bringing the air to saturation (RSA), or by glaciation (icefactor).
  4. At current lines 835-836, change the 2-line comment to:

    Remove cloud liquid water outputted at previous lower levels. At cloud base (KBAS), RWA_new(KBAS) = RWA_adiabat(KBAS) = RWA(KBAS) = RBAS. RBAS is technically an amount of moisture that has been fluxed in across cloud base. I'm assuming an constant updraft speed of 1 m/s, a constant source mixing ratio of units kg/kg, and that it is happening over 1 sec of time. Normally one would think of fluxes in 3D, but remember we are in a 1D column here. That means, at the cloud base only, RBAS has units (kg kg-1)(m s-1)(s), or (kg kg-1)(m). Once we get above KBAS, however, that no longer holds true, and RWA_adiabat is in units of (kg kg-1). We then need to multiply it by the height of the layer to determine the flux of liquid water "out of the column" (i.e., bringing previous layers to saturation) over that layer.
  5. Remove the commented-out old code at current lines 843-848.

  6. At line 851, before the changes in the PR, change the current 1-line comment to:

    Remove the height scaling factor from RWA_new at all k values above the cloud base (KBAS), to return RWA_new units to kg kg-1.
laurenchilutti commented 9 months ago

@adams-selin @ywangwof I am unable to officially add you as a reviewer on this PR, but would you please leave a comment stating if you approve these changes?

adams-selin commented 9 months ago

Yes, I approve these changes. Thanks.

laurenchilutti commented 9 months ago

@jkbk2004 This PR was discussed in today's UFS code management meeting and I think it is ready to merge as we have approval from the original author of the code now.

jkbk2004 commented 9 months ago

@jkbk2004 This PR was discussed in today's UFS code management meeting and I think it is ready to merge as we have approval from the original author of the code now.

@laurenchilutti Thanks for the note! I will follow up next week. @SamuelTrahanNOAA @hu5970 FYI

jkbk2004 commented 9 months ago

All tests are done at https://github.com/ufs-community/ufs-weather-model/pull/2064. @bensonr can you merge this pr?