bena-nasa opened this issue 1 year ago
I did confirm that running the same experiment with the "debug" build makes the issue go away. Here is the same picture after 24 hours of "IWP" with the debug build; it clearly looks much more reasonable.
This is scary. It does seem to eliminate the suggestion by @wmputman that this is possibly due to an errant formula applying to the whole array.
I see 2 possibilities from this:
(2) is more plausible in my estimation.
I see some odd code in the interface. From my reading of it, SINST is only defined for PLmb greater than 500 mb; below that it just does:
qcvarr8 = 2.0
but then later on this:
qcvarr8 = 0.67 - 0.38*SINST + 4.96*xscale - 8.32*SINST*xscale
occurs for all pressures, which would just replace the sub-500 mb value. But since SINST isn't defined for some cases, it might be filling with garbage.
Still, I can't see how this could cause the MPI layout to appear in the output. And I have no idea if qcvarr8 or relvarr8 matter later on or if they are just diagnostic.
It does look like QCVAR_EXP is a possible export to look at, but I don't think it's one of the default ones. Maybe that looks weird?
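To make the pattern concrete, here is a minimal, self-contained sketch (not the actual interface code; the values given to PLmb and xscale are made up) of what is being described, where the else branch never assigns SINST but the later formula reads it for every pressure:

program sinst_sketch
  implicit none
  real(8) :: PLmb, SINST, xscale, qcvarr8

  PLmb   = 300.0d0      ! a level where the else branch is taken
  xscale = 1.0d0

  if (PLmb > 500.0d0) then
     SINST = 0.1d0      ! in the real code SINST is only computed on this branch
  else
     qcvarr8 = 2.0d0    ! SINST is never assigned here
  end if

  ! This runs for all pressures, overwriting the 2.0 above and reading
  ! SINST even when the else branch left it uninitialized.
  qcvarr8 = 0.67d0 - 0.38d0*SINST + 4.96d0*xscale - 8.32d0*SINST*xscale

  print *, 'qcvarr8 = ', qcvarr8   ! garbage/compiler-dependent when PLmb <= 500 mb
end program sinst_sketch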
@bena-nasa Tested setting SINST to 0 in the <500 mb case but it didn't help the weird pattern. Still, @dbarahon should look at this code and try and figure out what it should be doing.
Another data point. I tried building this model version (v11.1.1 with donifan's branch) with gfortran. A c90 stock experiment (the 1 moment moist) runs fine. However, if I turn on the 2 moment physics it just crashes in the first run step with either the debug or optimized gfortran build, so apparently something in 2 moment is so screwy that even gfortran with no optimization can't handle it. Who knows if it is related to the original issue at hand, but clearly another red flag that something is not well in this 2 moment code.
relvarr8 and qcvarr8 are not diagnostic; however, the line qcvarr8 = min(max(qcvarr8, 0.5), 50.0) is a limiter. It might be that without setting a value for SINST, qcvarr8 becomes NaN, bypassing that line. However, I haven't seen any weirdness in the QCVAR_EXP export. Initializing SINST to zero should fix it.
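A tiny standalone sketch of that NaN point (not GEOS code; only the limiter line is taken from the comment above). Whether a NaN survives min/max is compiler-dependent, since comparisons against NaN are false, so on some compilers the limiter is indeed bypassed as described:

program nan_limiter
  use, intrinsic :: ieee_arithmetic, only: ieee_value, ieee_quiet_nan
  implicit none
  real(8) :: qcvarr8

  qcvarr8 = ieee_value(qcvarr8, ieee_quiet_nan)   ! stand-in for qcvarr8 poisoned by an unset SINST
  qcvarr8 = min(max(qcvarr8, 0.5d0), 50.0d0)      ! the limiter from the interface
  print *, 'qcvarr8 after limiter = ', qcvarr8    ! if the NaN survives here, the limiter was bypassed
end program nan_limiter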
Summary of facts to the best of my knowledge as of November 2nd, 2023: The main issue as well as the required tag (which specifies the compiler, of course) is in the first post of this issue, so I will not repeat it here. The TLDR is that with Intel, if you choose the release build with the 2 moment physics, you get these weird patterns in certain moist variables. Another way to see this is that the model does not pass layout regression with the 2 moment microphysics.
Ways to fix this: I have found that building GEOS_MGB2_2M_InterfaceMod.F90 with -O0 seems to fix it. Also, just commenting out the calls to mmicro_pcond/micro_mg_tend_interface (which I think ARE the 2 moment computational core) so you bypass that code also "fixes" it, by virtue of the fact you don't exercise the code. In a quick check of those interfaces (which is quite daunting because they have 100+ and 200+ arguments) I have not seen anything like mismatched bounds for, say, center/edge variables.
The debug build appears to pass regression/not show these weird patterns.
What is confusing is that I took the debug build and selectively compiled just the moist library with the release flags, and that did not show the bad behavior. So it is like memory is just being moved around; it is not JUST that one file being built with optimization that triggers this.
gfortran (using gcc 12.1 and openmpi 4.1.3): The code does appear to pass layout regression (which we assume means it is behaving right) with optimized gfortran. There is one caveat. By default, when you choose the 2 moment physics, you get a timestep of 1800 seconds with an 1800 second component timestep for CHEMISTRY/GOCART/HEMCO/GF/UW. If you choose the 1 moment, the default appears to be a 450 second timestep with a 450 second timestep for those components.
For some reason, if you use this 1800 second timestep, the model just crashes in the dynamics right away, so to do the gfortran tests I had to set the timestep back to 450 seconds; note that Intel has no trouble with this timestep. I have made a separate issue for that: https://github.com/GEOS-ESM/GEOSgcm_GridComp/issues/850
@wmputman @dbarahon Is the 2 moment broken in some tags? If you use 11.3.1 (one of the latest model releases) and choose the 2 moment option in gcm_setup, NOTHING runs. I've tried our stock testing restarts at c90 and c24, and even the debug build just segfaults in the first timestep in that 2 moment code in both cases.
I've been trying this outside of v11.1.1 + donifan's 2 moment branch to see if it also shows the problem, but at least in 11.3.1, if you choose 2 moment it seems not to run at all.
My earlier observation that there was an out of bounds issue with c24 was not correct (I had done something to disable the callback when I was futzing with the component testing framework, which was causing that erroneous error). I was able to set up a c24 experiment following Donifan's c90 using v11.1.1 + donifan's 2 moment branch, and that also does not pass layout regression.
So at least the "2 moment problem" seems to be reproducible at lower resolution with specific model release/branch version I have been testing with. The 2 moment in v11.3.1 is still broken though it seems...
A few more data points.
I redid the test where only moist is built with optimization and the rest of GEOS is built with the debug flags; unfortunately I must not have done the test right before. The rationale for this test is that if this is a compiler optimization problem, this executable should also fail layout regression. This is with v11.1.1 and the donifan branch. Unfortunately this crashes with the following error even before reaching the 2 moment.
libGEOSmoist_Grid 00002AF037985384 convpar_gf2020_mp 2435 ConvPar_GF2020.F90
libGEOSmoist_Grid 00002AF037977F7E convpar_gf2020_mp 1591 ConvPar_GF2020.F90
libGEOSmoist_Grid 00002AF037966332 convpar_gf2020_mp 633 ConvPar_GF2020.F90
libGEOSmoist_Grid 00002AF0378DD22F geos_gf_interface 517 GEOS_GF_InterfaceMod.F90
libGEOSmoist_Grid 00002AF037A61AEE geos_moistgridcom 5311 GEOS_MoistGridComp.F90
which implies there are even more issues here.
2nd data point. Apparently all these changes to the 2 moment have NOT been merged into any stable tag, to the best of my knowledge, and the 2 moment option in any v11.x.y tag should be considered non-functional. One must either use the donifan branch referenced in the first post OR use the feature/wmputman/hwt_spring_exp branch of @wmputman. I tried cloning v11.3.2 (the last GEOSgcm fixture release) and updating the GEOSgcm_GridComp to feature/wmputman/hwt_spring_exp, which at the time of cloning was on commit 5cd9926e381b51e72f1edddbf8cd779c8e02a495. I did the standard release build, then used that executable in the same experiment I have been running all along at c24. That also did not pass layout regression, so clearly this branch has the same issue.
I am concerned that no one can give me a stable tag to reproduce this, only branches, which of course can change. Hence I have been referencing the SHA codes in this issue.
And yet another disturbing thing. As I reported in the previous post, with @wmputman's branch I can reproduce the problem we are after using the release build. BUT, if I use the DEBUG build of the model configuration described in the previous post, the model crashes with a floating point exception in ConvPar_GF2020.F90 rather than running and passing layout regression like Donifan's branch does.
I tried cloning v11.3.2. Running MG1 out of the box without updates, setting CLDMICR_OPTION: MGB2_2M, MGVERSION: 1, CONVPAR_OPTION: GF, USE_GF2020: 1 and USE_FCT: 0, does not show the weird behavior. This seems to conflict with @bena-nasa's results. The run can be found at /gpfsm/dnb34/dbarahon/DEV/MG1_v1132. This would of course point to the latest updates as the source of the problem. However, from my prior experience the layout issue would show up for seemingly unrelated changes, like adding a constant or so to the code. In any case it seems like the course of action would be to add the latest changes one by one and test every time.
@dbarahon @wmputman So I thought that Bill's branch could reproduce the problem; well, I was using layout reproducibility as a proxy. Donifan's branch, when showing these bad patterns, was not layout reproducible, so I assumed they were one and the same. But in the experiment run with Bill's branch, I'm not seeing the "grid" pattern when I actually look.
So yay, but even after a single step, 2 runs at different layouts diverged, and they were diverging after GF runs, not the 2 moment code.
Likewise I also cloned stock v11.3.2 and used the same AGCM.rc settings as @dbarahon used in the experiment in the previous post in my own c90 experiment. That also did not show any "grid" pattern in the moist fields, but it also did not pass layout regression, so it is absolutely showing weird behavior.
So clearly there is another issue with this code, since it absolutely should pass layout regression. The question is: is this related to the "grid pattern" problem, or is it another manifestation of a deeper issue?
@wmputman The 2 moment code I'm testing with your branch (and that I guess Scott is making a release of) is not right. Besides the fact it doesn't layout reproduce, the debug build of the model just crashes in GF, and worse, if I take the release build of the model but build just the moist library with the debug flags, it fails too, but fails differently, also in GF! I'll pursue these, but in my mind I would consider the 2 moment option to be non-functional...
Update of the situation as of November 22nd, 2023, as I see it:
Scott has released a new version of the GEOSgcm fixture, v11.3.3, that has all of Bill's and Donifan's changes that they wanted to see get into a release, so that we have a stable release to serve as the baseline to track the issue. (Bill had broken layout regression in GF2020 too in his development branch, which confused me for a while, but he fixed that in what was passed to Scott, and GF2020 passes layout regression for me in this release.)
In the discussion below I will refer to the file GEOS_MGB2_2M_InterfaceMod.F90 as the "MG2 interface" for brevity. My test to find where the code was diverging with different layouts was to add multiple calls to "MAPL_CheckpointState" to dump out the internal state of moist at various strategic points in the MG2 interface the first time the run method is executed. I simply ran the code at 2 layouts and found the first dumped checkpoint in the sequence that differed.
Using this release, the following can be stated when MG2 is enabled in a C90 experiment. All these tests were performed on either Skylake or Cascade Lake nodes; I have not done this experiment on SCU17. With Intel (compiler and MPI both 2021.6.0, the stock compiler used by v11.3.3 of the fixture on SLES 12)
With gfortran (openmpi 4.1.3, gcc 12.1.0, same baselibs version as in stock tag)
The summary: the new release does not pass layout regression when MG2 is turned on, because of the call to either hystpdf or mmicro_pcond, depending on the optimization/compiler.
I cannot reproduce this grid imprint Donifan was seeing, but in my mind this is irrelevant. The code is not passing layout regression with 2 different compilers. As far as where to investigate, mmicro_pcond seems to be the place (although the gfortran results suggest that perhaps the memory corruption may move).
I am concerned simply that mmicro_pcond and micro_mg_tend_interface (an alternative to mmicro_pcond) have a ridiculous number of arguments. We have seen an Intel bug that was related to the number of arguments, and the sheer number, plus the fact that they are called per column inside an i-j loop, makes this incredibly hard to debug/investigate.
Update with v11.5.2 of the GEOSgcm fixture The stock tag at c24 still fails layout regression when MGB2_2M is selected as the microphysics using gcm_setup. Compiling GEOS_MGB2_2M_InterfaceMod.F90 at O0 still fixes the regression failure.
I guess the question is does @dbarahon have a newer MG code? Perhaps the v12 candidate has it? Or is there a new branch?
I think we should confirm this with @dbarahon before doing anything further.
Hi, just merged my latest cleaned-up version of MG. It should work with v11.5.2 and it is very close to what is implemented in Bill's v12_rc1 tag.
Donifan
Just merged my latest updates. The stock code is very outdated. Instead, check out feature/donifan/mg3clean.
@dbarahon my initial test seemed to say that your branch is now passing layout reproducibility but failing start/stop
Yes, that's what I am finding as well. Setting NCPL and NCPI to constant values makes it pass start-stop. Those are variables we want to be able to predict however.
@dbarahon that's weird, those are in the moist_internal_checkpoint. Also, they are internal, so I would have thought something you want to "predict" means you would want to write it in History which means it should be an export.
I just meant that they can't be constant . They are part of the internal.
@dbarahon what did you do to make NCPL and NCPI constant? I'm not sure what you mean by that. I tried setting them to 0 in the run method of the mg2 code but that didn't fix the start/stop regression
For some reason zero does not seem to be a good value. Uncomment Lines 1162 and 1163 in the run method.
! NCPL = 50.0e6
! NCPI = 1.0e6
It might also just be that it is not reproducible.
Well, I'm not sure it means much, but NCPL and NCPI are the only two internals in MG that have a default... other than Q.
And it looks like in the Thompson microphysics they have defaults of 50e6 and 1e3 in there.
Do we know if the Thompson microphysics regresses?
Ugh, so my first idea just makes things more confusing. Starting at 21z, I did a 6 hour run and a 3+3 hour run, and wrote out the import and internal state of moist at 0z in both at the start of the run method, to see if they are different, since a start/stop failure usually means something is missing in the restart. Well, both 0z states are the same, meaning that after restarting the run at 0z, moist is getting an identical import and internal state coming in compared to a continuous run :(
I then also tried dumping these states at the END of the run method at 0z. Now they have gone non-zero diff. So despite seeing the same import and internal state coming in, the two runs end up at a different place.
@dbarahon The internal state goes non-zero diff in the two experiments in these lines in the 2 moment:
!update Number concentrations
NCPL = NCPL + DNDCNV*DT_MOIST
NCPI = NCPI + DNICNV*DT_MOIST
Since NCPL and NCPI are in the internal restart, the only logical conclusion is that DNDCNV is different, which I see is computed a few lines above. I need to leave for today, but that's where I will investigate on Monday.
I think I found it. DNDCNV is computed with these lines:
DNDCNV = CNV_NDROP*CNV_MFD*iMASS
DNICNV = CNV_NICE*CNV_MFD*iMASS
But CNV_MFD is not set before being used. The pointer is retrieved earlier but it's never set to anything before this. That will cause a start/stop failure.
@dbarahon says CNV_MFD is filled in by GF. So either GF didn't fill it in or CNV_NDROP/CNV_NICE has gone non-zero diff.
As an aside: the fact that you are using the export state here, which someone else had to fill, is just incredibly confusing... Also, the alloc=.true. in the get pointer for CNV_MFD can't be necessary. If it is, then that's just a flat-out logic bug, since it would result in an uninitialized variable the first time run is called.
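For context, a rough sketch of the retrieval pattern being questioned (the exact arguments here are illustrative, not copied from the interface): with alloc=.true. the pointer comes back associated even if GF never filled the export, so the multiply can read whatever the freshly allocated array happens to hold on the first pass.

! Illustrative fragment only, not the verbatim GEOS_MGB2_2M_InterfaceMod.F90 code.
call MAPL_GetPointer(EXPORT, CNV_MFD, 'CNV_MFD', alloc=.true., rc=STATUS)
! ... nothing in this component assigns CNV_MFD before it is used ...
DNDCNV = CNV_NDROP*CNV_MFD*iMASS   ! on the first pass CNV_MFD may hold whatever the fresh allocation left behind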
Dang. I was hoping it was found. I'm still trying to follow the code around, but no luck. But at least there's a point to focus on!
So far, one interesting thing is that CDNC_NUC is set to a constant after it was calculated a bit above that.
It is not supposed to be. I was testing whether the issue was in the aerosol_activate code. Must have forgotten to remove it.
You were right. All that was needed was to verify that CNV_MFD and CNV_FICE were associated. I just updated my branch. It is passing both start-stop and layout regression for me.
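For the record, the shape of that guard, as a sketch using the variable names from this thread (not the exact diff in the branch):

! Only apply the convective number tendencies when the exports that GF is
! supposed to fill are actually associated (sketch, not the exact change).
if (associated(CNV_MFD) .and. associated(CNV_FICE)) then
   DNDCNV = CNV_NDROP*CNV_MFD*iMASS
   DNICNV = CNV_NICE*CNV_MFD*iMASS
   NCPL   = NCPL + DNDCNV*DT_MOIST
   NCPI   = NCPI + DNICNV*DT_MOIST
end if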
Wooo! Congrats to both @bena-nasa and @dbarahon for figuring this out.
Hmm. I just tried a merge of feature/donifan/mg3clean into feature/sdrabenh/gcm_v12-rc1 (so as to test with the v12 candidate and make sure it's happy there) and got:
/discover/swdev/mathomp4/Models/GEOSgcm-V12-2024Jun12/src/Components/@GEOSgcm_GridComp/GEOSagcm_GridComp/GEOSphysics_GridComp/GEOSmoist_GridComp/aer_actv_single_moment.F90(7): error #6580: Name in only-list does not exist or is not accessible. [AERPROPSNEW]
USE aer_cloud, only: AerPropsNew
---------------------------^
Looks like I might need @dbarahon help to do this right. I must have screwed up the merge.
I concur, I pulled Donifan's feature/donifan/mg3clean and now the start/stop tests that were failing on Friday are passing.
Just needs to add that new structure to aer_cloud.F90
Might need you to try that. MG code is complex to me! :)
Just making an issue for this problem since it was not done (sorry if the title is not right, but I'm not an atmospheric scientist; all these different "moist" schemes are just gobbledygook to me). This is not a new issue; user @atrayano has been looking at it with a fine-tooth comb, I'm told. Two users, @dbarahon and @wmputman, have been seeing "odd" patterns in some of the moist exports when running the 2 moment moist physics (I assume it is two moment because they have CLDMICR_OPTION: MGB2_2M in the AGCM.rc file). Both of them are seeing things like this when examining output; this is "IWP" for example (this picture was made with ncview, using the "low" option for the scale, which @dbarahon said is the easiest way to spot this): the domain decomposition is clearly appearing in the field and is just not physical (I don't know how else to describe this).
In order to reproduce, here is what you have to do. First checkout v11.1.1 of the model. Then update the GEOSgcm_GridComp repo to the feature/donifan/KrokMG3 branch. When I did the test, the SHA code for GEOSgcm_GridComp I used was 41d6f34f035193794b454c807e8a6a61bc7f9610. Once built, clone the experiment on discover at /gpfsm/dnb78s1/bmauer/donifan_weirdness. The AGCM.rc has this for the moist options, which I assume is critical:
Note I'm also told that @wmputman has a feature branch that works off of develop branches of the GEOSgcm fixture: feature/wmputman/hwt_spring_exp, but I don't want to record anything here for reproducibility that depends on a branch, since branches change... I thought user @dbarahon claimed that downgrading the optimization to O1 in the moist component fixes this (maybe I misunderstood), but that's not what my testing showed, so if there was a workaround that was not it. Whether I compiled those files in moist at O3 or O1, I got these weird patterns clearly showing the domain.