GEOS-ESM / GOCART

GOCART Aerosol model including process library and framework interfaces (MAPL, NUOPC, and CCPP)
Apache License 2.0

0-increment Replays are not 0-diff #140

Closed sdrabenh closed 2 years ago

sdrabenh commented 2 years ago

@lltakacs and @sdrabenh confirmed 0-increment replays fail to be 0-diff if REPLAY_FILE_FREQUENCY is not the default 21600. This is important since OPS has moved to a 2-hourly 4D IAU. Tests using gcm v10.22.0 and after have the bug while v10.21.1 does not. The implication is that something is amiss in how GOCART-2G interacts with the IAU machinery.

In our C48 out-of-the-box case, a 6-hour AMIP was run as a control. Then, a 0-increment replay was run using the following:

    REPLAY_ANA_EXPID:    x0046a
    REPLAY_ANA_LOCATION: /discover/nobackup/projects/gmao/dadev/dao_it/archive/x0046a
    REPLAY_MODE:         Regular
    REPLAY_FILE:         ana/Y%y4/M%m2/x0046a.ana.eta.%y4%m2%d2_%h200z.nc4
    REPLAY_FILE_FREQUENCY:      21600
    REPLAY_FILE_REFERENCE_TIME: 000000
    REPLAY_P:  NO
    REPLAY_U:  NO
    REPLAY_V:  NO
    REPLAY_T:  NO
    REPLAY_QV: NO
    REPLAY_O3: NO
    REPLAY_TS: NO

The results from using the above configuration have 0-diff restarts compared to the AMIP. This is expected. However, if the following is changed ...

    REPLAY_FILE_FREQUENCY: 7200

... 3 restarts become non-0-diff after the first time step. Specifically, there are the following differences:

cdo diffn scratch.amip/achem_internal_checkpoint scratch.0inc-x46a/achem_internal_checkpoint
               Date     Time   Level Gridsize    Miss    Diff : S Z  Max_Absdiff Max_Reldiff : Parameter name
    72 : 2015-05-09 21:07:30      72    13824       0    6484 : F T   3.2613e-16  3.6506e-05 : VOC        
   144 : 2015-05-09 21:07:30      72    13824       0     477 : F T   4.5103e-17  1.6128e-05 : VOCbiob    
  2 of 144 records differ
  0 of 144 records differ more than 0.001
cdo    diffn: Processed 3981312 values from 4 variables over 2 timesteps [0.12s 18MB].

cdo diffn scratch.amip/cabr_internal_checkpoint scratch.0inc-x46a/cabr_internal_checkpoint
               Date     Time   Level Gridsize    Miss    Diff : S Z  Max_Absdiff Max_Reldiff : Parameter name
    72 : 2015-05-09 21:07:30      72    13824       0     418 : F T   2.2204e-16  1.2083e-05 : CAphilicCA.br
  1 of 144 records differ
  0 of 144 records differ more than 0.001
cdo    diffn: Processed 3981312 values from 4 variables over 2 timesteps [0.10s 18MB].

cdo diffn scratch.amip/caoc_internal_checkpoint scratch.0inc-x46a/caoc_internal_checkpoint
               Date     Time   Level Gridsize    Miss    Diff : S Z  Max_Absdiff Max_Reldiff : Parameter name
    72 : 2015-05-09 21:07:30      72    13824       0    6240 : F T   1.7850e-15  0.00031656 : CAphilicCA.oc
  1 of 144 records differ
  0 of 144 records differ more than 0.001
cdo    diffn: Processed 3981312 values from 4 variables over 2 timesteps [0.12s 18MB].

At the end of a 6-hour window, all restarts become non-0-diff. This should not be the case.
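The `cdo diffn` listings above can be checked mechanically instead of by eye. The following is a minimal sketch, assuming the record layout shown in this thread; the function names and the regular expression are illustrative and are not part of CDO or of any GEOS tooling.

```python
import re

# One record line of `cdo diffn` output, in the layout pasted in this thread, e.g.
#   72 : 2015-05-09 21:07:30   72   13824   0   6484 : F T  3.2613e-16  3.6506e-05 : VOC
RECORD = re.compile(
    r"^\s*(?P<rec>\d+)\s*:\s*(?P<date>\S+)\s+(?P<time>\S+)\s+(?P<level>-?\d+)\s+"
    r"(?P<gridsize>\d+)\s+(?P<miss>\d+)\s+(?P<diff>\d+)\s*:\s*\S\s+\S\s+"
    r"(?P<absdiff>\S+)\s+(?P<reldiff>\S+)\s*:\s*(?P<name>\S+)\s*$"
)

def differing_fields(cdo_output):
    """Return {parameter name: (max abs diff, max rel diff)} for differing records."""
    found = {}
    for line in cdo_output.splitlines():
        match = RECORD.match(line)
        if match and int(match.group("diff")) > 0:
            found[match.group("name")] = (
                float(match.group("absdiff")),
                float(match.group("reldiff")),
            )
    return found

def is_zero_diff(cdo_output):
    """True when no record line reports a difference."""
    return not differing_fields(cdo_output)

sample = """\
               Date     Time   Level Gridsize    Miss    Diff : S Z  Max_Absdiff Max_Reldiff : Parameter name
    72 : 2015-05-09 21:07:30      72    13824       0    6484 : F T   3.2613e-16  3.6506e-05 : VOC
   144 : 2015-05-09 21:07:30      72    13824       0     477 : F T   4.5103e-17  1.6128e-05 : VOCbiob
  2 of 144 records differ
"""
print(differing_fields(sample))
```

Any non-empty result means the restart is not 0-diff, however small the differences are. An empty result only shows that no record lines were found, so the command's exit status should still be checked separately.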

mathomp4 commented 2 years ago

CC: @tclune @atrayano

This might be something we need to look at. Though we might need help from @pcolarco or someone who can help point us...

ETA: I added @atrayano since he probably knows the IAU machinery better than most!

lltakacs commented 2 years ago

A significant difference here is that 3D-Var IAU uses a single analysis replay file and creates a single increment file, while the 4D-EnVar IAU uses multiple analysis replay files and creates multiple increment files. This, however, was also the case for v10.21.1 which uses the original GOCART and worked just fine.

pcolarco commented 2 years ago

I don't know anything about the interface to the IAU machinery. If I read this correctly it is maybe the case that this was never tested with GOCART2G? Or is this a new feature with a model change that changes the IAU frequency? If I had to speculate here I would point my finger at the cycling update of some of the oxidant fields related to GOCART SU that are recycled with something like a three-hour frequency, and that maybe that gets broken with this change. Could you try your experiment without running SU and NI children to isolate that?

lltakacs commented 2 years ago

Hi Pete, I think you will need to tell us exactly how to do that. How do you turn off SU and NI? Larry


pcolarco commented 2 years ago

@lltakacs I think @sdrabenh knows how to do this, but in GOCART2g_GridComp.rc you should be able to change the line ACTIVE_INSTANCES_SU: SU #SU.data to ACTIVE_INSTANCES_SU:

and similar for the nitrate (ACTIVE_INSTANCES_NI)

And of course turn off the relevant variables in your HISTORY (we're just testing here)
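For repeated tests, the edit described above could be scripted. This is a rough sketch only: `clear_active_instances` is a hypothetical helper, the sample text is modelled on the lines quoted in this thread, and the real GOCART2G_GridComp.rc may contain more than is shown here.

```python
def clear_active_instances(rc_text, species):
    """Return rc_text with ACTIVE_INSTANCES_<species> left without any value."""
    targets = {f"ACTIVE_INSTANCES_{name}" for name in species}
    lines = []
    for line in rc_text.splitlines():
        key = line.split(":", 1)[0].strip()
        # Keep only the bare key so that no instance names are read in.
        lines.append(f"{key}:" if key in targets and ":" in line else line)
    return "\n".join(lines) + "\n"

# Sample lines modelled on the ones quoted in this thread.
rc = (
    "ACTIVE_INSTANCES_DU:  DU #DU.test\n"
    "ACTIVE_INSTANCES_SU:  SU #SU.data\n"
    "ACTIVE_INSTANCES_NI:  NI #NI.data\n"
)
updated = clear_active_instances(rc, ["SU", "NI"])
print(updated)
```

The helper rewrites text only. Whether the model then runs still depends on the inter-component dependencies and HISTORY settings discussed in this thread.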

mathomp4 commented 2 years ago

One note if you start doing all the tests: if you decide you want to run only NI, you technically have to run NI, SS, and DU since NI depends on SS and DU.

I found that out during my still unsuccessful attempts to try and figure out why GNU GOCART2G doesn't regress

sdrabenh commented 2 years ago

@pcolarco I re-ran both the amip and 7200 second 0-increment replay without SU and NI. This resulted in all restarts being non-0-diff after the first time step.

pcolarco commented 2 years ago

@sdrabenh For clarity for me: you got the correct results with GOCART2G and the older style 21600 second replay?

sdrabenh commented 2 years ago

@pcolarco you are correct. When using the 21600 second replay everything is 0-diff as expected. I also tested the 7200 second replay stopping at 00z (3-hour assimilation window instead of 6 hours) and the same 3 restarts are NZD.

pcolarco commented 2 years ago

Scott, could we have a call this afternoon? I can't tell if this is something where my help would actually be useful, since I just don't know this configuration you are trying to run.



sdrabenh commented 2 years ago

@pcolarco yes we could chat this afternoon. That would be helpful. I would also like to include @lltakacs since he is the IAU guru. Also, we have narrowed down the problem a bit further. The problem appears to occur when more than one forcing file is used in the assimilation/replay window. A replay file frequency of 7200, 10800, or 21600 doesn't matter as long as it is only 1 forcing file per window - similar to the 3D IAU M2 style. When more than one forcing files are used in the assimilation window as in the 4D IAU, that is when things become NZD. So GOCART2G may not be rewinding properly.

sdrabenh commented 2 years ago

For reference, my experiments/configs can be found here: /discover/nobackup/projects/gmao/g6dev/sdrabenh/test_exps/J220_C48_0inc_test

scratch.amip/checkpoints_T1 and scratch.0inc-x46a.6h/checkpoints_T1 are 0-diff (1 forcing in 6hrs)
scratch.amip/checkpoints_T1 and scratch.0inc-x46a.2h_7200/checkpoints_T1 are 0-diff (1 forcing in 2hrs)
scratch.amip/checkpoints_T1 and scratch.0inc-x46a/checkpoints_T1 are NZD (3 forcings in 6hrs)
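The pattern in these three experiments comes down to how many forcing files fall inside one replay window. A small sketch of that arithmetic follows, assuming the window lengths given in the parenthetical notes above; `forcings_per_window` is an illustrative name, not a GEOS function.

```python
def forcings_per_window(window_seconds, replay_file_frequency):
    """Number of forcing (analysis) files that fall in one replay window."""
    if window_seconds % replay_file_frequency != 0:
        raise ValueError("window is not a whole multiple of the file frequency")
    return window_seconds // replay_file_frequency

# (window length in s, REPLAY_FILE_FREQUENCY in s) for the experiments listed above.
cases = {
    "scratch.0inc-x46a.6h": (21600, 21600),
    "scratch.0inc-x46a.2h_7200": (7200, 7200),
    "scratch.0inc-x46a": (21600, 7200),
}

for name, (window, frequency) in cases.items():
    count = forcings_per_window(window, frequency)
    status = "0-diff in this thread" if count == 1 else "NZD in this thread"
    print(f"{name}: {count} forcing(s) per window, {status}")
```

Only the last case has more than one forcing per window, and it is the only one reported as NZD.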

bena-nasa commented 2 years ago

@mathomp4 @sdrabenh @pcolarco @lltakacs I twiddled with this during a meeting. I was able to reproduce it with Scott's first instructions (however many forcings per replay period that is). I then tried turning ExtData off, so there were no emissions, and I did get zero diff. Not sure if that is a clue or not. From what I see in the log, what ExtData is doing makes sense for the 6 hour/3 forcing case.

sdrabenh commented 2 years ago

@pcolarco @lltakacs @bena-nasa @mathomp4 I tried running another test turning off achem as suggested, by setting ENABLE_ACHEM: .FALSE. in GEOS_ChemGridComp.rc. Unfortunately, the following 2 fields were still NZD after the first time step:

cabr_internal_checkpoint.20150509_2107z.nc4
               Date     Time   Level Gridsize    Miss    Diff : S Z  Max_Absdiff Max_Reldiff : Parameter name
    72 : 2015-05-09 21:07:30      72    13824       0     535 : F T   3.3893e-12    0.024081 : CAphilicCA.br
  1 of 144 records differ
  0 of 144 records differ more than 0.001
cdo    diffn: Processed 3981312 values from 4 variables over 2 timesteps [0.02s 18MB].

caoc_internal_checkpoint.20150509_2107z.nc4
               Date     Time   Level Gridsize    Miss    Diff : S Z  Max_Absdiff Max_Reldiff : Parameter name
    72 : 2015-05-09 21:07:30      72    13824       0    6763 : F T   1.0608e-11     0.73855 : CAphilicCA.oc
  1 of 144 records differ
  0 of 144 records differ more than 0.001
cdo    diffn: Processed 3981312 values from 4 variables over 2 timesteps [0.03s 18MB].

pcolarco commented 2 years ago

I will try to have a look later at what this might be. I don't understand it, and it is interesting that evidently "bc" is not an issue, since they are all using the same code.

sdrabenh commented 2 years ago

@pcolarco I also tried running with just DU and SS, but I can't seem to comment out ACTIVE_INSTANCES_CA: CA.oc CA.bc CA.br #CA.oc.data CA.bc.datac without the model crashing. I guess it has to run with carbon?

bena-nasa commented 2 years ago

@pcolarco there are dependencies such that yes, something (in dust, I think) is expecting another GOCART type (CA) to be on. If you look at where it crashed, it is something like that: it was trying to get something that clearly would come from a different GOCART instance for another type. Rather frustrating, as it kind of defeats the purpose of having these separate instances if they are dependent on each other. So you can run a single instance, but it has to be the correct one or ones; there is at least one (sounds like dust) that you can't run by itself.

The other possibility is that you do need to comment out this line in the AGCM.rc #USE_AEROSOL_NN: 0 if you turn off any instances.

pcolarco commented 2 years ago

@sdrabenh Scott, I think you can't "#" comment out the active instances; you have to delete them from the line. That is my experience thus far.

If you turn off the species, to @bena-nasa's point, don't you have to "uncomment" #USE_AEROSOL_NN: 0 to get it to run properly? That is also my experience. This is the aerosol-cloud aware stuff.

I don't know why DU needs other components to run, assuming the above.

sdrabenh commented 2 years ago

@pcolarco you are correct. What I meant by commenting out was placing the comment after the active instances key so there were no values read in - not commenting the entire line.

# Include the constituent in the simulation?
# ----------------------------------------------------
ACTIVE_INSTANCES_DU:  DU #DU.test   
PASSIVE_INSTANCES_DU:

ACTIVE_INSTANCES_SS:   SS #  SS.data
PASSIVE_INSTANCES_SS:

ACTIVE_INSTANCES_SU:  #SU   #SU.data 
PASSIVE_INSTANCES_SU:

ACTIVE_INSTANCES_CA:   #CA.oc  CA.bc  CA.br  #CA.oc.data CA.bc.datac 
PASSIVE_INSTANCES_CA:

ACTIVE_INSTANCES_NI:  #NI  #NI.data 
PASSIVE_INSTANCES_NI:

Sorry for the confusion. @bena-nasa you were correct - setting USE_AEROSOL_NN: 0 and removing all active instances except DU and SS will allow the gcm to run instead of crash. Unfortunately, everything becomes NZD when I do that. Not surprising. I'm not sure that test is very fruitful.
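To make the convention above concrete: placing "#" right after the key leaves the key with no values. The sketch below assumes that everything after "#" on a line is ignored when the file is read, which is how the thread describes it; `active_instances` is a hypothetical helper, not the actual rc-file reader.

```python
def active_instances(rc_text):
    """Map each ACTIVE_INSTANCES_<X> key to the instance names listed before any '#'."""
    prefix = "ACTIVE_INSTANCES_"
    result = {}
    for line in rc_text.splitlines():
        content = line.split("#", 1)[0]  # assumption: '#' starts a comment
        if ":" not in content:
            continue
        key, value = content.split(":", 1)
        key = key.strip()
        if key.startswith(prefix):
            result[key[len(prefix):]] = value.split()
    return result

sample = """\
# Include the constituent in the simulation?
ACTIVE_INSTANCES_DU:  DU #DU.test
PASSIVE_INSTANCES_DU:
ACTIVE_INSTANCES_SS:   SS #  SS.data
ACTIVE_INSTANCES_SU:  #SU   #SU.data
ACTIVE_INSTANCES_CA:   #CA.oc  CA.bc  CA.br  #CA.oc.data CA.bc.datac
ACTIVE_INSTANCES_NI:  #NI  #NI.data
"""
print(active_instances(sample))
```

Under that assumption only DU and SS remain active in the fragment above, which matches the test described in this comment.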

sdrabenh commented 2 years ago

@bena-nasa how did you turn off all ExtData?

mathomp4 commented 2 years ago

> @bena-nasa how did you turn off all ExtData?

Not sure if @bena-nasa replied offline, but I usually do this by editing gcm_run.j to where the ExtData files are all concatenated and then after that adding:

echo "USE_EXTDATA: .FALSE." >> ExtData.rc

I do it this way because I often forget which RC file I edited 😄

pcolarco commented 2 years ago

@sdrabenh I have played with this a bit as we discussed Friday. Could you check you did your "no achem" test correctly? I believe the issue is not in GOCART2G but in ACHEM, which is exporting psoa_voc and psoa_voc_anth to GOCART. If you turn ACHEM off they are provided by ExtData (/dev/null) and I think things regress. When I run ACHEM but zero out the oxidants (specifically "OH" in GEOSachem_ExtData.rc) it also seems to regress. So I think the issue is in how OH is scaled inside of ACHEM, which is done with respect to the time of day. I don't really understand what is preserved in the "rewind" and I haven't been able to isolate further. I suppose my next try would be just to apply the OH as imported and not scale to see if that regresses.

mathomp4 commented 2 years ago

Oh man. I wonder if this is some weird ESMF_Alarm thing. I know @atrayano and @bena-nasa have had "fun" with ESMF_Alarms. Maybe when we come back from replay things aren't ringing...or are ringing and shouldn't be?

Or maybe something the OH scaling needs isn't being preserved in the import/internal state?

lltakacs commented 2 years ago

Just remember that all this works fine in the version prior to introducing GOCART-2G.

pcolarco commented 2 years ago

@lltakacs You were not running/testing the ACHEM VOC mechanism in the legacy GOCART. It's not the GOCART2G itself. Turn off ACHEM and it works fine. I'm back on the case this morning.

bena-nasa commented 2 years ago

@pcolarco @sdrabenh @lltakacs Does this code in ACHEM use any ESMF alarms? Please let me know; there are issues with ESMF alarms when you rewind the clock. In fact, we have a hack in the GCM grid comp that is supposed to work around all the bugs they have, but I've found that in certain cases even this hack does not work. The ESMF alarm code is in such a state that they are essentially rewriting it from scratch now, due to pervasive issues I've uncovered.

pcolarco commented 2 years ago

@bena-nasa Yes, there is an alarm used. I see two places where it is accessed inside of ACHEM. One, in each of the Initialize, Run, and Finalize routines it is accessed through a call to the internal "extract" routine, which as far as I can tell is using the "ringinterval" to return the time step size; there are other grid parameters accessed there too. Two, it is also used in Run explicitly to see if it is ringing... if not ringing then you just exit Run_, otherwise you turn the alarm off and keep it running.

ACHEM has only a single run method (not two phase) so it should be executed in the same block of the GEOS_ChemGridComp code where Run2 methods are executed (per the comments there). There is an alarm check in the GEOS_ChemGridComp->Run->Run2 routine that turns the alarm off as soon as you check the alarm is ringing. I'm not sure if it's the same alarm as what is in ACHEM.

My suspicion though is it is not the alarm, which other than maybe controlling timestep size and execution of the run method appears to not be relevant to anything else. Instead, what I am finding is this:

mathomp4 commented 2 years ago

@pcolarco One question: does it regress if use_diurnal_cycle is .false.?

That loop looks...weird. Is there a reason AChem uses a different sun than the rest of GEOS (i.e., MAPL_GetSunInsolation?)

sdrabenh commented 2 years ago

> Could you check you did your "no achem" test correctly? I believe the issue is not in GOCART2G but in ACHEM, which is exporting psoa_voc and psoa_voc_anth to GOCART. If you turn ACHEM off they are provided by ExtData (/dev/null) and I think things regress.

@pcolarco I re-ran both my scratch.amip and scratch.0inc-x46a.no_achem experiments to be sure. However, I do not get 0-diff restarts by changing: ENABLE_ACHEM: .FALSE.

Obviously, there is no achem checkpoint to compare, but I still see these differences after first time step:

cabr_internal_checkpoint.20150509_2107z.nc4
               Date     Time   Level Gridsize    Miss    Diff : S Z  Max_Absdiff Max_Reldiff : Parameter name
    72 : 2015-05-09 21:07:30      72    13824       0     535 : F T   3.3893e-12    0.024081 : CAphilicCA.br
  1 of 144 records differ
  0 of 144 records differ more than 0.001
cdo    diffn: Processed 3981312 values from 4 variables over 2 timesteps [0.02s 18MB].

caoc_internal_checkpoint.20150509_2107z.nc4
               Date     Time   Level Gridsize    Miss    Diff : S Z  Max_Absdiff Max_Reldiff : Parameter name
    72 : 2015-05-09 21:07:30      72    13824       0    6763 : F T   1.0608e-11     0.73855 : CAphilicCA.oc
  1 of 144 records differ
  0 of 144 records differ more than 0.001
cdo    diffn: Processed 3981312 values from 4 variables over 2 timesteps [0.03s 18MB].

For me turning off achem alone does not allow successful regression

bena-nasa commented 2 years ago

@pcolarco @sdrabenh Do we have any reason to suspect the SU gridcomp in GOCART2G? I see that it creates an alarm with an interval of 3 hours that gets used in some conditional. It was in fact a 3-hourly alarm in ExtData that first exposed that the "hack" we do in gcmgridcomp did not work in all cases.

sdrabenh commented 2 years ago

> @pcolarco @sdrabenh Do we have any reason to suspect the SU gridcomp in GOCART2G? I see that it creates an alarm with an interval of 3 hours that gets used in some conditional. It was in fact a 3-hourly alarm in ExtData that first exposed that the "hack" we do in gcmgridcomp did not work in all cases.

Not sure if SU would affect the BR and OC carbon first, but a 3-hourly alarm sounds problematic with a 7200 sec replay

pcolarco commented 2 years ago

@sdrabenh Please have a look at /discover/nobackup/pcolarco/c48F_v10p22p1_regress I did two pairs of tests, a baseline (amip, 0inc-x46a) and a "no achem" test (amip-no_achem, 0inc-x46a-no_achem) and saved the scratch directories similar to your approach. I stand by the results I have reported throughout based on runs with a slightly older tag (10.21.1 with GOCART2G in place): baseline reports the issue you are finding, no_achem does not

Whatever @bena-nasa thinks might be problematic about the SU alarm is not evident to me. It would've shown up in the no_achem case if it was problematic.

Everything I reported earlier is still saved in /discover/nobackup/pcolarco/c48F_v10p21p1_regress, where I turned things on and off. The problem is somewhere in the evaluation of the OH field values inside ACHEM not regressing properly. I don't know what the problem is and am out of time today to work on it further. I will raise the issue of the solar_zenith_angle calls with others in the aerosol group. My understanding is what Anton implemented in ACHEM was pretty much a copy of what's in GOCART. Is it ideal that it is not what is used elsewhere in the model? Probably not, but it maybe predates what is used elsewhere. But since it seems to work in sulfate I'm not sure that's the root of the problem.

adarmenov commented 2 years ago

> @pcolarco One question: does it regress if use_diurnal_cycle is .false.?

Good suggestion, Matt.

pcolarco commented 2 years ago

@mathomp4 @adarmenov No, that does not affect the outcome. It still does not regress in that case. See: /discover/nobackup/pcolarco/c48F_v10p22p1_regress/scratch.amip-no_diurnal and scratch.0inc-x46a-no_diurnal

pcolarco commented 2 years ago

And who is introducing the EMISSIONS = AMIP and overwrite all my RC files into the gcm_run.j script? I thought we were not going to do that!

mathomp4 commented 2 years ago

> And who is introducing the EMISSIONS = AMIP and overwrite all my RC files into the gcm_run.j script? I thought we were not going to do that!

@pcolarco Nothing should be doing that. We do have code in gcm_run.j to alter the RC files for AMIP.20C, but if you are using the OPS emissions, then that code should be avoided.

pcolarco commented 2 years ago

> > And who is introducing the EMISSIONS = AMIP and overwrite all my RC files into the gcm_run.j script? I thought we were not going to do that!

> @pcolarco Nothing should be doing that. We do have code in gcm_run.h to alter the RC files for AMIP.20C, but if you are using the OPS emissions, then that code should be avoided.

@mathomp4 I will raise this with @amdasilva. It is very common in the aerosol group (and CCM group, etc.) to run AMIP as opposed to OPS emissions. None of us want or benefit from this feature and it is a perpetual stumbling block (or was when it was MERRA-2)

mathomp4 commented 2 years ago

@pcolarco Running with AMIP should work. I mean, it's how I run all the time. (Well, I run AMIP.20C more often just because of when my standard set of restarts are for.)

The only time emissions RC files will "change" in the scratch dir is if you are in the AMIP.20C time frame.

In future ExtData2G, this will not happen as @bena-nasa has figured out ways to make the new resource yaml files indicate when to change from one set of emissions to another.

pcolarco commented 2 years ago

> @pcolarco Running with AMIP should work. I mean, it's how I run all the time. (Well, I run AMIP.20C more often just because of when my standard set of restarts are for.)

> The only time emissions RC files will "change" in the scratch dir is if you are in the AMIP.20C time frame.

> In future ExtData2G, this will not happen as @bena-nasa has figured out ways to make the new resource yaml files indicate when to change from one set of emissions to another.

@mathomp4 Appreciate that change is coming (have heard of what Ben is doing, don't know the details). The current behavior is problematic for tests like the ones I was doing, where I wanted to change the default configurations (i.e., turn ACHEM off). My read of gcm_run.j is that regardless of my time period (granted, within the modern era) my RC is being overwritten by whatever is in the source, not my local RC. That's the problem. My hack is to edit gcm_run.j and change EMISSIONS to something else, like XXXX. This solves my issue, but someone has to be reminded to do that; i.e., I have to stumble over confusing results, then isolate that my RC files were not what was expected, slap my forehead that it's like the old MERRA-2, and edit accordingly.

sdrabenh commented 2 years ago

> @mathomp4 I will raise this with @amdasilva. It is very common in the aerosol group (and CCM group, etc.) to run AMIP as opposed to OPS emissions. None of us want or benefit from this feature and it is a perpetual stumbling block (or was when it was MERRA-2)

@pcolarco this was an issue that needed to be addressed in the beta rollout of GOCART2G. @mathomp4 @amdasilva and myself implemented this workaround until ExtData2G is implemented. Obviously, we must support OPS style emissions for FP and FPP. We also must be able to run the model prior to Y2000 when OPS emissions do not exist. Since the emission sources are discontinuous throughout this span, this seemed like the easiest solution. I'm sure you recall extending all these emissions back to Y1979 for AMIP.20C emissions.

pcolarco commented 2 years ago

@sdrabenh I'm just being a PITA, I know. I'm going to circle back to the actual issue on this thread...

pcolarco commented 2 years ago

I think my testing has isolated that the issue is the filling of the OH field. I don't understand how this might be impacted in replay/rewind versus amip-style. I template what I think is going on in the Run method of ACHEM which is called every timestep.

My presumption then is that the issue is either in filling q_OH or the preservation of the qOH across the rewind. Does MAPL_GetPointer(import...) get called both in the advancing and after rewind with the same time information for the interpolation?

bena-nasa commented 2 years ago

@pcolarco ah see q_OH comes from ExtData, that should be updated to the current time when we rewind, if that weren't working, nothing would be passing regression.

bena-nasa commented 2 years ago

@pcolarco can you please elaborate what you mean by this: "specification of the OH value in the "use it" line does lead to proper regression."

bena-nasa commented 2 years ago

@sdrabenh @pcolarco well, this is weird. I put some prints in achem to print q_oh and qoh, and from what I see it does appear that after the rewind back to 21z, the q_OH, which I think is coming from ExtData, is not getting set back to the original value it had in the first execution of the predictor step. Even more confusing, at the 21:15z step they are back to matching! I'm very confused. If ExtData were really broken like this, nothing would ever pass the 0-increment regressions. I'll dig into this more, as it does seem like it may be ExtData related.

pcolarco commented 2 years ago

@bena-nasa Thank you, Ben. The only thing I can add that might help is that the import

is actually in two places in the code. In the first it is inside the logic of "if(self%mam_chem)", which is not the configuration I am running. In the second place it is in the logic of "if(self%voc_chem)", which is what I am running. In the second case there is additionally an "if(.not. associated(q_OH))" check before importing it. The logic all makes sense to me, and I did try commenting out the associated check, to no avail.

bena-nasa commented 2 years ago

@pcolarco @sdrabenh thanks to help from @lltakacs I think we figured out what is going wrong. For reasons I don't understand (probably something we did to MAPL unintentionally), the analysis checkpoints created on disk when you do replay with more than 1 analysis per segment are getting all the variables from the agcm import state written, not just the variables being analyzed. The full import state of agcm includes all the variables going to ExtData. So when we run replay, it looks like these are being read from the analysis file in agcm and overwriting the values from ExtData.

At least now I can see the problem, hopefully solution will be quick now.

bena-nasa commented 2 years ago

@pcolarco @sdrabenh @lltakacs So I think the problem was in AChem after all. None of the imports in ACHEM that come from ExtData have RESTART = MAPL_RestartSkip set, which is what prevents an import from being written to the restart. That would explain why they were all appearing on the checkpoints created during replay. I'll add this and see if it fixes the problem.
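The mechanism described in the last two comments can be illustrated with a toy model. This is not MAPL code: `ImportSpec`, `restart_skip`, `extdata_value` and the order of operations are assumptions made for illustration, based only on the description in this thread. It shows how a checkpoint that includes an ExtData-provided import can overwrite the freshly interpolated value after a rewind.

```python
from dataclasses import dataclass

@dataclass
class ImportSpec:
    name: str
    from_extdata: bool
    restart_skip: bool = False  # toy stand-in for RESTART = MAPL_RestartSkip

def extdata_value(t_hours):
    """Hypothetical time-varying field, standing in for what ExtData serves."""
    return 100.0 + t_hours

def refresh_from_extdata(specs, state, t_hours):
    """Fill every ExtData-provided import with its value for the given time."""
    for spec in specs:
        if spec.from_extdata:
            state[spec.name] = extdata_value(t_hours)

def write_checkpoint(specs, state):
    """Save every import that is not flagged to be skipped."""
    return {s.name: state[s.name] for s in specs if not s.restart_skip}

def q_oh_after_rewind(specs, t_start, t_checkpoint):
    state = {"T": 250.0}  # an analyzed variable that belongs in the checkpoint
    # Predictor: the import state is checkpointed at a later time in the window.
    refresh_from_extdata(specs, state, t_checkpoint)
    checkpoint = write_checkpoint(specs, state)
    # Rewind: ExtData supplies the values valid at the start time ...
    refresh_from_extdata(specs, state, t_start)
    # ... and the checkpoint is read back, overwriting whatever it contains.
    state.update(checkpoint)
    return state["q_OH"]

def make_specs(skip):
    return [
        ImportSpec("T", from_extdata=False),
        ImportSpec("q_OH", from_extdata=True, restart_skip=skip),
    ]

control = extdata_value(0)  # what a run without replay sees at the start time
buggy = q_oh_after_rewind(make_specs(skip=False), t_start=0, t_checkpoint=2)
fixed = q_oh_after_rewind(make_specs(skip=True), t_start=0, t_checkpoint=2)
print(control, buggy, fixed)
```

In this toy model the unflagged import comes back with the stale checkpoint value, while the flagged one keeps the value ExtData supplied for the rewound time and so matches the control.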

bena-nasa commented 2 years ago

@pcolarco @amdasilva @mathomp4 @sdrabenh @lltakacs

So I've finally fixed this all up. There will be multiple PRs coming, since both ACHEM and GOCART2G have to be fixed. AChem was missing RESTART=MAPL_RestartSkip in the MAPL_AddImportSpec calls for variables coming through ExtData, which was causing the regression failures; since achem was not on in the previous tags, this was not noticed.

GOCART2G was also missing this, but since those imports are defined via the ACG, I had to update the StateSpec.rc files for all the components rather than the .F90 files, making each line in those .rc files longer and more unwieldy for people who don't have really wide screens. But hey, at least the ACG supported the "RESTART" option, so I don't have to make 3 PRs to fix this :) . It was just luck, based on the frequency of updates defined in the GOCART2G ExtData files, that this didn't cause a similar problem to the one seen in AChem.

GOCART1G, which is still being run for a few species was in some schizophrenic state where the imports (but not the exports, those were via the acg) were defined in the .F90's and those had RestartSkip so we were always good there.

Apparently HEMCO and TR which are also being run by default are also doing the right thing as no spurious variables were ending up in the agcm_checkpoints in replay.

bena-nasa commented 2 years ago

@adarmenov @amdasilva We will have to make this a hotfix in GOCART since this is a proper bug. Then get this back to develop whenever it works again ...