GEOS-ESM / MAPL

MAPL is a foundation layer of the GEOS architecture, whose original purpose is to supplement the Earth System Modeling Framework (ESMF)
https://geos-esm.github.io/MAPL/
Apache License 2.0

ESMF alarms in ExtData broken during replay. #470

Open bena-nasa opened 4 years ago

bena-nasa commented 4 years ago

Documenting this here in case others have any insights. I really want to just use alarms in ExtData, but something does not work with replay, even though a simple standalone code behaves as I expect.

I ran the following experiment in a simple code. I create an ESMF_Clock with a start_time and a dt. For concreteness, let's say the start time is some day at 21z and dt = 15 minutes. I create an ESMF_Alarm from the clock with ringtime=start_time, ringinterval=4*dt, and sticky=false. I then start advancing the clock, checking whether the alarm is ringing before each advance. The alarm is reported as ringing at 21z, 22z, 23z, 00z, etc. I then rewind the clock to 21z and advance it again, checking whether the alarm is ringing before each advance. Again the alarm is reported as ringing at 21z, 22z, 23z, 00z, etc. This is what I would expect, or at least the behavior I want.
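For readers without an ESMF build at hand, the expected behavior in that experiment can be mimicked with a toy model in plain Python. This is only a sketch under stated assumptions: `ToyAlarm` and `ring_times` are hypothetical names, the alarm is a stateless stand-in that rings on every multiple of the ring interval, and nothing here calls ESMF or reflects how ESMF_Alarm is implemented.

```python
from datetime import datetime, timedelta


class ToyAlarm:
    """Hypothetical stand-in for a non-sticky alarm (not ESMF_Alarm).

    It rings whenever the time elapsed since ring_time is a
    non-negative whole multiple of ring_interval.
    """

    def __init__(self, ring_time, ring_interval):
        self.ring_time = ring_time
        self.ring_interval = ring_interval

    def is_ringing(self, now):
        elapsed = now - self.ring_time
        on_interval = elapsed % self.ring_interval == timedelta(0)
        return elapsed >= timedelta(0) and on_interval


def ring_times(alarm, start, dt, n_steps):
    """Check the alarm before each advance and record when it rings."""
    now = start
    rings = []
    for _ in range(n_steps):
        if alarm.is_ringing(now):
            rings.append(now.strftime("%Hz"))
        now += dt
    return rings


start = datetime(2000, 1, 1, 21)  # arbitrary day, 21z
dt = timedelta(minutes=15)
alarm = ToyAlarm(ring_time=start, ring_interval=4 * dt)

first_pass = ring_times(alarm, start, dt, 16)   # 21:00 through 00:45
second_pass = ring_times(alarm, start, dt, 16)  # "rewind" to 21z, run again
print(first_pass, second_pass)
```

Both passes report 21z, 22z, 23z, 00z, which is the behavior described above as the desired one. Because the toy alarm carries no internal state, a rewind cannot confuse it, so it serves only as a reference for what "correct" looks like.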

Now in the model we do the same thing with replay.

Let's say I start at 21z with a dt of 15 minutes. In ExtData, I create an alarm just like in the example above. When doing replay, this is what happens. We start the model at 21z, then immediately go into the predictor loop in gcm. At the start of the loop we save the ringtime and ringing state of every alarm in the clock. Then we run ExtData and agcm for 3 hours. I see the alarm in ExtData ringing at 21z, 22z, 23z, so far so good. When the clock hits 0z we rewind, then restore the ringtime and ringing state we saved at the start for each alarm in the clock. Now ExtData runs to "reset" itself with the clock at 21z, and the alarm rings in ExtData, so that seems right.

Now the bad thing happens: we leave gcm and go back to cap. The clock ticks to 21:15, history runs, we go back to the top of the loop, and ExtData runs with the clock at 21:15. At 21:15 the alarm is ringing again! At 21:30 and 21:45 it is not ringing, and it starts behaving normally again, ringing at 22z, 23z, 00z, 1z, 2z, 3z. Then we do the replay cycle again and things repeat like this.
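The save, rewind, and restore sequence described above can be sketched with a stateful toy alarm in plain Python. Everything here is an assumption made for illustration: `ToyClock`, `ToyAlarm`, and `run_segment` are hypothetical names, the clock is rewound by assigning its time directly, and the cap/gcm/ExtData call structure is collapsed into a single loop. It does not use ESMF and does not reproduce the bug; it only shows the ring times one would expect after a restore.

```python
from datetime import datetime, timedelta


class ToyClock:
    """Hypothetical clock that updates its alarms on every advance."""

    def __init__(self, start, dt):
        self.now = start
        self.dt = dt
        self.alarms = []

    def advance(self):
        self.now += self.dt
        for alarm in self.alarms:
            alarm.update(self.now)


class ToyAlarm:
    """Hypothetical non-sticky alarm with saved state (not ESMF_Alarm)."""

    def __init__(self, clock, ring_time, ring_interval):
        self.ring_time = ring_time          # next time the alarm should ring
        self.ring_interval = ring_interval
        self.ringing = clock.now == ring_time
        clock.alarms.append(self)

    def update(self, now):
        # After ringing once, schedule the next ring one interval later.
        if self.ringing:
            self.ring_time += self.ring_interval
        self.ringing = now == self.ring_time


def run_segment(clock, alarm, n_steps, log):
    """Check the alarm, then advance the clock, n_steps times."""
    for _ in range(n_steps):
        if alarm.ringing:
            log.append(clock.now.strftime("%H:%M"))
        clock.advance()


start = datetime(2000, 1, 1, 21)  # arbitrary day, 21z
clock = ToyClock(start, timedelta(minutes=15))
alarm = ToyAlarm(clock, ring_time=start, ring_interval=timedelta(hours=1))

# Top of the predictor loop: save the alarm state.
saved = (alarm.ring_time, alarm.ringing)

predictor = []
run_segment(clock, alarm, 12, predictor)  # 21:00 to 23:45; clock ends at 00:00

# Rewind the clock and restore the saved alarm state.
clock.now = start
alarm.ring_time, alarm.ringing = saved

after_rewind = []
run_segment(clock, alarm, 16, after_rewind)  # 21:00 to 00:45
print(predictor, after_rewind)
```

In this toy, the predictor pass rings at 21:00, 22:00, and 23:00, and the pass after the rewind rings at 21:00, 22:00, 23:00, and 00:00, with no ring at 21:15. The extra ring at 21:15 reported above is therefore something this simple state model does not account for.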

My first thought was that perhaps this resetting of the alarm is what causes the weirdness. After all, my simple test, where I did not do this, worked in what seemed like the logical manner. So I named the alarm in ExtData and skipped the reset in gcm for that one alarm. But now things get even more screwed up. During the predictor phase the alarm rings at 21z, 22z, 23z, so far so good. But when I rewind and we are back in the normal cap loop, the alarm in ExtData does not ring at 21z, 22z, or 23z. It only starts ringing at 0z, 1z, 2z, 3z, and then we start the cycle again.

I'm baffled. Is it something subtle in how the clock is created in cap? Or is it this business we have to do where the gc of ExtData is stuffed into the internal state of the gcm grid comp so it can invoke ExtData?

mathomp4 commented 4 years ago

Ouch. My head hurts reading that. Does any of this have to do with those Sticky Alarms? I never figured those out.

weiyuan-jiang commented 4 years ago

Do the ExtData and GCM run on the same set of processes? It seems to me the alarm is saved as a global object but may not collectively advance.

bena-nasa commented 4 years ago

A further observation from the last experiment I did. I create an alarm in the gcm gridded component each time it is run. The ringtime is the current time of the clock, and I give it a ring interval and make it sticky=.false. Once again the alarm does not behave correctly when rewinding: the alarm I created attached to the clock does not ring after the rewind the way it does in my standalone code. However, if I create a copy of the clock in the run method and use that for my local alarm, the alarm does ring properly when rewinding. I also see that ESMF_GridCompRun says the clock passed in will be treated as read-only by the child component. I'm wondering if there is something about the way the clock is being passed around that causes this strange behavior. I will try setting a gridcomp clock in ExtData.

mathomp4 commented 3 years ago

@bena-nasa Is this all tied up in the "Alarms are broken in ESMF" issue?

mathomp4 commented 3 years ago

Reexamined on 2021-Apr-12 with @bena-nasa and @tclune

@atrayano Is this also seen in #796?

stale[bot] commented 3 years ago

This issue has been automatically marked as stale because it has not had recent activity. If there are no updates within 7 days, it will be closed. You can add the "long term" tag to prevent the Stale bot from closing this issue.

mathomp4 commented 3 years ago

Pinging @bena-nasa and @atrayano : Is this still an issue? Or should it be fixed with that "Alarms in ESMF" branch the ESMF folks were having @bena-nasa test?

bena-nasa commented 3 years ago

I do not know, will have to check

stale[bot] commented 2 years ago

This issue has been automatically marked as stale because it has not had recent activity. If there are no updates within 7 days, it will be closed. You can add the "long term" tag to prevent the Stale bot from closing this issue.

stale[bot] commented 2 years ago

Closing due to inactivity

mathomp4 commented 2 years ago

I'm going to re-open this as well. I'd like a yes/no from @bena-nasa or @atrayano before we close this.

stale[bot] commented 2 years ago

This issue has been automatically marked as stale because it has not had recent activity. If there are no updates within 7 days, it will be closed. You can add the "long term" tag to prevent the Stale bot from closing this issue.

mathomp4 commented 2 years ago

@bena-nasa @atrayano is this still a thing? Or has ESMF 8.2.0 fixed it?

atrayano commented 2 years ago

I am almost sure this is still an issue and not quite right (at the very least, I have not heard otherwise).

rsdunlapiv commented 2 years ago

ESMF alarms are being reworked and there was a meeting this morning that included @bena-nasa to discuss how the ESMF alarms API can be implemented to be semantically unambiguous. @feiliuesmf is leading the effort.

mathomp4 commented 2 years ago

@rsdunlapiv and @atrayano Thanks. I'll mark as long-term so the Stale Bot stays away. 😄

bena-nasa commented 2 years ago

I think the rewrite of the ESMF alarms done by ESMF will fix this in the long run.