GEOS-ESM / MAPL

MAPL is a foundation layer of the GEOS architecture, whose original purpose is to supplement the Earth System Modeling Framework (ESMF)
https://geos-esm.github.io/MAPL/
Apache License 2.0

No overwriting of nc4 files! #2638

Closed rtodling closed 1 month ago

rtodling commented 8 months ago

At some point in the past a change was made so that MAPL (or CFIO) would crash if the model tried to overwrite an output.

I remember that when I first stumbled on this, I wasn't so convinced of the need for it. In response to that, a knob was put in (or the default was changed) to allow for overwrite.

I believe the latest version of MAPL now has it such that overwrite causes the model to crash. Can we revisit this again please? This is a very inconvenient feature, especially for debugging purposes.

Is there perhaps a flag I can add to AGCM.rc or HISTORY to tell MAPL/CFIO not to bother?

tclune commented 8 months ago

Hi Ricardo. Our GCHM (GEOS-CHEM) colleagues have asked for something similar. I thought we had enabled this, but have forgotten the details. I know there is a switch at the low levels, but I do not know how/if it propagates from History.rc.

In the worst case a kludge to the interface should not be too hard.

I'm hoping that by "crash" you mean that the model trapped the exception and gave an informative message that it would not overwrite the file? For me that is level-0 and very high priority.

tclune commented 8 months ago

@rtodling See #2391. We have a global setting in History that allows noclobber. It is an open issue to do it per-collection.

Currently, by setting Allow_Overwrite: .true. you can allow every collection to clobber.

Please let us know if you need this per-collection, in which case we can raise the priority of the other ticket. Closing this one.

mathomp4 commented 8 months ago

Indeed, as @tclune says, you'd add to the top of history where the other global variables are:

VERSION: 1
EXPID:  f5295_fp
EXPDSC: f5295_fp__GEOSadas-5_29_5__agrid_C720__ogrid_C
EXPSRC: GEOSadas-5_29_5
Allow_Overwrite: .true.

that should allow overwriting of history. But note that it is global, so every collection will be allowed to overwrite.
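As a quick sanity check before a run, the global header can be scanned for the setting. The following is a minimal sketch only: it is not MAPL's resource-file parser, the function name `allow_overwrite_enabled` is hypothetical, and it assumes `#` starts a comment, that key matching is case-insensitive, and that a missing key means overwriting is not allowed.

```python
def allow_overwrite_enabled(rc_text: str) -> bool:
    """Return True if a global 'Allow_Overwrite: .true.' line is present.

    Hypothetical helper for illustration; not part of MAPL.
    """
    for raw in rc_text.splitlines():
        # Assumption: '#' begins a comment in the resource file.
        line = raw.split("#", 1)[0].strip()
        if not line or ":" not in line:
            continue
        key, value = line.split(":", 1)
        # Assumption: the key is matched case-insensitively.
        if key.strip().lower() == "allow_overwrite":
            return value.strip().lower() == ".true."
    # Assumption: when the key is absent, overwriting is not allowed.
    return False


sample_header = """VERSION: 1
EXPID:  f5295_fp
EXPSRC: GEOSadas-5_29_5
Allow_Overwrite: .true.
"""
print(allow_overwrite_enabled(sample_header))
```

Because the setting is global, a check like this says nothing about individual collections.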

rtodling commented 8 months ago

Hi Guys, thanks for the reply on this. I will add the opt to the history. Many thanks.

Ricardo

mathomp4 commented 7 months ago

@rtodling Warning. This might not be working. I'm reopening this issue.

mathomp4 commented 7 months ago

@rtodling Can you tell us what version of MAPL you are using? We might need to go back in time and patch this once we know the fix.

tclune commented 7 months ago

Looks like the global option is currently broken. NOAA noticed a problem ...

tclune commented 7 months ago

@lizziel We found that this capability is broken. Those with better memory than me assert that it did work at one time. Raising the priority - might be a 1st where we have 3 different "customers" complaining about the same thing.

junwang-noaa commented 7 months ago

@tclune, @weiyuan-jiang, here are some details. The error message we got from the UFS weather model:

  0: pe=00000 FAIL at line=00187    NetCDF4_FileFormatter.F90                <status=13>
  0: pe=00000 FAIL at line=00062    HistoryCollection.F90                    <status=13>
  0: pe=00000 FAIL at line=00811    ServerThread.F90                         <status=13>
  0: pe=00000 FAIL at line=00138    BaseServer.F90                           <status=13>
  0: pe=00000 FAIL at line=01002    ServerThread.F90                         <status=13>
  0: pe=00000 FAIL at line=00097    MessageVisitor.F90                       <status=13>
  0: pe=00000 FAIL at line=00115    AbstractMessage.F90                      <status=13>
  0: pe=00000 FAIL at line=00107    SimpleSocket.F90                         <status=13>
  0: pe=00000 FAIL at line=00449    ClientThread.F90                         <status=13>
  0: pe=00000 FAIL at line=00399    ClientManager.F90                        <status=13>
  0: pe=00000 FAIL at line=03560    MAPL_HistoryGridComp.F90                 <status=13>
  0: pe=00000 FAIL at line=01901    MAPL_Generic.F90                         <status=13>
  0: pe=00000 FAIL at line=01291    MAPL_CapGridComp.F90                     <status=13>
  0: pe=00000 FAIL at line=01220    MAPL_CapGridComp.F90                     <status=13>
  0: pe=00000 FAIL at line=01166    MAPL_CapGridComp.F90                     <status=13>
  0: pe=00000 FAIL at line=00834    MAPL_CapGridComp.F90                     <status=13>
  0: pe=00000 FAIL at line=00974    MAPL_CapGridComp.F90                     <status=13>

Please let me know if you want to reproduce the case. So far, the atmosphere can write out files through symbolic links. Before the run:

[Jun.Wang@hfe03 atmaero_control_p8_intel_t1]$ ls -l sfcf000.nc
lrwxrwxrwx 1 Jun.Wang stmp 17 Mar 15 16:45 sfcf000.nc -> output/sfcf000.nc
[Jun.Wang@hfe03 atmaero_control_p8_intel_t1]$ ls -l output/sfcf000.nc
ls: cannot access output/sfcf000.nc: No such file or directory
[Jun.Wang@hfe03 atmaero_control_p8_intel_t1]$ ls -l gocart.inst_aod.20210322_1200z.nc4
lrwxrwxrwx 1 Jun.Wang stmp 41 Mar 15 16:44 gocart.inst_aod.20210322_1200z.nc4 -> output/gocart.inst_aod.20210322_1200z.nc4
[Jun.Wang@hfe03 atmaero_control_p8_intel_t1]$ ls -l output/gocart.inst_aod.20210322_1200z.nc4
ls: cannot access output/gocart.inst_aod.20210322_1200z.nc4: No such file or directory

Then I saw the error message when running the test, and the following in the run directory:

[Jun.Wang@hfe03 atmaero_control_p8_intel_t1]$ ls -l sfcf000.nc
lrwxrwxrwx 1 Jun.Wang stmp 17 Mar 15 16:45 sfcf000.nc -> output/sfcf000.nc
[Jun.Wang@hfe03 atmaero_control_p8_intel_t1]$ ls -l output/sfcf000.nc
-rw-r--r-- 1 Jun.Wang stmp 85452865 Mar 15 17:49 output/sfcf000.nc
[Jun.Wang@hfe03 atmaero_control_p8_intel_t1]$ ls -l gocart.inst_aod.20210322_1200z.nc4
lrwxrwxrwx 1 Jun.Wang stmp 41 Mar 15 16:44 gocart.inst_aod.20210322_1200z.nc4 -> output/gocart.inst_aod.20210322_1200z.nc4
[Jun.Wang@hfe03 atmaero_control_p8_intel_t1]$ ls -l output/gocart.inst_aod.20210322_1200z.nc4
ls: cannot access output/gocart.inst_aod.20210322_1200z.nc4: No such file or directory

mathomp4 commented 7 months ago

Actually, now that I think about it, I think @rtodling is safe. The oddity is occurring because of the broken-symlink style. We are pondering this...
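For anyone checking their own run directory, the three states visible in the `ls -l` listings above (link with no target, link with a target, nothing at all) can be told apart with the standard library. This is a small illustrative sketch; the helper name `classify_output_path` is hypothetical and it says nothing about how MAPL itself inspects the path.

```python
import os


def classify_output_path(path: str) -> str:
    """Classify an output path as 'broken-symlink', 'exists', or 'missing'."""
    # islink() looks at the link itself; exists() follows it to the target,
    # so a link whose target is absent is a link that does not "exist".
    if os.path.islink(path) and not os.path.exists(path):
        return "broken-symlink"
    if os.path.exists(path):
        return "exists"
    return "missing"
```

In the listing before the run, both `sfcf000.nc` and `gocart.inst_aod.20210322_1200z.nc4` would be classified as `broken-symlink`, since their targets under `output/` had not been created yet.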

mathomp4 commented 7 months ago

Related issue: https://github.com/GEOS-ESM/MAPL/issues/1620

tclune commented 7 months ago

@junwang-noaa We can only replicate that particular error when the symlink itself is broken. But ... there is a different problem that you will hit once you fix that one.

The history option Allow_Overwrite does not currently propagate to the server side, and fixing that is more subtle than you might have thought. We have scenarios in which a previous segment of a simulation has already written a time slice to a history output file, and then the file needs to be appended to rather than overwritten.

This late on Friday this is making my head hurt. On Monday I will work with @bena-nasa to diagram the various cases, what should happen and how to even detect when it should clobber vs append. Sigh.
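To make the clobber-versus-append question concrete, here is one possible decision table written out as code. It is purely a sketch under stated assumptions, not MAPL's logic: the names `Action` and `choose_action` are hypothetical, and the hard part described above (detecting that an earlier segment already wrote to the file) is simply taken as an input flag.

```python
import os
from enum import Enum


class Action(Enum):
    CREATE = "create"    # no file yet: write a new one
    APPEND = "append"    # earlier segment wrote time slices: add to them
    CLOBBER = "clobber"  # stale file and overwriting is allowed
    FAIL = "fail"        # stale file and overwriting is not allowed


def choose_action(path: str, allow_overwrite: bool,
                  written_by_earlier_segment: bool) -> Action:
    """Pick what to do with a history output file (hypothetical rules)."""
    if not os.path.exists(path):
        return Action.CREATE
    # Assumption: appending takes precedence over the overwrite setting,
    # so that a continued simulation never discards its own time slices.
    if written_by_earlier_segment:
        return Action.APPEND
    if allow_overwrite:
        return Action.CLOBBER
    return Action.FAIL
```

Note that `os.path.exists` follows symlinks, so under these rules a broken symlink is treated the same as a missing file.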

bena-nasa commented 7 months ago

All, I made a new issue which summarizes what is going on in much more detail: https://github.com/GEOS-ESM/MAPL/issues/2653

github-actions[bot] commented 1 month ago

This issue has been automatically marked as stale because it has not had activity in the last 60 days. If there are no updates within 7 days, it will be closed. You can add the ":hourglass: Long Term" label to prevent the stale action from closing this issue.

mathomp4 commented 1 month ago

Closing in favor of #2653 (which might be fixed? → @bena-nasa )