GEOS-ESM / MAPL

MAPL is a foundation layer of the GEOS architecture, whose original purpose is to supplement the Earth System Modeling Framework (ESMF).
https://geos-esm.github.io/MAPL/
Apache License 2.0

Failure with zero-duration History collection with MAPL develop #1634

Closed: mathomp4 closed this issue 2 years ago

mathomp4 commented 2 years ago

Update: it's not MOM6 itself, it's its History output. I can duplicate this with the AGCM as well. See below.


It looks like the code in #1631 has broken MOM6 History output. My c12 runs last night failed with MAPL develop (and with MAPL3, since it tracks develop). The traceback is:

 Writing:     10 Slices to File:  1day.geosgcm_iceflx.20000416_1200z.nc4
 Writing:     18 Slices to File:  1day.geosgcm_icethm.20000416_1200z.nc4
 Writing:     25 Slices to File:  1day.geosgcm_aoflux.20000416_1200z.nc4
pe=00003 FAIL at line=00171    NetCDF4_FileFormatter.F90                <status=-35>
pe=00003 FAIL at line=00078    HistoryCollection.F90                    <status=-35>
pe=00003 FAIL at line=00790    ServerThread.F90                         <status=-35>
pe=00003 FAIL at line=00138    BaseServer.F90                           <status=-35>
pe=00003 FAIL at line=00981    ServerThread.F90                         <status=-35>
pe=00003 FAIL at line=00094    MessageVisitor.F90                       <status=-35>
pe=00003 FAIL at line=00113    AbstractMessage.F90                      <status=-35>
pe=00003 FAIL at line=00107    SimpleSocket.F90                         <status=-35>
pe=00003 FAIL at line=00429    ClientThread.F90                         <status=-35>
pe=00003 FAIL at line=00363    ClientManager.F90                        <status=-35>
pe=00003 FAIL at line=03751    MAPL_HistoryGridComp.F90                 <status=-35>
pe=00003 FAIL at line=01807    MAPL_Generic.F90                         <status=-35>
pe=00003 FAIL at line=01282    MAPL_CapGridComp.F90                     <status=-35>
pe=00003 FAIL at line=01211    MAPL_CapGridComp.F90                     <status=-35>
pe=00003 FAIL at line=01157    MAPL_CapGridComp.F90                     <status=-35>
pe=00003 FAIL at line=00825    MAPL_CapGridComp.F90                     <status=-35>
pe=00003 FAIL at line=00965    MAPL_CapGridComp.F90                     <status=-35>
pe=00003 FAIL at line=00301    MAPL_Cap.F90                             <status=-35>
pe=00003 FAIL at line=00258    MAPL_Cap.F90                             <status=-35>
pe=00003 FAIL at line=00192    MAPL_Cap.F90                             <status=-35>
pe=00003 FAIL at line=00169    MAPL_Cap.F90                             <status=-35>
pe=00003 FAIL at line=00034    GEOSgcm.F90                              <status=-35>

I have a theory and I'll test it now before handing this off to @bena-nasa and @weiyuan-jiang.

mathomp4 commented 2 years ago

Okay, I think the reason we saw this in MOM6 is that it has a type of collection the AGCM doesn't: one with a duration attribute.

So, I took the AGCM and edited HISTORY.rc so that the geosgcm_ocn collection has:

  geosgcm_ocn.duration:         000000,

aka "append to file, not make new file".

And when I do that I get:

 Writing:     46 Slices to File:  mapldev-2022Aug12-1day-c24.geosgcm_ocn.20000414_2130z.nc4

 AGCM Date: 2000/04/14  Time: 22:00:00  Throughput(days/day)[Avg Tot Run]:        928.9        287.0        307.7  TimeRemaining(Est) 000:01:29   30.7% :  26.6% Mem Comm:Used
 AGCM Date: 2000/04/14  Time: 22:15:00  Throughput(days/day)[Avg Tot Run]:       1065.0       2577.8       2649.3  TimeRemaining(Est) 000:01:16   30.7% :  26.5% Mem Comm:Used
 AGCM Date: 2000/04/14  Time: 22:30:00  Throughput(days/day)[Avg Tot Run]:       1175.8       2456.0       2521.9  TimeRemaining(Est) 000:01:08   30.7% :  26.6% Mem Comm:Used
 AGCM Date: 2000/04/14  Time: 22:45:00  Throughput(days/day)[Avg Tot Run]:       1274.9       2582.9       2655.3  TimeRemaining(Est) 000:01:02   30.7% :  26.5% Mem Comm:Used

 Writing:     46 Slices to File:  mapldev-2022Aug12-1day-c24.geosgcm_ocn.20000414_2130z.nc4

 AGCM Date: 2000/04/14  Time: 23:00:00  Throughput(days/day)[Avg Tot Run]:       1259.5       1162.5       1450.1  TimeRemaining(Est) 000:01:02   30.7% :  26.6% Mem Comm:Used
 AGCM Date: 2000/04/14  Time: 23:15:00  Throughput(days/day)[Avg Tot Run]:       1335.4       2583.6       2657.9  TimeRemaining(Est) 000:00:58   30.7% :  26.5% Mem Comm:Used
 AGCM Date: 2000/04/14  Time: 23:30:00  Throughput(days/day)[Avg Tot Run]:       1399.3       2459.7       2525.3  TimeRemaining(Est) 000:00:55   30.7% :  26.6% Mem Comm:Used
 AGCM Date: 2000/04/14  Time: 23:45:00  Throughput(days/day)[Avg Tot Run]:       1460.0       2585.3       2658.8  TimeRemaining(Est) 000:00:52   30.7% :  26.5% Mem Comm:Used

 Writing:    483 Slices to File:  mapldev-2022Aug12-1day-c24.geosgcm_prog.20000415_0000z.nc4
 Writing:    153 Slices to File:  mapldev-2022Aug12-1day-c24.geosgcm_surf.20000414_2230z.nc4
 Writing:     46 Slices to File:  mapldev-2022Aug12-1day-c24.geosgcm_ocn.20000414_2130z.nc4
 Writing:      5 Slices to File:  mapldev-2022Aug12-1day-c24.geosgcm_budi.20000415_0000z.nc4
 Writing:     21 Slices to File:  mapldev-2022Aug12-1day-c24.geosgcm_buda.20000414_2230z.nc4
pe=00092 FAIL at line=00171    NetCDF4_FileFormatter.F90                <status=-35>
pe=00092 FAIL at line=00078    HistoryCollection.F90                    <status=-35>
pe=00092 FAIL at line=00790    ServerThread.F90                         <status=-35>
pe=00092 FAIL at line=00138    BaseServer.F90                           <status=-35>
pe=00092 FAIL at line=00981    ServerThread.F90                         <status=-35>
pe=00092 FAIL at line=00094    MessageVisitor.F90                       <status=-35>
pe=00092 FAIL at line=00113    AbstractMessage.F90                      <status=-35>
pe=00092 FAIL at line=00107    SimpleSocket.F90                         <status=-35>
pe=00092 FAIL at line=00429    ClientThread.F90                         <status=-35>
pe=00092 FAIL at line=00363    ClientManager.F90                        <status=-35>
pe=00092 FAIL at line=03751    MAPL_HistoryGridComp.F90                 <status=-35>
pe=00092 FAIL at line=01807    MAPL_Generic.F90                         <status=-35>
pe=00092 FAIL at line=01282    MAPL_CapGridComp.F90                     <status=-35>
pe=00092 FAIL at line=01211    MAPL_CapGridComp.F90                     <status=-35>
pe=00092 FAIL at line=01157    MAPL_CapGridComp.F90                     <status=-35>
pe=00092 FAIL at line=00825    MAPL_CapGridComp.F90                     <status=-35>
pe=00092 FAIL at line=00965    MAPL_CapGridComp.F90                     <status=-35>
pe=00092 FAIL at line=00301    MAPL_Cap.F90                             <status=-35>
pe=00092 FAIL at line=00258    MAPL_Cap.F90                             <status=-35>
pe=00092 FAIL at line=00192    MAPL_Cap.F90                             <status=-35>
pe=00092 FAIL at line=00169    MAPL_Cap.F90                             <status=-35>
pe=00092 FAIL at line=00034    GEOSgcm.F90                              <status=-35>
Abort(0) on node 92 (rank 92 in comm 0): application called MPI_Abort(MPI_COMM_WORLD, 0) - process 92
GEOSgcm Run Status: -1

I am actually a bit surprised it got as far as it did. I sort of expected that the second write to mapldev-2022Aug12-1day-c24.geosgcm_ocn.20000414_2130z.nc4 would trigger the fault, but it got to the third. Hmm.

I'm guessing this is a weird "sometimes even GEOS needs to clobber history" sort of thing? Or maybe we need a pFIO_APPEND to go along with pFIO_CLOBBER and pFIO_NOCLOBBER?
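For reference, the <status=-35> in the tracebacks is netCDF's NC_EEXIST error, i.e. a create attempt on a file that already exists when noclobber was requested, which matches the failure mode above. Here is a minimal netCDF-Fortran sketch of the three behaviors being discussed, assuming pFIO's clobber/noclobber flags wrap the corresponding nf90 create modes (demo.nc4 and the program name are purely illustrative):

program clobber_demo
   ! Minimal sketch, not MAPL code: demonstrate clobber, noclobber, and append.
   use netcdf
   implicit none
   integer :: rc, ncid

   ! Create (or truncate) demo.nc4 so it is guaranteed to exist.
   rc = nf90_create('demo.nc4', ior(NF90_CLOBBER, NF90_NETCDF4), ncid)
   rc = nf90_close(ncid)

   ! A noclobber create on an existing file returns -35 (NC_EEXIST),
   ! the status seen in the traceback.
   rc = nf90_create('demo.nc4', ior(NF90_NOCLOBBER, NF90_NETCDF4), ncid)
   print *, 'noclobber create: rc =', rc, ' ', trim(nf90_strerror(rc))

   ! The "append" alternative: open the existing file read/write instead.
   rc = nf90_open('demo.nc4', NF90_WRITE, ncid)
   print *, 'open for append:  rc =', rc, ' ', trim(nf90_strerror(rc))
   rc = nf90_close(ncid)
end program clobber_demo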

I leave this in the hands of the experts.

weiyuan-jiang commented 2 years ago

Is this duration 000000 a typo or on purpose?

mathomp4 commented 2 years ago

Is this duration 000000 a typo or on purpose?

No, it's a valid duration. I'd probably need @bena-nasa or @atrayano to explain it better, but if you set duration: 000000, I think what MAPL does is "append" to the history output.

So, if you have an hourly collection, by default you get one file every hour. But if you have duration: 000000, then you'll get one file a day with 24 time steps in it.
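For concreteness, here's a hedged HISTORY.rc sketch of the two attributes in play; the collection name and the 240000 value are illustrative, and the exact semantics are my best understanding rather than documented fact:

  # hourly time slices, and (my understanding) a new file every 24 hours:
  geosgcm_ocn.frequency:  010000,
  geosgcm_ocn.duration:   240000,

  # hourly time slices, duration 000000 = keep appending to the same file
  # instead of starting a new one:
  geosgcm_ocn.frequency:  010000,
  geosgcm_ocn.duration:   000000,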

mathomp4 commented 2 years ago

So, if you have an hourly collection, by default you get one file every hour. But if you have duration: 000000, then you'll get one file a day with 24 time steps in it.

Welp, maybe I still need to learn. I just tried it with our stock model running for two model days, and you get:

netcdf stock-2022Aug12-1day-c24.geosgcm_ocn.20000414_2130z {
dimensions:
        lat = 49 ;
        lon = 96 ;
        time = UNLIMITED ; // (48 currently)
...

So I guess it just keeps appending to this file, even after a day.

@bena-nasa @atrayano Can you remind me how to do the "append for a day, but get a new file each day" style?

bena-nasa commented 2 years ago

I'll check this out. I talked to @mathomp4; the odd thing is that the ExtData test cases almost all use duration 0, and in some cases they write an entire year of data to a single file, and those passed...

bena-nasa commented 2 years ago

I found the problem; I might need @weiyuan-jiang's help on Monday to fix it. The issue is that the file creation for each collection, via the find method of the HistoryCollection, takes place on a single process on the server. But that process can change from write to write, and it may not be the process that was active when the file was created, so file_last_created was not set on the collection for that process. Since my ExtData tester writes to only a single collection in the examples so far (to generate the synthetic output), the rank is the same each time. Likewise, in the standard non-ocean History collections no file gets more than one time slice written, so every file is new and there is no last file to check.
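To illustrate the failure mode, here is a minimal MPI sketch (not MAPL code; created_here is a made-up stand-in for the file_last_created bookkeeping) of per-process state going stale when the writing rank changes between writes. Run it on two ranks:

program per_rank_state
   use mpi
   implicit none
   integer :: rank, nranks, ierr, step
   logical :: created_here   ! per-process bookkeeping, never shared

   call MPI_Init(ierr)
   call MPI_Comm_rank(MPI_COMM_WORLD, rank, ierr)
   call MPI_Comm_size(MPI_COMM_WORLD, nranks, ierr)

   created_here = .false.
   do step = 1, 2
      ! Pretend the backend hands the write for this collection to a
      ! different rank at each step.
      if (rank == mod(step, nranks)) then
         if (.not. created_here) then
            ! The rank doing step 2 never saw step 1's marking, so it
            ! behaves as if the file were brand new.
            print *, 'step', step, ': rank', rank, 'thinks the file is new'
            created_here = .true.
         else
            print *, 'step', step, ': rank', rank, 'knows it already wrote it'
         end if
      end if
   end do

   call MPI_Finalize(ierr)
end program per_rank_state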

bena-nasa commented 2 years ago

After looking at the code with @weiyuan-jiang, there is no easy fix for this, given that the pFIO server is really multiple server implementations/flavors, and some of those are meant to work asynchronously.

Here's the issue, as alluded to in the previous comment. Currently, the writing on the server for each collection is done by a single process. Which process that is, and how it changes from write to write, depends on which flavor of the pFIO server you are using. Take the "multiple" group server: each time it is time to write, the backend controller farms out the writing work to whatever processes are available. The file is not created or opened anywhere but on the writing process, and the inquiry about its existence does not take place until then either.

Let's say that at one timestep, collection 5 was assigned to write on process 1. The logic that was added in develop says:

If the file doesn't exist, create it and mark in the collection (via an allocatable character array) that the server created it.

If it does exist, first check whether the server created it. If the server did create the file, just open it. If it did not, create the file with the clobber condition set by the user; at this point you may succeed or fail (you fail if the user said noclobber). If you allow clobbering, mark that the server created the file.

The problem is that this marking happens only on a single process.

Let's say that at the next timestep, collection 5 writes on process 2. Well, if last time collection 5 was on process 1, the marking only happened on process 1, and for very, very good reasons process 1 can't and should not talk to process 2.

So something needs to happen outside the writing processes that is collective across the server, for all the server types we support. Ugh, not easy...
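As a rough illustration of the per-write logic just described, here is a hedged Fortran sketch; the routine and flag names are hypothetical, not the actual HistoryCollection/pFIO code, and created_by_server stands in for the allocatable character array bookkeeping, which lives only on the process doing the write:

subroutine open_or_create(fname, allow_clobber, created_by_server, ncid, rc)
   ! Hypothetical sketch of the decision tree described above, not MAPL code.
   use netcdf
   implicit none
   character(len=*), intent(in)    :: fname
   logical,          intent(in)    :: allow_clobber
   logical,          intent(inout) :: created_by_server   ! per-process state
   integer,          intent(out)   :: ncid, rc
   logical :: exists

   inquire(file=fname, exist=exists)
   if (.not. exists) then
      ! File doesn't exist: create it and mark that this server created it.
      rc = nf90_create(fname, NF90_NETCDF4, ncid)
      created_by_server = .true.
   else if (created_by_server) then
      ! This process created it earlier: just open it and append.
      rc = nf90_open(fname, NF90_WRITE, ncid)
   else
      ! Existing file that this process never marked: apply the user's
      ! clobber setting. With noclobber this is where -35 (NC_EEXIST) shows up.
      if (allow_clobber) then
         rc = nf90_create(fname, ior(NF90_CLOBBER, NF90_NETCDF4), ncid)
         created_by_server = .true.
      else
         rc = nf90_create(fname, ior(NF90_NOCLOBBER, NF90_NETCDF4), ncid)
      end if
   end if
   ! If a different process handles the next write, its created_by_server
   ! is still .false., so it falls into the last branch and can hit -35.
end subroutine open_or_create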