MPAS-Dev / MPAS

Repository for private MPAS development prior to the MPAS v6.0 release.

Config option having different values on different procs #373

Closed matthewhoffman closed 8 years ago

matthewhoffman commented 9 years ago

In the LI core, I have encountered a situation where a namelist config option has different values on different processors. This is a continuation of the discussion related to code hanging in PR #354.

The option is defined as:

        <nml_option name="config_print_thickness_advection_info" type="logical" default_value=".true." units="unitless"
                    description="Prints additional information about thickness advection."
                    possible_values=".true. or .false."
        />

I added a print statement of its value to the code:

    call mpas_pool_get_config(liConfigs, 'config_print_thickness_advection_info', config_print_thickness_advection_info)
    print *, '3advect_info=', config_print_thickness_advection_info

Then, when I run on 4 procs, I see a different value on proc 0 than on the other 3 procs:

$ grep 3advect_info *out
log.0000.out: 3advect_info= F
log.0001.out: 3advect_info= T
log.0002.out: 3advect_info= T
log.0003.out: 3advect_info= T

(In this run the actual value set in the namelist file is True.)

In debug mode, PIO complains about the differing values between processors:

Warning (inconsistent metadata): attribute "config_print_thickness_advection_info" length (2 != 3)
Warning (inconsistent metadata): attribute "config_print_thickness_advection_info" CHAR (NO���À != YES)

It is not clear to me what it is about this particular namelist option that causes this when the other options are unaffected. I suspected the length of its name (37 characters), though the ocean core has equally long option names and has not seen a problem.

douglasjacobsen commented 9 years ago

@matthewhoffman Can you post a log.*.err file from a run with this?

matthewhoffman commented 9 years ago

log.0000.err:

 Reading namelist from file namelist.landice
 Namelist record debug not found; using default values for variables in this namelist
 Error: Config config_output_interval not found in pool.
 Error: Config config_restart_interval not found in pool.
 Reading streams configuration from file streams.landice
Found grid stream with template landice_grid.nc
  ** Attempting to bootstrap MPAS framework using stream: input
 Bootstrapping framework with mesh fields from input file 'landice_grid.nc'

Parsing run-time I/O configuration from streams.landice ...

 -----  found immutable stream "basicmesh" in streams.landice  -----
        filename template:  not-to-be-used.nc
        filename interval:  none
        direction:          none
        reference time:     initial_time
        record interval:    -

 -----  found immutable stream "input" in streams.landice  -----
        filename template:  landice_grid.nc
        filename interval:  none
        direction:          input
        reference time:     initial_time
        record interval:    -
        input alarm:        initial_only

 -----  found immutable stream "restart" in streams.landice  -----
        filename template:  restart.$Y-$M-$D_$h.$m.$s.nc
        filename interval:  0001-00-00_00:00:00
        clobber mode:       replace_files
        direction:          input, output
        reference time:     0000-01-01_00:00:00
        record interval:    -
        real precision:     8 bytes
        input alarm:        initial_only
        output alarm:       0001-00-00_00:00:00

 -----  found stream "output" in streams.landice  -----
        filename template:  output.nc
        filename interval:  none
        clobber mode:       replace_files
        direction:          output
        reference time:     0000-01-01_00:00:00
        record interval:    -
        real precision:     8 bytes
        output alarm:       0001-00-00_00:00:00

 ----- done parsing run-time I/O from streams.landice -----

Reading dimensions from input streams ...

 ----- reading dimensions from stream 'input' using file landice_grid.nc
            nVertInterfaces *** not found in stream ***
                   maxEdges =       6
                  nVertices =    2040
                     nEdges =    3060
                  maxEdges2 =      12
                     nCells =    1020
               vertexDegree =       3
                        TWO =       2
                nVertLevels =       9

   *** unable to open input file restart.0000-01-01_00.00.00.nc for stream 'restart'

 ----- done reading dimensions from input streams -----

Assigning remaining dimensions from definitions in Registry.xml ...
                         R3 =       3
            nVertInterfaces =      10

 ----- done assigning dimensions from Registry.xml -----

 WARNING: Variable dirichletVelocityMask not in input file.
 WARNING: Variable uReconstructX not in input file.
 WARNING: Variable uReconstructY not in input file.
 WARNING: Variable xtime not in input file.
 WARNING: File landice_grid.nc does not contain a seekable xtime variable. Forcing a read of the first time record.
 Initial timestep 0000-01-01_00:00:00
MPAS I/O: Truncating existing data in output file output.nc
 Doing timestep 0001-01-01_00:00:00

log.0001.err:

 Reading namelist from file namelist.landice
 Error: Config config_output_interval not found in pool.
 Error: Config config_restart_interval not found in pool.
 Reading streams configuration from file streams.landice
Found grid stream with template landice_grid.nc
  ** Attempting to bootstrap MPAS framework using stream: input
 Bootstrapping framework with mesh fields from input file 'landice_grid.nc'

Parsing run-time I/O configuration from streams.landice ...

 -----  found immutable stream "basicmesh" in streams.landice  -----
        filename template:  not-to-be-used.nc
        filename interval:  none
        direction:          none
        reference time:     initial_time
        record interval:    -

 -----  found immutable stream "input" in streams.landice  -----
        filename template:  landice_grid.nc
        filename interval:  none
        direction:          input
        reference time:     initial_time
        record interval:    -
        input alarm:        initial_only

 -----  found immutable stream "restart" in streams.landice  -----
        filename template:  restart.$Y-$M-$D_$h.$m.$s.nc
        filename interval:  0001-00-00_00:00:00
        clobber mode:       replace_files
        direction:          input, output
        reference time:     0000-01-01_00:00:00
        record interval:    -
        real precision:     8 bytes
        input alarm:        initial_only
        output alarm:       0001-00-00_00:00:00

 -----  found stream "output" in streams.landice  -----
        filename template:  output.nc
        filename interval:  none
        clobber mode:       replace_files
        direction:          output
        reference time:     0000-01-01_00:00:00
        record interval:    -
        real precision:     8 bytes
        output alarm:       0001-00-00_00:00:00

 ----- done parsing run-time I/O from streams.landice -----

Reading dimensions from input streams ...

 ----- reading dimensions from stream 'input' using file landice_grid.nc
            nVertInterfaces *** not found in stream ***
                   maxEdges =       6
                  nVertices =    2040
                     nEdges =    3060
                  maxEdges2 =      12
                     nCells =    1020
               vertexDegree =       3
                        TWO =       2
                nVertLevels =       9

   *** unable to open input file restart.0000-01-01_00.00.00.nc for stream 'restart'

 ----- done reading dimensions from input streams -----

Assigning remaining dimensions from definitions in Registry.xml ...
                         R3 =       3
            nVertInterfaces =      10

 ----- done assigning dimensions from Registry.xml -----

 WARNING: Variable dirichletVelocityMask not in input file.
 WARNING: Variable uReconstructX not in input file.
 WARNING: Variable uReconstructY not in input file.
 WARNING: Variable xtime not in input file.
 WARNING: File landice_grid.nc does not contain a seekable xtime variable. Forcing a read of the first time record.
 Initial timestep 0000-01-01_00:00:00
MPAS I/O: Truncating existing data in output file output.nc
 Doing timestep 0001-01-01_00:00:00

And the diff, which looks interesting, since 'debug' is the namelist record to which the option described above belongs:

$ diff log.0000.err log.0001.err
2d1
<  Namelist record debug not found; using default values for variables in this namelist

douglasjacobsen commented 9 years ago

@matthewhoffman Just so you know what I'm thinking...

The code that reads in the namelists will only broadcast a value if the read is successful. So, it seems like there is a problem reading in that namelist record.

douglasjacobsen commented 9 years ago

@matthewhoffman Can you post the code in src/inc/namelist_defines.inc (after a successful build)?

douglasjacobsen commented 9 years ago

@matthewhoffman Also, you could go back to a commit where the default value was .false. and see if that message also appears in log.0000.err.

matthewhoffman commented 9 years ago

@douglasjacobsen - I'll check those things later when I get a chance. I also want to check to see if I inadvertently removed or deactivated an error check that should have aborted the model when the namelist record read error occurred.

douglasjacobsen commented 9 years ago

@matthewhoffman No problem. It might also be helpful to see the namelist you were running with.

matthewhoffman commented 9 years ago

Contents of src/inc/namelist_defines.inc: https://gist.github.com/matthewhoffman/6657d73fda2175fab2a4

matthewhoffman commented 9 years ago

namelist.landice (written by lettuce):

&velocity_solver
config_velocity_solver='FO'
/
&advection
/
&physical_parameters
config_flowLawExponent=3.0
config_ice_density=910.0
config_default_flowParamA=3.1709792e-24
config_dynamic_thickness=10.0
/
&time_integration
config_dt='0001-00-00_00:00:00'
config_time_integration='forward_euler'
/
&time_management
config_start_time='0000-01-01_00:00:00'
config_do_restart=.false.
config_run_duration='none'
config_restart_timestamp_name='restart_timestamp'
config_calendar_type='gregorian_noleap'
config_stop_time='0002-01-01_00:00:00'
/
&io
config_pio_stride=1
config_write_output_on_startup=.true.
config_year_digits=4
config_pio_num_iotasks=0
/
&decomposition
config_number_of_blocks=0
config_proc_decomp_file_prefix='graph.info.part.'
config_block_decomp_file_prefix='graph.info.part.'
config_explicit_proc_decomp=.false.
config_num_halos=3
/
&debug
config_print_thickness_advection_info=.false.
/

matthewhoffman commented 9 years ago

@douglasjacobsen, I think I figured this out.

The namelist file that lettuce produced did not have a newline on the last line, i.e., when I cat it in my terminal I see:

...
&debug
config_print_thickness_advection_info=.false.
/09:55:58 ~/documents/mpas-git/mpas-lettuce-testing/testing_tests/dome$

If I add a newline manually, then the code runs successfully. So I guess one could consider this a bug in lettuce, which I will fix. However, it seems somewhat fragile for MPAS to run without the newline but to set different config values on different processors, so I think it would be good if we can have MPAS either die or handle this.
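A quick pre-flight check for this failure mode can be sketched in Python. This is only an illustration and is not part of MPAS or lettuce: the helper names `ends_with_newline` and `ensure_trailing_newline` are hypothetical, and the check looks only at the final byte of the file.

```python
import os


def ends_with_newline(path):
    """Return True if the file is empty or its last byte is a newline."""
    with open(path, "rb") as f:
        f.seek(0, os.SEEK_END)
        if f.tell() == 0:
            return True  # an empty file has no unterminated last line
        f.seek(-1, os.SEEK_END)
        return f.read(1) == b"\n"


def ensure_trailing_newline(path):
    """Append a newline if the last line is unterminated.

    Returns True if the file was modified, False if it was already fine.
    """
    if ends_with_newline(path):
        return False
    with open(path, "ab") as f:
        f.write(b"\n")
    return True
```

Calling `ensure_trailing_newline("namelist.landice")` before launching the model would append the missing newline to a file like the one shown above and leave an already terminated file untouched.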

douglasjacobsen commented 9 years ago

So, for some reason this issue causes the read statement in Fortran to return a non-zero error code (one that is less than 0). Currently, we trap this error code for the case where a namelist record is missing entirely, so that we can use the default values without raising a fatal error.

In the descriptions of this function I've seen, I don't see a standard error code for either of these cases.

What I'd propose is always broadcasting the values from IONODE to the rest of the processors, and printing a less specific message (e.g. "There was an issue reading namelist record blah. Using values: ---print values ---").
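The difference between the current behaviour and this proposal can be illustrated with a toy model in plain Python. It is a sketch under stated assumptions and not the MPAS framework code: a list of dictionaries stands in for the MPI ranks, rank 0 plays the I/O node, and the names `distribute` and `always_broadcast` are hypothetical. It also assumes, as the logs in this thread suggest, that the read can fill in values even when it reports an error.

```python
def distribute(record, io_values, io_status, defaults, nprocs, always_broadcast):
    """Return the config dictionary each simulated rank ends up with.

    Every rank starts from the registry defaults. Rank 0 (the I/O node)
    applies whatever the namelist read produced, even if io_status != 0.
    """
    ranks = [dict(defaults) for _ in range(nprocs)]
    ranks[0].update(io_values)

    if io_status != 0:
        # Generic message, along the lines of the one proposed above.
        print(f"There was an issue reading namelist record {record}. "
              f"Using values: {ranks[0]}")

    if io_status == 0 or always_broadcast:
        for rank in ranks[1:]:
            rank.update(ranks[0])  # stands in for a broadcast from the I/O node

    return ranks


key = "config_print_thickness_advection_info"
defaults = {key: True}    # registry default is .true.
io_values = {key: False}  # value in the namelist file
io_status = -1            # read reported a (negative) error code

# Broadcast only on success: ranks disagree.
old = distribute("debug", io_values, io_status, defaults, 4, always_broadcast=False)
print([r[key] for r in old])  # [False, True, True, True]

# Always broadcast: every rank sees the I/O node's values.
new = distribute("debug", io_values, io_status, defaults, 4, always_broadcast=True)
print([r[key] for r in new])  # [False, False, False, False]
```

In this model, the first call reproduces the F/T/T/T pattern from the log files, while the second keeps all ranks consistent regardless of the read status.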

But we can talk about this more on the low level telecon today.

douglasjacobsen commented 8 years ago

This has been fixed in develop.