Closed matthewhoffman closed 8 years ago
@matthewhoffman Can you post a log.*.err file from a run with this?
log.0000.err:
Reading namelist from file namelist.landice
Namelist record debug not found; using default values for variables in this namelist
Error: Config config_output_interval not found in pool.
Error: Config config_restart_interval not found in pool.
Reading streams configuration from file streams.landice
Found grid stream with template landice_grid.nc
** Attempting to bootstrap MPAS framework using stream: input
Bootstrapping framework with mesh fields from input file 'landice_grid.nc'
Parsing run-time I/O configuration from streams.landice ...
----- found immutable stream "basicmesh" in streams.landice -----
filename template: not-to-be-used.nc
filename interval: none
direction: none
reference time: initial_time
record interval: -
----- found immutable stream "input" in streams.landice -----
filename template: landice_grid.nc
filename interval: none
direction: input
reference time: initial_time
record interval: -
input alarm: initial_only
----- found immutable stream "restart" in streams.landice -----
filename template: restart.$Y-$M-$D_$h.$m.$s.nc
filename interval: 0001-00-00_00:00:00
clobber mode: replace_files
direction: input, output
reference time: 0000-01-01_00:00:00
record interval: -
real precision: 8 bytes
input alarm: initial_only
output alarm: 0001-00-00_00:00:00
----- found stream "output" in streams.landice -----
filename template: output.nc
filename interval: none
clobber mode: replace_files
direction: output
reference time: 0000-01-01_00:00:00
record interval: -
real precision: 8 bytes
output alarm: 0001-00-00_00:00:00
----- done parsing run-time I/O from streams.landice -----
Reading dimensions from input streams ...
----- reading dimensions from stream 'input' using file landice_grid.nc
nVertInterfaces *** not found in stream ***
maxEdges = 6
nVertices = 2040
nEdges = 3060
maxEdges2 = 12
nCells = 1020
vertexDegree = 3
TWO = 2
nVertLevels = 9
*** unable to open input file restart.0000-01-01_00.00.00.nc for stream 'restart'
----- done reading dimensions from input streams -----
Assigning remaining dimensions from definitions in Registry.xml ...
R3 = 3
nVertInterfaces = 10
----- done assigning dimensions from Registry.xml -----
WARNING: Variable dirichletVelocityMask not in input file.
WARNING: Variable uReconstructX not in input file.
WARNING: Variable uReconstructY not in input file.
WARNING: Variable xtime not in input file.
WARNING: File landice_grid.nc does not contain a seekable xtime variable. Forcing a read of the first time record.
Initial timestep 0000-01-01_00:00:00
MPAS I/O: Truncating existing data in output file output.nc
Doing timestep 0001-01-01_00:00:00
log.0001.err:
Reading namelist from file namelist.landice
Error: Config config_output_interval not found in pool.
Error: Config config_restart_interval not found in pool.
Reading streams configuration from file streams.landice
Found grid stream with template landice_grid.nc
** Attempting to bootstrap MPAS framework using stream: input
Bootstrapping framework with mesh fields from input file 'landice_grid.nc'
Parsing run-time I/O configuration from streams.landice ...
----- found immutable stream "basicmesh" in streams.landice -----
filename template: not-to-be-used.nc
filename interval: none
direction: none
reference time: initial_time
record interval: -
----- found immutable stream "input" in streams.landice -----
filename template: landice_grid.nc
filename interval: none
direction: input
reference time: initial_time
record interval: -
input alarm: initial_only
----- found immutable stream "restart" in streams.landice -----
filename template: restart.$Y-$M-$D_$h.$m.$s.nc
filename interval: 0001-00-00_00:00:00
clobber mode: replace_files
direction: input, output
reference time: 0000-01-01_00:00:00
record interval: -
real precision: 8 bytes
input alarm: initial_only
output alarm: 0001-00-00_00:00:00
----- found stream "output" in streams.landice -----
filename template: output.nc
filename interval: none
clobber mode: replace_files
direction: output
reference time: 0000-01-01_00:00:00
record interval: -
real precision: 8 bytes
output alarm: 0001-00-00_00:00:00
----- done parsing run-time I/O from streams.landice -----
Reading dimensions from input streams ...
----- reading dimensions from stream 'input' using file landice_grid.nc
nVertInterfaces *** not found in stream ***
maxEdges = 6
nVertices = 2040
nEdges = 3060
maxEdges2 = 12
nCells = 1020
vertexDegree = 3
TWO = 2
nVertLevels = 9
*** unable to open input file restart.0000-01-01_00.00.00.nc for stream 'restart'
----- done reading dimensions from input streams -----
Assigning remaining dimensions from definitions in Registry.xml ...
R3 = 3
nVertInterfaces = 10
----- done assigning dimensions from Registry.xml -----
WARNING: Variable dirichletVelocityMask not in input file.
WARNING: Variable uReconstructX not in input file.
WARNING: Variable uReconstructY not in input file.
WARNING: Variable xtime not in input file.
WARNING: File landice_grid.nc does not contain a seekable xtime variable. Forcing a read of the first time record.
Initial timestep 0000-01-01_00:00:00
MPAS I/O: Truncating existing data in output file output.nc
Doing timestep 0001-01-01_00:00:00
And the diff, which looks interesting, since 'debug' is the namelist record that the option described above belongs to:
$ diff log.0000.err log.0001.err
2d1
< Namelist record debug not found; using default values for variables in this namelist
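For runs on more ranks, the same comparison can be scripted. This is only a sketch using Python's standard difflib; the function name log_differences is hypothetical, and it takes the two logs as lists of lines already read into memory.

```python
import difflib


def log_differences(lines_a, lines_b):
    """Return (only_in_a, only_in_b) for two lists of log lines."""
    only_a, only_b = [], []
    for entry in difflib.ndiff(lines_a, lines_b):
        if entry.startswith("- "):
            # Line present in the first log only.
            only_a.append(entry[2:])
        elif entry.startswith("+ "):
            # Line present in the second log only.
            only_b.append(entry[2:])
        # "  " (common) and "? " (hint) entries are ignored.
    return only_a, only_b
```

Passing the lines of log.0000.err and log.0001.err should single out the "Namelist record debug not found" message as present only on rank 0.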
@matthewhoffman Just so you know what I'm thinking...
The code that reads in the namelists will only broadcast a value if the read is successful. So, it seems like there is a problem reading in that namelist record.
@matthewhoffman Can you post the code in src/inc/namelist_defines.inc (after a successful build)?
@matthewhoffman Also, you could go back to a commit where the default value was .false. and see if that message also appears in log.0000.err.
@douglasjacobsen - I'll check those things later when I get a chance. I also want to check to see if I inadvertently removed or deactivated an error check that should have aborted the model when the namelist record read error occurred.
@matthewhoffman No problem. It might also be helpful to see the namelist you were running with.
Contents of src/inc/namelist_defines.inc: https://gist.github.com/matthewhoffman/6657d73fda2175fab2a4
namelist.landice (written by lettuce):
&velocity_solver
config_velocity_solver='FO'
/
&advection
/
&physical_parameters
config_flowLawExponent=3.0
config_ice_density=910.0
config_default_flowParamA=3.1709792e-24
config_dynamic_thickness=10.0
/
&time_integration
config_dt='0001-00-00_00:00:00'
config_time_integration='forward_euler'
/
&time_management
config_start_time='0000-01-01_00:00:00'
config_do_restart=.false.
config_run_duration='none'
config_restart_timestamp_name='restart_timestamp'
config_calendar_type='gregorian_noleap'
config_stop_time='0002-01-01_00:00:00'
/
&io
config_pio_stride=1
config_write_output_on_startup=.true.
config_year_digits=4
config_pio_num_iotasks=0
/
&decomposition
config_number_of_blocks=0
config_proc_decomp_file_prefix='graph.info.part.'
config_block_decomp_file_prefix='graph.info.part.'
config_explicit_proc_decomp=.false.
config_num_halos=3
/
&debug
config_print_thickness_advection_info=.false.
/
@douglasjacobsen , I think I figured this out.
The namelist file that lettuce produced did not have a newline on the last line, i.e., when I cat it in my terminal I see:
...
&debug
config_print_thickness_advection_info=.false.
/09:55:58 ~/documents/mpas-git/mpas-lettuce-testing/testing_tests/dome$
If I add a newline manually, then the code runs successfully. So I guess one could consider this a bug in lettuce, which I will fix. However, it seems somewhat fragile for MPAS to run without the newline but to set different config values on different processors, so I think it would be good if we can have MPAS either die or handle this.
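A minimal sketch of the manual fix, assuming a Python step can be run on the namelist after it is written. The helper name ensure_trailing_newline is hypothetical and is not part of lettuce or MPAS.

```python
from pathlib import Path


def ensure_trailing_newline(path):
    """Append a newline if the file does not end with one.

    Returns True if the file was modified, False otherwise.
    """
    p = Path(path)
    data = p.read_bytes()
    # An empty file is left alone; anything else must end in "\n".
    if data and not data.endswith(b"\n"):
        p.write_bytes(data + b"\n")
        return True
    return False
```

Calling ensure_trailing_newline("namelist.landice") before launching the model would cover the case above, where the file ended with the "/" that closes the debug record.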
So, for some reason this issue causes the read function in Fortran to return a non-zero error code (one that's less than 0). Currently, we trap this error code for the case where a namelist record is missing entirely, so that we can use the default values without causing a fatal error.
In the descriptions of this function I've seen, I don't see a standard error code for either of these cases.
What I'd propose is always broadcasting the value from IONODE to the rest of the processors, and printing a less specific message (i.e. "There was an issue reading namelist record blah. Using values: ---print values ---").
But we can talk about this more on the low level telecon today.
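To make the difference between the two behaviors concrete, here is a toy model in plain Python with no MPI involved. It assumes rank 0 is the IONODE and that it keeps whatever the read left in the variable even when the read reports a non-zero status, which is inferred from the behavior reported in this thread rather than taken from the MPAS source. The function name and arguments are hypothetical.

```python
def distribute_config(io_value, default, read_status, n_ranks, always_broadcast):
    """Model the per-rank value of one namelist option after startup."""
    # Every rank starts from the compiled-in default.
    values = [default] * n_ranks
    # Assumption: the I/O rank (rank 0) holds the value found in the file,
    # even if the read returned a non-zero status.
    values[0] = io_value
    # Behavior described in the thread: broadcast only after a clean read.
    # Proposed behavior: broadcast unconditionally.
    if read_status == 0 or always_broadcast:
        values = [io_value] * n_ranks
    return values


# Non-zero status, broadcast only on success: ranks disagree.
print(distribute_config(True, False, -1, 4, always_broadcast=False))
# Non-zero status, always broadcast: ranks agree.
print(distribute_config(True, False, -1, 4, always_broadcast=True))
```

In this model the first call leaves rank 0 with a different value from the other three ranks, and the second gives all four the same value, which is the point of the proposal.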
This has been fixed in develop.
In the LI core, I have encountered a situation where a namelist config option has different values on different processors. This is a continuation of the discussion related to code hanging in PR #354.
The option is defined as:
I added a print statement of its value to the code:
And then when I run on 4 procs, I see a different value on proc 0 than on the other 3 procs:
(In this run the actual value set in the namelist file is True.)
In debug mode, PIO complains about the differing values between processors:
It is not clear to me what about this particular namelist option would cause this when the other options are unaffected. I suspected it was due to the long length of its name (37 characters), though the ocean core has equally long options and no problem has been noticed there.