Open wmputman opened 5 years ago
@wmputman I cannot see how that cannot work. I mean, that's basic Linux 101 there.
I could easily belt-and-suspender it with file existence checks, using /bin/cat, checking status, etc., which I suppose we should do everywhere, but it's cat, which is pretty boring and doesn't do much.
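For what it's worth, the belt-and-suspenders version could look something like the sketch below. This is only an illustration: safe_cat is a hypothetical helper name, it is written in POSIX sh (the actual run scripts may use a different shell and would need a translation), and it assumes the target is written fresh from its sources rather than appended to.

```shell
#!/bin/sh
# Hypothetical helper (not part of the run scripts): concatenate one or
# more source files into a target, checking every step along the way.
# Usage: safe_cat TARGET SOURCE...
safe_cat() {
    target=$1
    shift

    # Refuse to start if any source is missing or empty.
    for src in "$@"; do
        if [ ! -s "$src" ]; then
            echo "safe_cat: source '$src' is missing or empty" >&2
            return 1
        fi
    done

    # Write to a temporary file first so that a failed cat never leaves
    # a truncated or empty target behind.
    tmp="$target.tmp.$$"
    if ! /bin/cat "$@" > "$tmp"; then
        echo "safe_cat: cat exited with an error" >&2
        rm -f "$tmp"
        return 1
    fi

    # Confirm the result is non-empty before moving it into place.
    if [ ! -s "$tmp" ]; then
        echo "safe_cat: result '$tmp' is empty" >&2
        rm -f "$tmp"
        return 1
    fi

    mv "$tmp" "$target"
}
```

A call such as `safe_cat input.nml fvcore_layout.rc || exit 1` would then stop the job before the model starts instead of letting it run with an empty namelist.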
Is this only happening on, say, SLES10? That system does have an older cat (v. 8.12) versus that on SLES12 (v. 8.25). I can't imagine cat having bugs, but I can check the changelogs for GNU and see.
I looked at the changelogs and the only mention of cat was in the one for 8.17:
cp,mv,install,cat,split: now read and write a minimum of 64KiB at a time. This was previously 32KiB and increasing to 64KiB was seen to increase throughput by about 10% when reading cached files on 64 bit GNU/Linux.
That's not really a bug fix, though.
This has happened on SLES11 (once) and SLES12 (frequently) over the last month.
The only other thing I can see is perhaps in FV3. For example, fv_control.F90 has this bit:
    f_unit=open_namelist_file()
    rewind (f_unit)
    ! Read Main namelist
    read (f_unit,fv_grid_nml,iostat=ios)
    ierr = check_nml_error(ios,'fv_grid_nml')
    call close_file(f_unit)
gfdl_cloud_microphys.F90 has:
    nlunit=open_namelist_file()
    rewind (nlunit)
    ! Read Main namelist
    read (nlunit,gfdl_cloud_microphysics_nml,iostat=ios)
    ierr = check_nml_error(ios,'gfdl_cloud_microphysics_nml')
    call close_file(nlunit)
Fairly similar. Open, rewind, read, check, close. However, later on in fv_control.F90 there is this:
    if (size(Atm) == 1) then
       f_unit = open_namelist_file()
    else if (n == 1) then
       f_unit = open_namelist_file('input.nml')
    else
       write(nested_grid_filename,'(A10, I2.2, A4)') 'input_nest', n, '.nml'
       f_unit = open_namelist_file(nested_grid_filename)
    endif

    ! Read FVCORE namelist
    read (f_unit,fv_core_nml,iostat=ios)
    ierr = check_nml_error(ios,'fv_core_nml')

    ! Read Test_Case namelist
    rewind (f_unit)
    read (f_unit,test_case_nml,iostat=ios)
    ierr = check_nml_error(ios,'test_case_nml')
    call close_file(f_unit)
This is about the only place I could find where there was an open_namelist_file() without a rewind() right after it. But the check_nml_error() call should catch any iostat issues.
Now, fms itself seems to read namelist files a bit differently than FV3. From fms.F90:
    if (file_exist('input.nml')) then
       unit = open_namelist_file ( )
       ierr=1; do while (ierr /= 0)
          read (unit, nml=fms_nml, iostat=io, end=10)
          ierr = check_nml_error(io,'fms_nml') ! also initializes nml error codes
       enddo
    10 call mpp_close (unit)
    endif
but it's equivalent, I think (close_file() is essentially a wrapper on mpp_close()).
I am trying one thing now. Per Rusty in the source:
    !-----------------------------------------------------------------------
    ! subroutine READ_INPUT_NML
    !
    !
    ! Reads an existing input.nml into a character array and broadcasts
    ! it to the non-root mpi-tasks. This allows the use of reads from an
    ! internal file for namelist settings (requires 2003 compliant compiler)
    !
    ! read(input_nml_file, nml=<name_nml>, iostat=status)
This seems to be triggered by the INTERNAL_FILE_NML codepath. It would save file closes and opens if it works.
I'm building with the appropriate macro set and we'll see if it can even run.
Or wait, maybe not. FMS code is fun to read...
Well, that wasn't too hard. I can definitely activate the INTERNAL_FILE_NML path. You have to change a few CMakeLists.txt files and edit the two microphysics files, but it seems zero-diff. Whether or not it helps is another ball of wax, but I suppose whichever of @wmputman or @sdrabenh hits the error more consistently could give it a try.
NOTE: I didn't change the compilation of MOM5 which of course would need the same ifdef activated, but baby steps first.
@wmputman et al, I sent an email to Rusty about INTERNAL_FILE_NML:
My question is regarding namelist reading in FMS. Bill Putman seems to be having intermittent issues with it, so I took a look. I noticed your name in the INTERNAL_FILE_NML codepath.
My reading is that instead of say, all 96 processors reading the namelist every time it's processed in FMS, FV3, microphysics, etc., only root would read it and then broadcast the results to an internal file. Is that correct?
He replied with:
Your interpretation is correct. To use the internal file, you only need:
    use mpp_mod, only: input_nml_file
    read (input_nml_file, <namelist>, iostat=io)
    ierr = check_nml_error (io, '<namelist>')
If you are using multiple namelist files, you can clear and re-read a new namelist as needed using the mpp_mod::read_input_nml subroutine.
As you run on many, many processors, @wmputman, it might be worth moving to INTERNAL_FILE_NML for some testing. It might cost some MPI time to broadcast the character array, but it's probably better than 1000s of processes all opening the same file.
In gcm_run.j and gcm_forecast.tmpl there are occasions when fvcore_layout.rc does not properly cat to input.nml, which leaves input.nml as an empty file. The most prevalent symptom at the moment is that the FMS stack size does not get set properly and the model fails due to exceeding stack limits.
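One way to turn this into a clear, early failure rather than a stack-limit crash inside the model would be a pre-launch check on input.nml. The sketch below is only an illustration under a few assumptions: check_nml is a hypothetical helper name, it is written in POSIX sh rather than the shell the run scripts actually use, and the group names passed to it (fms_nml and fv_core_nml in the usage line) are simply the ones mentioned in this thread.

```shell
#!/bin/sh
# Hypothetical pre-launch check (not part of the run scripts): succeed
# only if the namelist file is non-empty and contains every required
# namelist group header.
# Usage: check_nml FILE GROUP...
check_nml() {
    nml=$1
    shift

    # An empty or missing file is the failure mode described above.
    if [ ! -s "$nml" ]; then
        echo "check_nml: '$nml' is missing or empty" >&2
        return 1
    fi

    for group in "$@"; do
        # A namelist group starts with '&name'; allow leading whitespace
        # and match case-insensitively, since Fortran does not care.
        if ! grep -Eqi "^[[:space:]]*&${group}([[:space:]]|\$)" "$nml"; then
            echo "check_nml: group '&${group}' not found in '$nml'" >&2
            return 1
        fi
    done
}
```

Calling `check_nml input.nml fms_nml fv_core_nml || exit 1` just before the model is launched would make the job stop with a message that points at the namelist instead of at the stack size.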