MESH-Model / MESH-Dev

This repository contains the official MESH development code, which is the basis for the 'tags' listed under the MESH-Releases repository. The same tags are listed under this repository. Legacy branches and utilities have also been ported from the former SVN (Subversion) repository. Future developments must create 'forks' from this repository.

Crash of long simulations #19

Closed mee067 closed 8 months ago

mee067 commented 8 months ago

This issue has been reported a few times by Sujata, Mazda, and possibly others: long simulations crash at some point. I had not faced this issue with the MRB because simulations ran out of job time before reaching such a crash point. I am running the Yukon long simulations now and got this crash in the year 2075 - the simulation started in 1951. Resuming the run from saved states (using the SAVERESUME auto options) is a possible way to continue the simulation and concatenate the results later. However, Mazda reported that resuming in routing-only mode does not work properly - I will try to reproduce that and report it as a separate issue. Here is a log of the crash:

[cnic-giws-cpu-19001-02:21453:0:21453] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x2aaae26f7000)
==== backtrace ====
 0 0x0000000000010e90 __funlockfile()  ???:0
 1 0x000000000071fb35 route_()  ???:0
 2 0x000000000072d49c rte_module_mp_run_rte_between_grid_()  ???:0
 3 0x00000000007e2c80 sa_mesh_run_between_grid_mp_run_between_grid_()  ???:0
 4 0x000000000081b722 MAIN__()  ???:0
 5 0x000000000040bfce main()  ???:0
 6 0x00000000000202e0 __libc_start_main()  ???:0
 7 0x000000000040beea _start()  /tmp/nix-build-glibc-2.24.drv-0/glibc-2.24/csu/../sysdeps/x86_64/start.S:120
===================

This looks like a memory issue with the routing routine "route". I am re-running with the code compiled with "symbols" to try to locate the error, but it takes much longer than a usual run (almost 3-4 times longer). Are there any variables in "route" that could be overflowing or exceeding their dimensions as time progresses?

mee067 commented 8 months ago

Btw, I checked the memory use of the job and it was only at 36%, so asking for more memory would not help.

kasra-keshavarz commented 8 months ago

I'm wondering if MESH is being run in MPI mode @mee067?

mee067 commented 8 months ago

Yes, @kasra-keshavarz - it is run in MPI mode

Now, after running the same code compiled with "symbols", I could get to the line that causes the issue:

[cnic-giws-cpu-19002-01:145603:0:145603] Caught signal 11 (Segmentation fault: Sent by the kernel at address (nil))
==== backtrace ====
 0 0x0000000000010e90 __funlockfile()  ???:0
 1 0x0000000000126166 rml::internal::MemoryPool::getFromLLOCache()  ???:0
 2 0x0000000000126d8c scalable_aligned_malloc()  ???:0
 3 0x0000000000b592bb for_alloc_allocatable()  ???:0
 4 0x00000000009b118b rte_module_mp_run_rte_between_grid_()  /gpfs/mdiops/globalhome/mee067/HPC/MESH/02_MESH_Code_EXE/r1860_ME.SUBBASINFLAG/./Routing_Model/RPN_watroute/sa_mesh_process/rte_module.f90:883
 5 0x0000000000ac3a45 sa_mesh_run_between_grid_mp_run_between_grid_()  /gpfs/mdiops/globalhome/mee067/HPC/MESH/02_MESH_Code_EXE/r1860_ME.SUBBASINFLAG/./Driver/MESH_Driver/sa_mesh_run_between_grid.f90:551
 6 0x0000000000b039f8 MAIN__()  /gpfs/mdiops/globalhome/mee067/HPC/MESH/02_MESH_Code_EXE/r1860_ME.SUBBASINFLAG/./Driver/MESH_Driver/MESH_driver.f90:970
 7 0x000000000040c08e main()  ???:0
 8 0x00000000000202e0 __libc_start_main()  ???:0
 9 0x000000000040bfaa _start()  /tmp/nix-build-glibc-2.24.drv-0/glibc-2.24/csu/../sysdeps/x86_64/start.S:120
===================

dprincz commented 8 months ago

It's probably an integer count that exceeds the precision of int32. What line does that correspond to in your code? Is it this line (per your MESH_Code commit): allocate(inline_qi(na), inline_stgch(na), inline_qo(na))

mee067 commented 8 months ago

Noting that it crashed earlier this time, in 2065 instead of 2075. Memory utilization was 19.62% (even less than what I reported above).

lines 882-883 of rte_module.F90 read:

        !> Allocate the local variables for output averaging.
        allocate(inline_qi(na), inline_stgch(na), inline_qo(na))

These are three 1D vectors of size "na", where na is the total number of active grids or subbasins (including outlets). The variables are locally defined in the subroutine, which is called every time step. It is puzzling that it only failed after so many time steps.

The variables are defined on lines 713-714 as follows:

        !> Local variables for output averaging.
        real, dimension(:), allocatable :: inline_qi, inline_stgch, inline_qo

Then they get assigned/updated on lines 952-961 within a loop. They get deallocated at the end of the routine (line 979) after their contents are passed to SA_MESH variables.
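For context, here is a condensed sketch of the per-call pattern described above (illustrative only, not the actual routine; the initialization stands in for the averaging loop):

        subroutine run_routing_sketch(na)
            integer, intent(in) :: na
            real, dimension(:), allocatable :: inline_qi, inline_stgch, inline_qo
            !> Allocated on every call, i.e. every time step.
            allocate(inline_qi(na), inline_stgch(na), inline_qo(na))
            inline_qi = 0.0; inline_stgch = 0.0; inline_qo = 0.0
            !> ... averaging and transfer to SA_MESH variables happen here ...
            deallocate(inline_qi, inline_stgch, inline_qo)
        end subroutine run_routing_sketch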

Any ideas/clues on how to overcome the issue?

mee067 commented 8 months ago

why would "na" change after running for 100+ years?

dprincz commented 8 months ago

It shouldn't and I find it odd that it's having trouble with dynamic memory space after a number of iterations. I'm unaware of any constraint in Fortran that should limit it.

Try this...

714: real, dimension(na) :: inline_qi, inline_stgch, inline_qo (remove allocatable and change dimension to 'na')

883: !- allocate(inline_qi(na), inline_stgch(na), inline_qo(na)) (comment line)

979: !- deallocate(inline_qi, inline_stgch, inline_qo) (comment line)

mee067 commented 8 months ago

I inserted a print statement in the routine to print "na" to make sure nothing weird changes it. I ran for several days and it did not change - a good sign, but it does not explain what happened. I will make the allocation static as you suggested, repeat the run, and see.

mee067 commented 8 months ago

@kasra-keshavarz I am still wondering about your question related to MPI. Do you think MPI has anything to do with this? Btw, routing is run on the head node and is not distributed across nodes, so far.

mee067 commented 8 months ago

Now, after changing the allocation of the routing inline variables to static, the error occurs on the same day but in another routine, and this time it is MPI-related, so @kasra-keshavarz may have been onto something. Here is the line causing the error:

                !> Allocate temporary arrays.
                allocate(mpi_buffer_real1d(size(model_variables_to_head)*iin))

This is line 604 in routine run_within_tile_mpi_isend(fls, shd), which is part of the sa_mesh_run_within_tile.f90 module.

I guess the size of model_variables_to_head has grown so large that it causes a memory allocation problem. The question is whether this variable should be growing at all. Does it have any cumulative component?

dprincz commented 8 months ago

That size doesn't change. I think it's the same issue... Some problem with a limit regarding the dynamic allocation.

mee067 commented 8 months ago

Yes, I inserted print statements and let it run for a year and a bit to see if it would increase at a daily, monthly, or annual scale. model_variables_to_head = 62 all the time (in my case). iin is the number of tiles per node, which varies in my case between 779 and 798, but it does not change over time, as expected.
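For scale, a quick back-of-the-envelope check with those numbers (a standalone sketch; 4-byte reals assumed):

        program buffer_size_check
            implicit none
            integer, parameter :: nvars = 62, ntiles = 798
            !> 49,476 elements, roughly 193 KiB per buffer - far too small to
            !> exhaust memory on its own.
            print *, 'elements:', nvars*ntiles
            print *, 'approx. KiB:', real(nvars*ntiles)*4.0/1024.0
        end program buffer_size_check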

What kind of limit could that be? I understand a limit of size but not a limit of the number of times you allocate a variable!

How can this one be converted to static allocation? Its size is the product of two variables, and one of them depends on the node being looped over. Unlike the local routing variables, this one really needs to be dynamically allocated given the way the routine is currently structured.

However, I feel this could hit again in another location if we fix this one, so we may be chasing this forever.

Do you think a newer or a different compiler would be the solution? If anybody has reported this as a compiler issue, it may already have been fixed.

dprincz commented 8 months ago

I agree. I don't think it should. I think when I asked Sujata to look into this, she may have confirmed this only happened with the MPI version, so maybe it's some compiler option in the wrapper for that -- It was a while ago, so I'm not certain.

I can provide changes to allocate the variable during the 'init' routine so it will be re-used, but I wonder if the same thing might come up elsewhere... I guess we'll see.
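Roughly, the allocate-once pattern would look like this (a generic sketch; the module and routine names are hypothetical, not the actual MESH code):

        module mpi_buffer_sketch
            implicit none
            real, dimension(:), allocatable :: mpi_buffer_real1d
        contains
            subroutine buffer_init(n)
                !> Allocate once, during 'init'.
                integer, intent(in) :: n
                if (.not. allocated(mpi_buffer_real1d)) allocate(mpi_buffer_real1d(n))
            end subroutine buffer_init
            subroutine buffer_pack()
                !> Re-use the buffer every time step instead of re-allocating it.
                mpi_buffer_real1d = 0.0
            end subroutine buffer_pack
        end module mpi_buffer_sketch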

mee067 commented 8 months ago

Newer compilers require newer versions of the MPI library. I will try to repeat the test with code compiled with a newer compiler and see. My setup is not small enough to test things without MPI. Sujata's basin was small enough to be used for such a test, but I am not sure I have the forcing data for it.

mee067 commented 8 months ago

I have now compiled the code using Intel 2021 on Copernicus (make mpi_intel netcdf symbols). This uses Open MPI 4.1.1, NetCDF 4.8.0 and NetCDF-Fortran 4.5.3. The long simulation still crashed in 2070 (having started in 1951), giving this error log:

[cnic-giws-cpu-19001-02:36842:0:36842] Caught signal 11 (Segmentation fault: Sent by the kernel at address (nil))
==== backtrace (tid:  36842) ====
 0 0x000000000001f093 ucs_debug_print_backtrace()  /tmp/ebuser/avx2/UCX/1.9.0/GCCcore-10.2.0/ucx-1.9.0/src/ucs/debug/debug.c:656
 1 0x00000000000130f0 __funlockfile()  :0
 2 0x000000000007278c H5AC_protect()  ???:0
 3 0x0000000000229c59 H5O_protect()  ???:0
 4 0x000000000022aff0 H5O_pin()  ???:0
 5 0x00000000000f221b H5D__mark()  ???:0
 6 0x00000000000f3c92 H5D__set_extent()  ???:0
 7 0x00000000000b5889 H5Dset_extent()  ???:0
 8 0x000000000010f519 NC4_put_vars.a()  hdf5var.c:0
 9 0x000000000010eb36 NC4_put_vara()  ???:0
10 0x000000000003db20 nc_put_vara_int()  ???:0
11 0x0000000000027dff nf_put_vara_int_.a()  nf_varaio.F90:0
12 0x00000000000bcb5e netcdf_mp_nf90_put_var_1d_fourbyteint_.a()  netcdf4.f90:0
13 0x0000000000626c48 nc_io_mp_nc4_add_data_1d_int_()  /gpfs/mdiops/globalhome/mee067/HPC/MESH/02_MESH_Code_EXE/r1860_ME.SUBBASINFLAG/./Modules/io_modules/nc_io.F90:6457
14 0x0000000000632b62 nc_io_mp_nc4_add_data_xyt_()  /gpfs/mdiops/globalhome/mee067/HPC/MESH/02_MESH_Code_EXE/r1860_ME.SUBBASINFLAG/./Modules/io_modules/nc_io.F90:7869
15 0x0000000000ab2487 output_files_mp_output_files_update_file_()  /gpfs/mdiops/globalhome/mee067/HPC/MESH/02_MESH_Code_EXE/r1860_ME.SUBBASINFLAG/./Driver/MESH_Driver/output_files.f90:2377
16 0x0000000000ab73fc output_files_mp_output_files_update_field_()  /gpfs/mdiops/globalhome/mee067/HPC/MESH/02_MESH_Code_EXE/r1860_ME.SUBBASINFLAG/./Driver/MESH_Driver/output_files.f90:2645
17 0x0000000000ab8dea output_files_mp_output_files_update_()  /gpfs/mdiops/globalhome/mee067/HPC/MESH/02_MESH_Code_EXE/r1860_ME.SUBBASINFLAG/./Driver/MESH_Driver/output_files.f90:2741
18 0x0000000000b4162b MAIN__()  /gpfs/mdiops/globalhome/mee067/HPC/MESH/02_MESH_Code_EXE/r1860_ME.SUBBASINFLAG/./Driver/MESH_Driver/MESH_driver.f90:1030
19 0x000000000040cc52 main()  ???:0
20 0x0000000000023e1b __libc_start_main()  /cvmfs/soft.computecanada.ca/gentoo/2020/usr/src/debug/sys-libs/glibc-2.30-r8/glibc-2.30/csu/../csu/libc-start.c:308
21 0x000000000040cb6a _start()  ???:0
=================================
forrtl: severe (174): SIGSEGV, segmentation fault occurred
Image              PC                Routine            Line        Source
mpi_sa_mesh_Resum  0000000000B6978A  Unknown               Unknown  Unknown
libpthread-2.30.s  00002AAAAB5F60F0  Unknown               Unknown  Unknown
libhdf5.so.103.3.  00002AAAAEF9C78C  H5AC_protect          Unknown  Unknown
libhdf5.so.103.3.  00002AAAAF153C59  H5O_protect           Unknown  Unknown
libhdf5.so.103.3.  00002AAAAF154FF0  H5O_pin               Unknown  Unknown
libhdf5.so.103.3.  00002AAAAF01C21B  H5D__mark             Unknown  Unknown
libhdf5.so.103.3.  00002AAAAF01DC92  H5D__set_extent       Unknown  Unknown
libhdf5.so.103.3.  00002AAAAEFDF889  H5Dset_extent         Unknown  Unknown
libnetcdf.so.19    00002AAAAB8EE519  Unknown               Unknown  Unknown
libnetcdf.so.19    00002AAAAB8EDB36  NC4_put_vara          Unknown  Unknown
libnetcdf.so.19    00002AAAAB81CB20  nc_put_vara_int       Unknown  Unknown
libnetcdff.so.7.0  00002AAAAAB01DFF  Unknown               Unknown  Unknown
libnetcdff.so.7.0  00002AAAAAB96B5E  Unknown               Unknown  Unknown
mpi_sa_mesh_Resum  0000000000626C48  nc_io_mp_nc4_add_        6457  nc_io.F90
mpi_sa_mesh_Resum  0000000000632B62  nc_io_mp_nc4_add_        7869  nc_io.F90
mpi_sa_mesh_Resum  0000000000AB2487  output_files_mp_o        2377  output_files.f90
mpi_sa_mesh_Resum  0000000000AB73FC  output_files_mp_o        2645  output_files.f90
mpi_sa_mesh_Resum  0000000000AB8DEA  output_files_mp_o        2741  output_files.f90
mpi_sa_mesh_Resum  0000000000B4162B  MAIN__                   1030  MESH_driver.f90
mpi_sa_mesh_Resum  000000000040CC52  Unknown               Unknown  Unknown
libc-2.30.so       00002AAAAB626E1B  __libc_start_main     Unknown  Unknown
mpi_sa_mesh_Resum  000000000040CB6A  Unknown               Unknown  Unknown
forrtl: severe (174): SIGSEGV, segmentation fault occurred
Image              PC                Routine            Line        Source
mpi_sa_mesh_Resum  0000000000B6982E  Unknown               Unknown  Unknown
libpthread-2.30.s  00002AAAAB5F60F0  Unknown               Unknown  Unknown
libhdf5.so.103.3.  00002AAAAEF9B222  H5AC_flush            Unknown  Unknown
libhdf5.so.103.3.  00002AAAAF067471  H5F__dest             Unknown  Unknown
libhdf5.so.103.3.  00002AAAAF068C17  H5F_try_close         Unknown  Unknown
libhdf5.so.103.3.  00002AAAAF0688AC  H5F__close_cb         Unknown  Unknown
libhdf5.so.103.3.  00002AAAAF101CAE  Unknown               Unknown  Unknown
libhdf5.so.103.3.  00002AAAAF217D10  H5SL_try_free_saf     Unknown  Unknown
libhdf5.so.103.3.  00002AAAAF101BA9  H5I_clear_type        Unknown  Unknown
libhdf5.so.103.3.  00002AAAAF06110E  H5F_term_package      Unknown  Unknown
libhdf5.so.103.3.  00002AAAAEF7CD9A  H5_term_library       Unknown  Unknown
libc-2.30.so       00002AAAAB63DFC7  Unknown               Unknown  Unknown
libc-2.30.so       00002AAAAB63E17A  Unknown               Unknown  Unknown
mpi_sa_mesh_Resum  0000000000B5CCDC  Unknown               Unknown  Unknown
mpi_sa_mesh_Resum  0000000000B6978A  Unknown               Unknown  Unknown
libpthread-2.30.s  00002AAAAB5F60F0  Unknown               Unknown  Unknown
libhdf5.so.103.3.  00002AAAAEF9C78C  H5AC_protect          Unknown  Unknown
libhdf5.so.103.3.  00002AAAAF153C59  H5O_protect           Unknown  Unknown
libhdf5.so.103.3.  00002AAAAF154FF0  H5O_pin               Unknown  Unknown
libhdf5.so.103.3.  00002AAAAF01C21B  H5D__mark             Unknown  Unknown
libhdf5.so.103.3.  00002AAAAF01DC92  H5D__set_extent       Unknown  Unknown
libhdf5.so.103.3.  00002AAAAEFDF889  H5Dset_extent         Unknown  Unknown
libnetcdf.so.19    00002AAAAB8EE519  Unknown               Unknown  Unknown
libnetcdf.so.19    00002AAAAB8EDB36  NC4_put_vara          Unknown  Unknown
libnetcdf.so.19    00002AAAAB81CB20  nc_put_vara_int       Unknown  Unknown
libnetcdff.so.7.0  00002AAAAAB01DFF  Unknown               Unknown  Unknown
libnetcdff.so.7.0  00002AAAAAB96B5E  Unknown               Unknown  Unknown
mpi_sa_mesh_Resum  0000000000626C48  nc_io_mp_nc4_add_        6457  nc_io.F90
mpi_sa_mesh_Resum  0000000000632B62  nc_io_mp_nc4_add_        7869  nc_io.F90
mpi_sa_mesh_Resum  0000000000AB2487  output_files_mp_o        2377  output_files.f90
mpi_sa_mesh_Resum  0000000000AB73FC  output_files_mp_o        2645  output_files.f90
mpi_sa_mesh_Resum  0000000000AB8DEA  output_files_mp_o        2741  output_files.f90
mpi_sa_mesh_Resum  0000000000B4162B  MAIN__                   1030  MESH_driver.f90
mpi_sa_mesh_Resum  000000000040CC52  Unknown               Unknown  Unknown
libc-2.30.so       00002AAAAB626E1B  __libc_start_main     Unknown  Unknown
mpi_sa_mesh_Resum  000000000040CB6A  Unknown               Unknown  Unknown
srun: error: cnic-giws-cpu-19001-02: task 0: Exited with exit code 174
slurmstepd: error:  mpi/pmix_v4: _errhandler: cnic-giws-cpu-19001-02 [0]: pmixp_client_v2.c:212: Error handler invoked: status = -61, source = [slurm.pmix.1847143.0:0]
srun: Job step aborted: Waiting up to 32 seconds for job step to finish.
slurmstepd: error: *** STEP 1847143.0 ON cnic-giws-cpu-19001-02 CANCELLED AT 2023-11-07T23:38:53 ***
srun: error: cnic-giws-cpu-19001-02: tasks 1-31: Killed
Program finished with exit code 137 at: Tue Nov  7 23:38:54 CST 2023

The error occurs in subroutine 'nc4_add_data_1d_int' in nc_io.F90; the line reads:

ierr = nf90_put_var(iun, vid, dat, start = start)

This is a simple call to one of the NetCDF library routines to write a variable. The calling routine nc4_add_data_xyt is trying to write the time axis value for some output. I am writing daily output in nc format; the time in days is 43,504 > 32,767, which is the end of the (2-byte) signed integer range. I believe it should have stopped at 32,767, or if we consider it unsigned, then it should stop only when reaching 65,535.

While it is still weird, this error makes more sense than the above-mentioned dynamic allocation errors. I guess the way out is to use long or even float for the time axis. With hourly output, that integer range would be exhausted in less than 4 years.
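For illustration, a minimal sketch of defining the time axis with a wider type (standalone; the file name and units are placeholders, not the MESH output code):

        program time_axis_sketch
            use netcdf
            implicit none
            integer :: ierr, iun, tdim, vid
            ierr = nf90_create('sketch.nc', NF90_NETCDF4, iun)
            ierr = nf90_def_dim(iun, 'time', NF90_UNLIMITED, tdim)
            !> NF90_DOUBLE (or NF90_INT64) instead of a narrow integer type, so the
            !> axis cannot overflow on very long simulations.
            ierr = nf90_def_var(iun, 'time', NF90_DOUBLE, (/tdim/), vid)
            ierr = nf90_put_att(iun, vid, 'units', 'days since 1951-01-01 00:00:00')
            ierr = nf90_enddef(iun)
            ierr = nf90_close(iun)
        end program time_axis_sketch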

Any suggestions?

mee067 commented 8 months ago

I checked another simulation where I had hourly output and the time axis reads:

       int time(time) ;
                time:units = "hours since 1951-01-01 00:00:00.000000"

The time is currently at 1,016,856, which is much larger than those signed and unsigned integer ranges - there is something I do not understand. How can nc handle an integer value that is out of the integer range?

dprincz commented 8 months ago

Interesting, and yes that does make more sense. Compile the code with double integer (gfortran: -fdefault-integer-8, ifort: -i8) and see if it's any different. You'll have to modify the makefile (I don't think I preserved the 'double-integer' option when I updated it).

mee067 commented 8 months ago

Will try that. But can you explain how nc still handles int time which is out of the int range?

mee067 commented 8 months ago

Do you want to go all the way to 8-byte integers instead of trying 4-byte ones first? And if we use 4-byte or 8-byte, wouldn't it be better to go to float, or would that make us lose some precision?

dprincz commented 8 months ago

The default is 4-byte precision for integer. This will change it to 8-byte precision for integer (i.e., from int to long). Float only relates to type real. Those can be changed with the 'double' option with the existing makefile.
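A quick way to confirm what the compiler is using (a standalone sketch, not part of MESH):

        program int_kind_check
            use, intrinsic :: iso_fortran_env, only: int64
            implicit none
            integer :: i_default
            integer(int64) :: i_long
            print *, huge(i_default)   ! 2147483647 unless compiled with -i8 / -fdefault-integer-8
            print *, huge(i_long)      ! 9223372036854775807
        end program int_kind_check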

mee067 commented 8 months ago

I thought the default was 2-byte. If the default for integers is 4-byte, then all the above becomes baseless; we are still within the range (-2,147,483,648 to 2,147,483,647), or (0 to 4,294,967,295) if unsigned.

Anyway, -i8 caused a compile error:

./Driver/MESH_Driver/program_end.f90(16): error #6285: There is no matching specific subroutine for this generic subroutine call.   [MPI_FINALIZE]
    call MPI_Finalize(ierr)
---------^
compilation aborted for ./Driver/MESH_Driver/program_end.f90 (code 1)
make: *** [makefile:197: program_end.o] Error 1

I guess the MPI library is not expecting an 8-byte integer in this case. The problematic call is this line:

call MPI_Finalize(ierr)

dprincz commented 8 months ago

What happens if, in Modules/io_modules/nc_io.F90, you change line 5442 to: ierr = nf90_create(fpath, ior(NF90_NETCDF4, NF90_CLASSIC_MODEL), iun)? I'm wondering if that might remove some of the HDF5 things... It should still produce a NetCDF4 file, but in the classic format. I don't think we use any special features of NetCDF4.
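Something like this (a sketch of the change, assuming the current call passes only NF90_NETCDF4 as the creation mode):

        !- ierr = nf90_create(fpath, NF90_NETCDF4, iun)
        ierr = nf90_create(fpath, ior(NF90_NETCDF4, NF90_CLASSIC_MODEL), iun)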

mee067 commented 8 months ago

Should I do that with the -i8 option? I mean, is this to overcome the compilation issue, or to overcome the crash of long simulations?

dprincz commented 8 months ago

It's not necessary. You determined it should have no effect.

dprincz commented 8 months ago

to overcome the crash of long simulations

mee067 commented 8 months ago

> It's not necessary. You determined it should have no effect.

Not necessary, and not working either.

mee067 commented 8 months ago

It compiled but upon running, it couldn't create the nc output files:

 READING: outputs_balance.txt
   ERROR: An error occurred saving the 'lat' variable (Code -39).
   ERROR: Errors occurred while applying the output configuration for 'SNO' (Line 1).
   ERROR: An error occurred saving the 'lat' variable (Code -39).
   ERROR: Errors occurred while applying the output configuration for 'FSNO' (Line 2).
   ERROR: An error occurred saving the 'lat' variable (Code -39).
   ERROR: Errors occurred while applying the output configuration for 'ZSNO' (Line 3).
   ERROR: An error occurred saving the 'lat' variable (Code -39).
   ERROR: Errors occurred while applying the output configuration for 'RFF' (Line 4).
   ERROR: Errors occurred while reading outputs_balance.txt.
   Abnormal exit.
1
srun: error: cnic-giws-cpu-19001-03: task 0: Exited with exit code 1
slurmstepd: error:  mpi/pmix_v4: _errhandler: cnic-giws-cpu-19001-03 [0]: pmixp_client_v2.c:212: Error handler invoked: status = -61, source = [slurm.pmix.1853231.0:0]
slurmstepd: error: *** STEP 1853231.0 ON cnic-giws-cpu-19001-03 CANCELLED AT 2023-11-08T14:22:23 ***
srun: Job step aborted: Waiting up to 32 seconds for job step to finish.
srun: error: cnic-giws-cpu-19001-03: tasks 1-15: Killed
Program finished with exit code 137 at: Wed Nov  8 14:22:24 CST 2023

mee067 commented 8 months ago

this is a grid-based setup

mee067 commented 8 months ago

Given that the above errors were related to nc output, I made a long run saving no nc output, just the csv basin averages and reach output. It still crashed after 120 years of simulation:

[cnic-giws-cpu-19001-02:260329:0:260329] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x2aaadcccb000)
==== backtrace (tid: 260329) ====
 0 0x000000000001f093 ucs_debug_print_backtrace()  /tmp/ebuser/avx2/UCX/1.9.0/GCCcore-10.2.0/ucx-1.9.0/src/ucs/debug/debug.c:656
 1 0x00000000000130f0 __funlockfile()  :0
 2 0x00000000009e4f35 rte_module_mp_run_rte_between_grid_()  /gpfs/mdiops/globalhome/mee067/HPC/MESH/02_MESH_Code_EXE/r1860_ME.SUBBASINFLAG/./Routing_Model/RPN_watroute/sa_mesh_process/rte_module.f90:789
 3 0x0000000000b00534 sa_mesh_run_between_grid_mp_run_between_grid_()  /gpfs/mdiops/globalhome/mee067/HPC/MESH/02_MESH_Code_EXE/r1860_ME.SUBBASINFLAG/./Driver/MESH_Driver/sa_mesh_run_between_grid.f90:551
 4 0x0000000000b41608 MAIN__()  /gpfs/mdiops/globalhome/mee067/HPC/MESH/02_MESH_Code_EXE/r1860_ME.SUBBASINFLAG/./Driver/MESH_Driver/MESH_driver.f90:970
 5 0x000000000040cc52 main()  ???:0
 6 0x0000000000023e1b __libc_start_main()  /cvmfs/soft.computecanada.ca/gentoo/2020/usr/src/debug/sys-libs/glibc-2.30-r8/glibc-2.30/csu/../csu/libc-start.c:308
 7 0x000000000040cb6a _start()  ???:0

The line causing the issue in rte_module reads: if (fms%stmg%n > 0) qhyd(:, fhr) = real(fms%stmg%qomeas%val, kind(qhyd))

The condition seems too simple to be causing issues. I suspect the assignment may be the problem, so I checked:

qhyd is declared in area_watflood, line 366. That module is used by rte_module and is essentially shared.

      real*4,   dimension(:,:), allocatable :: qhyd,qsyn,qloc,
     *                               delta,frc,frcs,
     *                               qhyd_dly,qsyn_dly,qhyd_mly,qsyn_mly

qhyd is allocated on lines 404-414 of rte_module.F90, which fall within the run_rte_init routine:

        !> Streamflow gauge locations.
        no = fms%stmg%n
        if (fms%stmg%n > 0) then
            allocate( &
                iflowgrid(no), nopt(no), &
!todo: fix this (999999).
                qhyd(no, 999999))
            iflowgrid = fms%stmg%meta%rnk
            nopt = -1
            qhyd(:, 1) = real(fms%stmg%qomeas%val, kind(qhyd))
        end if

I traced the structures and declarations of fms%stmg%qomeas%val and found it to be real (precision not specified) - I think the typecast is only effective when the code is compiled in double precision (which is not the case here).

fhr is an integer counter declared in area_watflood - I am not sure what it stands for; it looks like a time counter in hours. It is incremented by 1 before the line in question. Could it be causing a dimension overflow? 120 years x 24 x 365 = 1,051,200 hours, which is well within the integer range but > 999999 - see the qhyd allocation above. I guess that's the issue. It was flagged as something to fix, but has not been fixed so far. I will redo the test making it 9,999,999, but I do not think this should be the ultimate solution. This variable is used to store the observed station values, which are read from file at a given frequency. Why do we need to store all the values? I think the stats are computed in some cumulative way that should not require this.
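To make the suspicion concrete, a minimal sketch of the out-of-bounds write (standalone and illustrative; the sizes are made up, this is not MESH code):

        program fhr_overflow_sketch
            implicit none
            real, dimension(:,:), allocatable :: qhyd
            integer :: fhr
            allocate(qhyd(3, 999999))   ! hard-coded second dimension, as in rte_module
            fhr = 1051200               ! ~120 years of hourly steps
            qhyd(:, fhr) = 0.0          ! writes past the allocated bound -> likely SIGSEGV
        end program fhr_overflow_sketch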

The code was compiled using Intel 2021 on Copernicus (make mpi_intel netcdf symbols). This uses Open MPI 4.1.1, NetCDF 4.8.0 and NetCDF-Fortran 4.5.3. I made sure all those modules were loaded at run time.

mee067 commented 8 months ago

It seems this 999,999 is used temporarily as a big number for other things like reservoir releases, lake elevations, and so on, all marked with !todo: fix this (999999) comments. Has anybody worked on that, @dprincz - maybe EG?

mee067 commented 8 months ago

I checked the newer routing code (which contains VF's new solver and EG's changes). The 999,999 thing was not fixed there either. It is a waste of memory in my opinion.

dprincz commented 8 months ago

Where you find allocations using 999999, replace it with 1. Where you see fhr incremented by 1, comment that out so fhr always stays 1.
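In other words, something like this (an illustrative sketch, not a diff against the actual file; however the increment is written in the code, it gets commented out):

        !> Streamflow gauge locations (second dimension reduced to 1):
        allocate( &
            iflowgrid(no), nopt(no), &
            qhyd(no, 1))   ! was qhyd(no, 999999)

        !> Keep the single record index, i.e. comment out the increment:
        !- fhr = fhr + 1
        if (fms%stmg%n > 0) qhyd(:, fhr) = real(fms%stmg%qomeas%val, kind(qhyd))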

mee067 commented 8 months ago

So you agree that we do not need to keep the full time series of those variables - are you sure nothing else will be affected?

dprincz commented 8 months ago

As long as you update both those allocations AND fhr, nothing will be affected. To confirm, when you change those variables allocated as (:, 999999) to be allocated as (:, 1), you can scan the code for those variable names to see that the second index is only ever referenced using fhr.

mee067 commented 8 months ago

In the long run, those should be vectors rather than 2D variables - there is no need for a second dimension if it is always 1.

mee067 commented 8 months ago

There is only one thing that worries me, in run_rte_resume_read:

        !> Read inital values from the file.
        read(iun) fhr_i4
        fhr = int(fhr_i4)

fhr could be something other than 1 if the value read from file isn't 1. Is this routine still used?

dprincz commented 8 months ago

The variables are defined as arrays because WATFLOOD would read a full file of records at once. They should remain as-is, with the second dimension allocated to 1.

Replace that line with a dummy read statement or update the line to assign it 1 in any case:

        !> Read inital values from the file.
        read(iun) fhr_i4
        fhr = 1 !int(fhr_i4)

mee067 commented 8 months ago

Are you saying the 2D thing will be needed if the routing is compiled as a stand-alone application?

mee067 commented 8 months ago

run_rte_resume_read is used if the seq format is used for resuming. Since it reads an initial value, how could we ever have a value larger than 1 for fhr_i4?

mee067 commented 8 months ago

Just in from Mazda: a routing-only run starting in 1951, run with serially compiled MESH on a single core, still crashed in 2073 due to a segmentation fault. The code was not compiled with "symbols" to show where it occurred, but I could see something like "for_alloc_allocat" in the error log.

This negates the earlier indication that the issue is MPI-related. @dprincz mentioned that the crash did not occur for Sujata when she ran a serial version. If the issue is that 999,999 thing, I believe MPI has nothing to do with it; it should occur either way.

I am still running tests with the changes advised above.

mee067 commented 8 months ago

That was it. As far as I have tested, with code compiled with both Intel 2018 and 2021, and irrespective of having nc output, the 999999 fix has allowed both simulations to go all the way from 1951 to 2100 successfully 👍

Both simulations used MPI - so that was not the constraint. The annoying thing is that the error was caused by one statement, but the 2018 version reported the error somewhere else!

dprincz commented 8 months ago

Awesome. Now all those !todo's have finally been addressed. I'll close this now. Thanks for your efforts.

mee067 commented 8 months ago

Should I revert the static allocation of those inline variables (real, dimension(na) :: inline_qi, inline_stgch, inline_qo) before uploading the code changes? I will update my code version of course.

dprincz commented 8 months ago

Leaving it as-changed should be fine.