Closed: mee067 closed this issue 1 year ago.
Btw, I checked the memory use of the job and it was only at 36%, so asking for more memory would not help.
I'm wondering if MESH is being run in MPI mode @mee067?
Yes, @kasra-keshavarz - it is run in MPI mode
Now, after running the same code compiled with "symbols", I could get to the line that causes the issue:
[cnic-giws-cpu-19002-01:145603:0:145603] Caught signal 11 (Segmentation fault: Sent by the kernel at address (nil))
==== backtrace ====
0 0x0000000000010e90 __funlockfile() ???:0
1 0x0000000000126166 rml::internal::MemoryPool::getFromLLOCache() ???:0
2 0x0000000000126d8c scalable_aligned_malloc() ???:0
3 0x0000000000b592bb for_alloc_allocatable() ???:0
4 0x00000000009b118b rte_module_mp_run_rte_between_grid_() /gpfs/mdiops/globalhome/mee067/HPC/MESH/02_MESH_Code_EXE/r1860_ME.SUBBASINFLAG/./Routing_Model/RPN_watroute/sa_mesh_process/rte_module.f90:883
5 0x0000000000ac3a45 sa_mesh_run_between_grid_mp_run_between_grid_() /gpfs/mdiops/globalhome/mee067/HPC/MESH/02_MESH_Code_EXE/r1860_ME.SUBBASINFLAG/./Driver/MESH_Driver/sa_mesh_run_between_grid.f90:551
6 0x0000000000b039f8 MAIN__() /gpfs/mdiops/globalhome/mee067/HPC/MESH/02_MESH_Code_EXE/r1860_ME.SUBBASINFLAG/./Driver/MESH_Driver/MESH_driver.f90:970
7 0x000000000040c08e main() ???:0
8 0x00000000000202e0 __libc_start_main() ???:0
9 0x000000000040bfaa _start() /tmp/nix-build-glibc-2.24.drv-0/glibc-2.24/csu/../sysdeps/x86_64/start.S:120
===================
It's probably an integer count that exceeds the precision of int32. What line does that correspond to in your code? Is it this line (per your MESH_Code commit):
allocate(inline_qi(na), inline_stgch(na), inline_qo(na))
Note that this time it crashed earlier, in 2065 instead of 2075. Memory utilization is 19.62% (even less than what I reported above).
Lines 882-883 of rte_module.F90 read:
!> Allocate the local variables for output averaging.
allocate(inline_qi(na), inline_stgch(na), inline_qo(na))
These are three 1D vectors of size "na", where na is the total number of active grids or subbasins (including outlets). The variables are locally defined in the subroutine, and the subroutine is called every time step. It is puzzling that it only stopped after so many time steps.
The variables are defined on lines 713-714 as follows:
!> Local variables for output averaging.
real, dimension(:), allocatable :: inline_qi, inline_stgch, inline_qo
Then they get assigned/updated on lines 952-961 within a loop. They get deallocated at the end of the routine (line 979) after their contents are passed to SA_MESH variables.
Any ideas/clues on how to overcome the issue?
Why would "na" change after running for 100+ years?
It shouldn't and I find it odd that it's having trouble with dynamic memory space after a number of iterations. I'm unaware of any constraint in Fortran that should limit it.
Try this...
714: real, dimension(na) :: inline_qi, inline_stgch, inline_qo
(remove allocatable and change dimension to 'na')
883: !- allocate(inline_qi(na), inline_stgch(na), inline_qo(na))
(comment line)
979: !- deallocate(inline_qi, inline_stgch, inline_qo)
(comment line)
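For reference, the end result would look roughly like this (a sketch only, not a drop-in patch; the actual averaging code is unchanged):

subroutine run_rte_averaging_sketch(na)
    !> Sketch: automatic arrays sized by the dummy argument replace the
    !> allocatable declarations plus the allocate/deallocate pair.
    integer, intent(in) :: na
    real, dimension(na) :: inline_qi, inline_stgch, inline_qo
    inline_qi = 0.0
    inline_stgch = 0.0
    inline_qo = 0.0
    !> ... output averaging goes here ...
    !> Storage is released automatically on return; no deallocate needed.
end subroutine run_rte_averaging_sketch

One thing to keep in mind: automatic arrays usually land on the stack, so a very large na might need a bigger stack limit (or ifort's -heap-arrays); for three 1D vectors it should not be an issue.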
I inserted a print statement in the routine to print "na" to make sure nothing weird changes it. I ran it for several days and it does not change - a good sign, but it does not explain what happened. I will make the allocation static as you suggest, repeat the run, and see.
@kasra-keshavarz I am still wondering about your question related to MPI. Do you think MPI has anything to do with this? Btw, routing is run on the head node and is not distributed across nodes, so far.
Now, after changing the allocation of the routing inline parameters to static, the error occurs on the same day but in another routine, and this time it is related to MPI, so @kasra-keshavarz may have had a gut feeling about it. Here is the line causing the error:
!> Allocate temporary arrays.
allocate(mpi_buffer_real1d(size(model_variables_to_head)*iin))
This is line 604 in routine run_within_tile_mpi_isend(fls, shd), which is part of the sa_mesh_run_within_tile.f90 module.
I guess the size of model_variables_to_head has grown so large that it causes a memory allocation problem. The question is whether this variable should be growing or not. Does it have any cumulative component?
That size doesn't change. I think it's the same issue... Some problem with a limit regarding the dynamic allocation.
Yes, I inserted print statements and let it run for a year and a bit to see if it would increase at a daily, monthly, or annual scale. model_variables_to_head = 62 all the time (for my case). iin is the number of tiles per node, which varies in my case between 779 and 798 but does not change over time, as expected.
What kind of limit could that be? I understand a limit on size, but not a limit on the number of times you allocate a variable!
How would we convert this one to be statically allocated? It is the product of two variables, and one of them depends on the node being looped over. Unlike the local routing variables, this one really needs to be dynamically allocated given the way the routine is currently structured.
However, I feel this could hit again in another location if we fix this one, so we may be chasing this forever.
Do you think a newer or a different compiler would be the solution? If anybody reported this as a compiler issue, they may have fixed it.
I agree. I don't think it should. I think when I asked Sujata to look into this, she may have confirmed this only happened with the MPI version, so maybe it's some compiler option in the wrapper for that -- It was a while ago, so I'm not certain.
I can provide changes to allocate the variable during the 'init' routine so it will be re-used, but I wonder if the same thing might come up elsewhere... I guess we'll see.
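Something along these lines, for illustration (a sketch with made-up names, not the actual MESH variables; the real change would live in the existing init and within-tile routines):

module mpi_buffer_sketch
    !> Sketch of the allocate-once idea: size the exchange buffer during
    !> init and reuse it on every call instead of reallocating it each
    !> time step.
    implicit none
    real, dimension(:), allocatable :: mpi_buffer_real1d
contains
    subroutine buffer_init(nvars, iin)
        !> nvars would correspond to size(model_variables_to_head) and
        !> iin to the number of tiles assigned to the node.
        integer, intent(in) :: nvars, iin
        if (.not. allocated(mpi_buffer_real1d)) allocate(mpi_buffer_real1d(nvars*iin))
    end subroutine buffer_init
end module mpi_buffer_sketch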
Newer compilers require newer versions of the MPI library. I will try to repeat the test with code compiled with a newer compiler and see. My setup is not small enough to test things without MPI. Sujata's basin was small enough to be used for such a test, but I am not sure I have the forcing for it.
I have now compiled the code using intel 2021 on Copernicus (make mpi_intel netcdf symbols). This uses openMPI 4.1.1, netcdf 4.8.0, and netcdf-fortran 4.5.3. The long simulation (starting in 1951) still crashed, in 2070, giving this error log:
[cnic-giws-cpu-19001-02:36842:0:36842] Caught signal 11 (Segmentation fault: Sent by the kernel at address (nil))
==== backtrace (tid: 36842) ====
0 0x000000000001f093 ucs_debug_print_backtrace() /tmp/ebuser/avx2/UCX/1.9.0/GCCcore-10.2.0/ucx-1.9.0/src/ucs/debug/debug.c:656
1 0x00000000000130f0 __funlockfile() :0
2 0x000000000007278c H5AC_protect() ???:0
3 0x0000000000229c59 H5O_protect() ???:0
4 0x000000000022aff0 H5O_pin() ???:0
5 0x00000000000f221b H5D__mark() ???:0
6 0x00000000000f3c92 H5D__set_extent() ???:0
7 0x00000000000b5889 H5Dset_extent() ???:0
8 0x000000000010f519 NC4_put_vars.a() hdf5var.c:0
9 0x000000000010eb36 NC4_put_vara() ???:0
10 0x000000000003db20 nc_put_vara_int() ???:0
11 0x0000000000027dff nf_put_vara_int_.a() nf_varaio.F90:0
12 0x00000000000bcb5e netcdf_mp_nf90_put_var_1d_fourbyteint_.a() netcdf4.f90:0
13 0x0000000000626c48 nc_io_mp_nc4_add_data_1d_int_() /gpfs/mdiops/globalhome/mee067/HPC/MESH/02_MESH_Code_EXE/r1860_ME.SUBBASINFLAG/./Modules/io_modules/nc_io.F90:6457
14 0x0000000000632b62 nc_io_mp_nc4_add_data_xyt_() /gpfs/mdiops/globalhome/mee067/HPC/MESH/02_MESH_Code_EXE/r1860_ME.SUBBASINFLAG/./Modules/io_modules/nc_io.F90:7869
15 0x0000000000ab2487 output_files_mp_output_files_update_file_() /gpfs/mdiops/globalhome/mee067/HPC/MESH/02_MESH_Code_EXE/r1860_ME.SUBBASINFLAG/./Driver/MESH_Driver/output_files.f90:2377
16 0x0000000000ab73fc output_files_mp_output_files_update_field_() /gpfs/mdiops/globalhome/mee067/HPC/MESH/02_MESH_Code_EXE/r1860_ME.SUBBASINFLAG/./Driver/MESH_Driver/output_files.f90:2645
17 0x0000000000ab8dea output_files_mp_output_files_update_() /gpfs/mdiops/globalhome/mee067/HPC/MESH/02_MESH_Code_EXE/r1860_ME.SUBBASINFLAG/./Driver/MESH_Driver/output_files.f90:2741
18 0x0000000000b4162b MAIN__() /gpfs/mdiops/globalhome/mee067/HPC/MESH/02_MESH_Code_EXE/r1860_ME.SUBBASINFLAG/./Driver/MESH_Driver/MESH_driver.f90:1030
19 0x000000000040cc52 main() ???:0
20 0x0000000000023e1b __libc_start_main() /cvmfs/soft.computecanada.ca/gentoo/2020/usr/src/debug/sys-libs/glibc-2.30-r8/glibc-2.30/csu/../csu/libc-start.c:308
21 0x000000000040cb6a _start() ???:0
=================================
forrtl: severe (174): SIGSEGV, segmentation fault occurred
Image PC Routine Line Source
mpi_sa_mesh_Resum 0000000000B6978A Unknown Unknown Unknown
libpthread-2.30.s 00002AAAAB5F60F0 Unknown Unknown Unknown
libhdf5.so.103.3. 00002AAAAEF9C78C H5AC_protect Unknown Unknown
libhdf5.so.103.3. 00002AAAAF153C59 H5O_protect Unknown Unknown
libhdf5.so.103.3. 00002AAAAF154FF0 H5O_pin Unknown Unknown
libhdf5.so.103.3. 00002AAAAF01C21B H5D__mark Unknown Unknown
libhdf5.so.103.3. 00002AAAAF01DC92 H5D__set_extent Unknown Unknown
libhdf5.so.103.3. 00002AAAAEFDF889 H5Dset_extent Unknown Unknown
libnetcdf.so.19 00002AAAAB8EE519 Unknown Unknown Unknown
libnetcdf.so.19 00002AAAAB8EDB36 NC4_put_vara Unknown Unknown
libnetcdf.so.19 00002AAAAB81CB20 nc_put_vara_int Unknown Unknown
libnetcdff.so.7.0 00002AAAAAB01DFF Unknown Unknown Unknown
libnetcdff.so.7.0 00002AAAAAB96B5E Unknown Unknown Unknown
mpi_sa_mesh_Resum 0000000000626C48 nc_io_mp_nc4_add_ 6457 nc_io.F90
mpi_sa_mesh_Resum 0000000000632B62 nc_io_mp_nc4_add_ 7869 nc_io.F90
mpi_sa_mesh_Resum 0000000000AB2487 output_files_mp_o 2377 output_files.f90
mpi_sa_mesh_Resum 0000000000AB73FC output_files_mp_o 2645 output_files.f90
mpi_sa_mesh_Resum 0000000000AB8DEA output_files_mp_o 2741 output_files.f90
mpi_sa_mesh_Resum 0000000000B4162B MAIN__ 1030 MESH_driver.f90
mpi_sa_mesh_Resum 000000000040CC52 Unknown Unknown Unknown
libc-2.30.so 00002AAAAB626E1B __libc_start_main Unknown Unknown
mpi_sa_mesh_Resum 000000000040CB6A Unknown Unknown Unknown
forrtl: severe (174): SIGSEGV, segmentation fault occurred
Image PC Routine Line Source
mpi_sa_mesh_Resum 0000000000B6982E Unknown Unknown Unknown
libpthread-2.30.s 00002AAAAB5F60F0 Unknown Unknown Unknown
libhdf5.so.103.3. 00002AAAAEF9B222 H5AC_flush Unknown Unknown
libhdf5.so.103.3. 00002AAAAF067471 H5F__dest Unknown Unknown
libhdf5.so.103.3. 00002AAAAF068C17 H5F_try_close Unknown Unknown
libhdf5.so.103.3. 00002AAAAF0688AC H5F__close_cb Unknown Unknown
libhdf5.so.103.3. 00002AAAAF101CAE Unknown Unknown Unknown
libhdf5.so.103.3. 00002AAAAF217D10 H5SL_try_free_saf Unknown Unknown
libhdf5.so.103.3. 00002AAAAF101BA9 H5I_clear_type Unknown Unknown
libhdf5.so.103.3. 00002AAAAF06110E H5F_term_package Unknown Unknown
libhdf5.so.103.3. 00002AAAAEF7CD9A H5_term_library Unknown Unknown
libc-2.30.so 00002AAAAB63DFC7 Unknown Unknown Unknown
libc-2.30.so 00002AAAAB63E17A Unknown Unknown Unknown
mpi_sa_mesh_Resum 0000000000B5CCDC Unknown Unknown Unknown
mpi_sa_mesh_Resum 0000000000B6978A Unknown Unknown Unknown
libpthread-2.30.s 00002AAAAB5F60F0 Unknown Unknown Unknown
libhdf5.so.103.3. 00002AAAAEF9C78C H5AC_protect Unknown Unknown
libhdf5.so.103.3. 00002AAAAF153C59 H5O_protect Unknown Unknown
libhdf5.so.103.3. 00002AAAAF154FF0 H5O_pin Unknown Unknown
libhdf5.so.103.3. 00002AAAAF01C21B H5D__mark Unknown Unknown
libhdf5.so.103.3. 00002AAAAF01DC92 H5D__set_extent Unknown Unknown
libhdf5.so.103.3. 00002AAAAEFDF889 H5Dset_extent Unknown Unknown
libnetcdf.so.19 00002AAAAB8EE519 Unknown Unknown Unknown
libnetcdf.so.19 00002AAAAB8EDB36 NC4_put_vara Unknown Unknown
libnetcdf.so.19 00002AAAAB81CB20 nc_put_vara_int Unknown Unknown
libnetcdff.so.7.0 00002AAAAAB01DFF Unknown Unknown Unknown
libnetcdff.so.7.0 00002AAAAAB96B5E Unknown Unknown Unknown
mpi_sa_mesh_Resum 0000000000626C48 nc_io_mp_nc4_add_ 6457 nc_io.F90
mpi_sa_mesh_Resum 0000000000632B62 nc_io_mp_nc4_add_ 7869 nc_io.F90
mpi_sa_mesh_Resum 0000000000AB2487 output_files_mp_o 2377 output_files.f90
mpi_sa_mesh_Resum 0000000000AB73FC output_files_mp_o 2645 output_files.f90
mpi_sa_mesh_Resum 0000000000AB8DEA output_files_mp_o 2741 output_files.f90
mpi_sa_mesh_Resum 0000000000B4162B MAIN__ 1030 MESH_driver.f90
mpi_sa_mesh_Resum 000000000040CC52 Unknown Unknown Unknown
libc-2.30.so 00002AAAAB626E1B __libc_start_main Unknown Unknown
mpi_sa_mesh_Resum 000000000040CB6A Unknown Unknown Unknown
srun: error: cnic-giws-cpu-19001-02: task 0: Exited with exit code 174
slurmstepd: error: mpi/pmix_v4: _errhandler: cnic-giws-cpu-19001-02 [0]: pmixp_client_v2.c:212: Error handler invoked: status = -61, source = [slurm.pmix.1847143.0:0]
srun: Job step aborted: Waiting up to 32 seconds for job step to finish.
slurmstepd: error: *** STEP 1847143.0 ON cnic-giws-cpu-19001-02 CANCELLED AT 2023-11-07T23:38:53 ***
srun: error: cnic-giws-cpu-19001-02: tasks 1-31: Killed
Program finished with exit code 137 at: Tue Nov 7 23:38:54 CST 2023
The error occurs in subroutine 'nc4_add_data_1d_int' in nc_io.F90; the line reads:
ierr = nf90_put_var(iun, vid, dat, start = start)
This is a simple call to one of the nc library routines to write a variable. The calling routine nc4_add_data_xyt is trying to write the time axis value for some output. I am writing daily output in nc format, and the time in days is 43,504 > 32,768, which is the end of the integer range. I believe it should have stopped at 32,768, or, if we consider it unsigned, only when reaching 65,535.
While it is still weird, this error makes more sense than the dynamic allocation errors mentioned above. I guess the way out is to use long or even float for the time axis. If we have hourly output, the integer range will get exhausted in less than 4 years.
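For example, the time variable could be defined as double precision; a minimal sketch of that (illustrative names only, not the actual nc_io.F90 code, assuming netcdf-fortran is available):

program time_axis_sketch
    !> Sketch: declare the time coordinate as NF90_DOUBLE so long daily or
    !> hourly records never come near any integer limit.
    use netcdf
    implicit none
    integer :: ierr, iun, did_t, vid_t
    ierr = nf90_create('sketch.nc', NF90_NETCDF4, iun)
    ierr = nf90_def_dim(iun, 'time', NF90_UNLIMITED, did_t)
    ierr = nf90_def_var(iun, 'time', NF90_DOUBLE, (/ did_t /), vid_t)
    ierr = nf90_put_att(iun, vid_t, 'units', 'days since 1951-01-01 00:00:00')
    ierr = nf90_enddef(iun)
    ierr = nf90_close(iun)
end program time_axis_sketch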
Any suggestions?
I checked another simulation where I had hourly output and the time axis reads:
int time(time) ;
time:units = "hours since 1951-01-01 00:00:00.000000"
The time is currently at 1,016,856, which is much larger than the integer and unsigned integer ranges - there is something I do not understand. How can nc handle an integer value that is out of range for integers?
Interesting, and yes, that does make more sense. Compile the code with double integer (gfortran: -fdefault-integer-8, ifort: -i8) and see if it's any different. You'll have to modify the makefile (I don't think I preserved the 'double-integer' option when I updated it).
Will try that. But can you explain how nc still handles an int time value that is out of the int range?
Do you want to go all the way to 8-byte integers instead of trying 4-byte ones first? And if we use 4-byte or 8-byte, wouldn't it be better to go to float? Or would that make us lose some precision?
The default is 4-byte precision for integer. This will change it to 8-byte precision for integer (i.e., from int to long). Float only relates to type real. Those can be changed with the 'double' option with the existing makefile.
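A quick way to confirm the default integer range on a given compiler (just a sketch; selected_int_kind(18) resolves to an 8-byte kind with gfortran and ifort):

program int_range_sketch
    implicit none
    integer, parameter :: i8 = selected_int_kind(18)
    !> Prints the largest representable value for the default integer and
    !> for an 8-byte integer.
    print *, 'default integer huge =', huge(0)
    print *, '8-byte integer huge  =', huge(0_i8)
end program int_range_sketch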
I thought the default was 2-byte. If the default for integers is 4-byte, then all the above becomes baseless; we are still within the range (-2,147,483,648 to 2,147,483,647), or (0 to 4,294,967,295) if unsigned.
Anyway, -i8 caused a compile error:
./Driver/MESH_Driver/program_end.f90(16): error #6285: There is no matching specific subroutine for this generic subroutine call. [MPI_FINALIZE]
call MPI_Finalize(ierr)
---------^
compilation aborted for ./Driver/MESH_Driver/program_end.f90 (code 1)
make: *** [makefile:197: program_end.o] Error 1
I guess the MPI module's interfaces still expect a default (4-byte) integer (e.g. for ierr), so with -i8 there is no matching specific routine. The problematic call is this line:
call MPI_Finalize(ierr)
What happens if in Modules/io_modules/nc_io.F90 you change line 5442 to:
ierr = nf90_create(fpath, ior(NF90_NETCDF4, NF90_CLASSIC_MODEL), iun)
I'm wondering if that might remove some of the HDF5 things... It should still produce a NetCDF4 file, but in the classic format. I don't think we use any special features of NetCDF4.
Should I do that with the -i8 option? I mean: is this to overcome the compilation issue, or the crash of long simulations?
It's not necessary. You determined it should have no effect.
to overcome the crash of long simulations
It's not necessary. You determined it should have no effect.
Not necessary, and not working either. It compiled, but upon running it couldn't create the nc output files:
READING: outputs_balance.txt
ERROR: An error occurred saving the 'lat' variable (Code -39).
ERROR: Errors occurred while applying the output configuration for 'SNO' (Line 1).
ERROR: An error occurred saving the 'lat' variable (Code -39).
ERROR: Errors occurred while applying the output configuration for 'FSNO' (Line 2).
ERROR: An error occurred saving the 'lat' variable (Code -39).
ERROR: Errors occurred while applying the output configuration for 'ZSNO' (Line 3).
ERROR: An error occurred saving the 'lat' variable (Code -39).
ERROR: Errors occurred while applying the output configuration for 'RFF' (Line 4).
ERROR: Errors occurred while reading outputs_balance.txt.
Abnormal exit.
1
srun: error: cnic-giws-cpu-19001-03: task 0: Exited with exit code 1
slurmstepd: error: mpi/pmix_v4: _errhandler: cnic-giws-cpu-19001-03 [0]: pmixp_client_v2.c:212: Error handler invoked: status = -61, source = [slurm.pmix.1853231.0:0]
slurmstepd: error: *** STEP 1853231.0 ON cnic-giws-cpu-19001-03 CANCELLED AT 2023-11-08T14:22:23 ***
srun: Job step aborted: Waiting up to 32 seconds for job step to finish.
srun: error: cnic-giws-cpu-19001-03: tasks 1-15: Killed
Program finished with exit code 137 at: Wed Nov 8 14:22:24 CST 2023
This is a grid-based setup.
Given that the above errors were related to nc output, I made a long run saving no nc output, just the csv basin average and reach output. It still crashed after 120 years of simulation:
[cnic-giws-cpu-19001-02:260329:0:260329] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x2aaadcccb000)
==== backtrace (tid: 260329) ====
0 0x000000000001f093 ucs_debug_print_backtrace() /tmp/ebuser/avx2/UCX/1.9.0/GCCcore-10.2.0/ucx-1.9.0/src/ucs/debug/debug.c:656
1 0x00000000000130f0 __funlockfile() :0
2 0x00000000009e4f35 rte_module_mp_run_rte_between_grid_() /gpfs/mdiops/globalhome/mee067/HPC/MESH/02_MESH_Code_EXE/r1860_ME.SUBBASINFLAG/./Routing_Model/RPN_watroute/sa_mesh_process/rte_module.f90:789
3 0x0000000000b00534 sa_mesh_run_between_grid_mp_run_between_grid_() /gpfs/mdiops/globalhome/mee067/HPC/MESH/02_MESH_Code_EXE/r1860_ME.SUBBASINFLAG/./Driver/MESH_Driver/sa_mesh_run_between_grid.f90:551
4 0x0000000000b41608 MAIN__() /gpfs/mdiops/globalhome/mee067/HPC/MESH/02_MESH_Code_EXE/r1860_ME.SUBBASINFLAG/./Driver/MESH_Driver/MESH_driver.f90:970
5 0x000000000040cc52 main() ???:0
6 0x0000000000023e1b __libc_start_main() /cvmfs/soft.computecanada.ca/gentoo/2020/usr/src/debug/sys-libs/glibc-2.30-r8/glibc-2.30/csu/../csu/libc-start.c:308
7 0x000000000040cb6a _start() ???:0
The line causing the issue in rte_module reads:
if (fms%stmg%n > 0) qhyd(:, fhr) = real(fms%stmg%qomeas%val, kind(qhyd))
The condition seems too simple to cause issues; I suspect the assignment may have an issue. Thus I checked:
qhyd is declared in area_watflood at line 366. That module is used by rte_module and is kind of shared.
real*4, dimension(:,:), allocatable :: qhyd,qsyn,qloc,
* delta,frc,frcs,
* qhyd_dly,qsyn_dly,qhyd_mly,qsyn_mly
qhyd is allocated on lines 404-414 of rte_module.F90, which fall within the run_rte_init routine.
!> Streamflow gauge locations.
no = fms%stmg%n
if (fms%stmg%n > 0) then
allocate( &
iflowgrid(no), nopt(no), &
!todo: fix this (999999).
**qhyd(no, 999999))**
iflowgrid = fms%stmg%meta%rnk
nopt = -1
qhyd(:, 1) = real(fms%stmg%qomeas%val, kind(qhyd))
end if
I traced the structures and declarations of fms%stmg%qomeas%val and found it to be real (precision not specified) - I think the typecasting is only effective when the code is compiled in double precision (which is not the case here).
fhr is an integer counter declared in area_watflood - I am not sure what it stands for; it looks like a time counter in hours. It is incremented by 1 before the line in question. Could it be causing a dimension overflow? 120 years x 365 days x 24 hours = 1,051,200 hours, which is well within the integer range but > 999,999 - see the bolded line. I guess that's the issue. It was flagged as something to fix, but it has not been fixed so far. I will redo the test making it 9,999,999, but I do not think this should be the ultimate solution. This variable is used to store the observed station values, which are read from file at a given frequency. Why do we need to store all values? I think the STATS are computed in some cumulative way that does not require this.
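If that is indeed the issue, it would also explain why the crash can show up in different places: once fhr passes the allocated extent, the assignment writes out of bounds and may corrupt unrelated memory long before anything visibly fails. A minimal sketch of that failure mode (not MESH code; compiling with bounds checking, e.g. ifort -check bounds, would flag it immediately):

program out_of_bounds_sketch
    implicit none
    real, allocatable :: qhyd(:,:)
    integer :: fhr
    allocate(qhyd(3, 999999))
    fhr = 1000000              ! one past the allocated second dimension
    qhyd(:, fhr) = 0.0         ! out-of-bounds write: undefined behaviour that
                               ! may only crash much later, or elsewhere
end program out_of_bounds_sketch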
The code was compiled using intel 2021 on Copernicus (make mpi_intel netcdf symbols), with openMPI 4.1.1, netcdf 4.8.0, and netcdf-fortran 4.5.3. I made sure all those modules are loaded at run time.
It seems this 999,999 is used temporarily as a big number for other things like reservoir releases, lake elevations, and so on, all marked with ! todo: fix this (999999) comments. Has anybody worked on that @dprincz - maybe EG?
I checked the newer routing code (which contains VF's new solver and EG's changes). The 999,999 thing was not fixed there either. It is a waste of memory in my opinion.
Where you find allocations using 999999, replace it with 1. Where you see fhr += 1, comment that out so fhr always stays 1.
So you do agree that we do not need to keep the full time series of those variables - are you sure nothing else will be affected?
As long as you update both those AND fhr, nothing will be affected. To confirm, when you change those variables allocated to (:, 999999) to be allocated to (:, 1), you can scan the code for those variable names to see that the second index is only ever taken using fhr.
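Based on the snippet quoted above, the change would look roughly like this (a sketch only; the exact increment statement may read differently in the code):

!> Streamflow gauge locations.
no = fms%stmg%n
if (fms%stmg%n > 0) then
    allocate( &
        iflowgrid(no), nopt(no), &
        qhyd(no, 1))                 !- was qhyd(no, 999999)
    iflowgrid = fms%stmg%meta%rnk
    nopt = -1
    qhyd(:, 1) = real(fms%stmg%qomeas%val, kind(qhyd))
end if

!> ...and wherever fhr is incremented:
!- fhr = fhr + 1                     ! commented so fhr always stays 1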
In the long run, those should be vectors rather than 2D variables - there is no need for a second dimension if it is always 1.
There is only one thing that worries me, in run_rte_resume_read:
!> Read inital values from the file.
read(iun) fhr_i4
fhr = int(fhr_i4)
fhr could be something other than 1 if the value read from file isn't 1. Is this routine still used?
The variables are defined as arrays because WATFLOOD would read a full file of records. They should remain as-is, with the second dimension allocated to 1.
Replace that line with a dummy read statement, or update the line to assign it 1 in any case:
!> Read inital values from the file.
read(iun) fhr_i4
fhr = 1 !int(fhr_i4)
Are you saying the 2D thing will be needed if the routing is compiled as a stand-alone application?
run_rte_resume_read is used if the seq format is used for resuming. Reading an initial value, how could we have a value larger than 1 for fhr_i4?
Just in from Mazda: a routing-only run starting in 1951 and run using serially compiled MESH on a single core still crashed, in 2073, due to a segmentation fault. The code was not compiled with "symbols" to show where it occurred, but I could see something like "for_alloc_allocat" in the error log.
This negates the previous indication that the issue is MPI-related. @dprincz mentioned that the crash did not occur for Sujata when she ran a serial version. If the issue is that 999,999 thing, I believe MPI has nothing to do with it; it should occur either way.
I am still running tests with the changes advised above.
That was it. As far as I have tested, with code compiled with both intel 2018 and 2021, and irrespective of having nc output, the 999999 fix has allowed both simulations to go all the way from 1951 to 2100 successfully 👍
Both simulations were MPI - so that was not the constraint. The annoying thing is that the error was caused by a certain statement, but the 2018 version threw the error somewhere else!
Awesome. Now all those !todo's have finally been addressed. I'll close this now. Thanks for your efforts.
Should I reverse the static allocation of those inline variables:
real, dimension(na) :: inline_qi, inline_stgch, inline_qo
before uploading the code changes? I will update my code version, of course.
Leaving it as-changed should be fine.
This issue has been reported a few times by Sujata, Mazda, and possibly others: long simulations crash at some point. I used not to face this issue for the MRB because simulations ran out of job time before reaching such a crash point. I am running the Yukon long simulations now and got this crash in the year 2075 - the simulation started in 1951. Resuming the run from saved states (using the SAVERESUME auto options) is a possible way to continue the simulation and concatenate the results later. However, Mazda reported that resuming in routing-only mode is not done properly - I will try to reproduce and report that other issue separately. Here is a log of the crash:
This looks like a memory issue with the routing routine "route". I am re-running with code compiled with "symbols" to try to locate the error, but it takes much longer than a usual run (almost 3-4 times longer). Are there any variables in "route" that could be overflowing or exceeding their dimensions as time progresses?