MESAHub / mesa

Modules for Experiments in Stellar Astrophysics
http://docs.mesastar.org
GNU Lesser General Public License v2.1

proposal: error code instead of core dump when max_allowed_nz is met #621

Closed: mjoyceGR closed this issue 3 months ago

mjoyceGR commented 6 months ago

Edit: version mesa-r23.05.1

Per discussion with Yaguang Li @parallelpro:

when the number of zones (nz) exceeds max_allowed_nz, which defaults to 8000,

! mesh adjustment
! ===============

    ! max_allowed_nz
    ! ~~~~~~~~~~~~~~
    ! Maximum number of grid points allowed.
    ! ::

    max_allowed_nz = 8000

the response from MESA is a scary-looking core dump.

Can we instead offer a termination code when nz = max_allowed_nz?

Debraheem commented 6 months ago

Interesting; shouldn't there be a message written to the terminal when max_allowed_nz is exceeded?

https://github.com/MESAHub/mesa/blob/a965ec699b2b4978c43b0b305f40446b1aff05f0/star/private/mesh_plan.f90#L414-L417

and

https://github.com/MESAHub/mesa/blob/a965ec699b2b4978c43b0b305f40446b1aff05f0/star/private/mesh_plan.f90#L871-L874

In a brief test intentionally crashing a test_suite model, I get:

 tried to increase number of mesh points beyond max allowed nz 1000

 mesh_plan problem doing mesh_call_number 2009 s% model_number 2400

 terminated evolution: adjust_mesh_failed
 termination code: adjust_mesh_failed

This is followed by a backtrace. I'm surprised it doesn't appear in your attachment?

orlox commented 6 months ago

This is what I get in the basic star work directory just by setting max_dq to a very small value:

 tried to increase number of mesh points beyond max allowed nz        8000

 mesh_plan problem
 doing mesh_call_number           1
 s% model_number           1

terminated evolution: adjust_mesh_failed
termination code: adjust_mesh_failed
double free or corruption (!prev)

Program received signal SIGABRT: Process abort signal.

Backtrace for this error:
#0  0x72666e05a76f in ???
#1  0x72666e0ab32c in ???
#2  0x72666e05a6c7 in ???
#3  0x72666e0424b7 in ???
#4  0x72666e043394 in ???
#5  0x72666e0b52a6 in ???
#6  0x72666e0b737b in ???
#7  0x72666e0b7668 in ???
#8  0x72666e0b9e92 in ???
#9  0x4b96dc in __alloc_MOD_do2d
    at ../private/alloc.f90:1816
#10  0x4b9dff in do2
    at ../private/alloc.f90:1574
#11  0x4bb687 in __alloc_MOD_star_info_arrays
    at ../private/alloc.f90:512
#12  0x4c21a0 in __alloc_MOD_free_arrays
    at ../private/alloc.f90:281
#13  0x4c2234 in __alloc_MOD_free_star_data
    at ../private/alloc.f90:235
#14  0x40f7c2 in __star_lib_MOD_free_star
    at ../public/star_lib.f90:113
#15  0x42202a in __run_star_support_MOD_after_evolve_loop
    at ../job/run_star_support.f90:904
#16  0x426616 in __run_star_support_MOD_run1_star
    at ../job/run_star_support.f90:123
#17  0x40753c in __run_star_MOD_do_run_star
    at /home/pablom/work/mesa_versions/mesa-r23.05.1/star/job/run_star.f90:26
#18  0x4075dc in run
    at ../src/run.f90:16
#19  0x40761e in main
    at ../src/run.f90:2
./rn: line 6: 809122 Aborted                 (core dumped) ./star
DATE: 2024-02-15
TIME: 16:23:34

mjoyceGR commented 6 months ago

OK, thanks. It's possible that the useful termination-condition message was redirected elsewhere and we missed it.

warrickball commented 6 months ago

I won't have time to look into this for >3 weeks, but I think @mjoyceGR still has a point. Even if MESA detects the error and prints a message, why do we still hit the core dump and backtrace? My memory might be getting rusty while I'm not running MESA so much, but I thought we usually exited relatively gracefully on known termination conditions.

mjoyceGR commented 6 months ago

I think @parallelpro is going to come here and post more about his data output configuration in a day or two to confirm the error message is truly missing, but a related question in the meantime:

Is there a way to set max_allowed_nz arbitrarily high? max_allowed_nz = -1 does not work (confirmed on r23.05.1), even though setting -1 does permit arbitrarily high upper limits for many other controls. I realize there may be a legitimate reason to disallow this for nz, given that it sets the size of the most important arrays.

parallelpro commented 6 months ago

Thank you all for the testing! I will explain a bit about my setup here. I set up a Python script called driver.py to modify an inlist template, and it contains the following snippet to initiate the MESA run and redirect all output to a log file.

import os  # needed for os.system; 'index' is set earlier in driver.py
print('------ MESA start ------')
# run MESA via the rn script, redirecting all terminal output to a per-run log file
os.system('sh rn > mesa_terminal_output_index{:06.0f}.txt'.format(index))
print('------ MESA done ------')

The last few lines of this mesa_terminal_output_index000000.txt log file are shown below (it looks truncated; does that give a clue?):

       7320   7.789895   3655.816   2.955703   2.955706   1.000000   0.612600   0.000000   0.006532   0.272021  20.559330   7837      0
 3.0224E+00   7.771256   1.876429  -5.649074   1.789376 -99.000000   0.387400   0.986633   0.001747   0.013447   0.841667      6
 1.1999E+10   5.773430   2.959489  -2.230556 -18.794754  -8.017279   0.000000   0.000038   0.001396   0.013367  0.000E+00    varcontrol

save LOGS/profile3656.data LOGS/profile3656.data.FGONG for model 7320
save LOGS/profile3657.data LOGS/profile3657.data.FGONG for model 7322
save LOGS/profile3658.data LOGS/profile3658.data.FGONG for model 7324
save LOGS/profile3659.data LOGS/profile3659.data.FGONG for model 7326
save LOGS/profile3660.data LOGS/profile3660.data.FGONG for model 7328
       7330   7.790135   3655.106   2.956606   2.956609   1.000000   0.612467   0.000000   0.006532   0.272021  20.560022   7844      0
DATE: 2024-02-07
TIME: 08:48:33

I submitted the following job to an HPC cluster that uses the Slurm workload management system.

#!/bin/bash
#SBATCH --job-name=test
#SBATCH --partition=shared
#SBATCH --time=02-00:00:00 ## time format is DD-HH:MM:SS
#SBATCH --cpus-per-task=12
#SBATCH --mem=64G ## max amount of memory per node you require
#SBATCH --error=test-%A_%a.err ## %A - filled with jobid
#SBATCH --output=test-%A_%a.out ## %A - filled with jobid
#SBATCH --mail-type=BEGIN,END,FAIL,REQUEUE,TIME_LIMIT_80
#SBATCH --mail-user=yaguangl@hawaii.edu

## All options and environment variables found on schedMD site: http://slurm.schedmd.com/sbatch.html

# record time
date
hostname

# change to zsh
# module purge
source /home/yaguangl/custom_setup.sh
source /home/yaguangl/.zshrc

# navigate to the mesa directory
cd template_sun_0.2/

# activate astro
micromamba activate astro

sh clean
sh mk
python driver.py 0

date

Upon completion, I received the following log file, test-1110897_4294967294.out, from Slurm:

Wed Feb  7 00:23:17 UTC 2024
cn-02-03-06
gfortran -Wno-uninitialized -fno-range-check -fmax-errors=7  -fprotect-parens -fno-sign-zero -fbacktrace -ggdb -finit-real=snan -fopenmp -fbounds-check -Wuninitialized -Warray-bounds -ggdb -ffree-form -ffree-line-length-none -x f95-cpp-input -std=f2008 -Wno-error=tabs -I/home/yaguangl/mesa-r23.05.1/include -I../src -c ../src/run_star_extras.f90
gfortran -Wno-uninitialized -fno-range-check -fmax-errors=7  -fprotect-parens -fno-sign-zero -fbacktrace -ggdb -finit-real=snan -fopenmp -fbounds-check -Wuninitialized -Warray-bounds -ggdb -ffree-form -ffree-line-length-none -x f95-cpp-input -std=f2008 -Wno-error=tabs -I/home/yaguangl/mesa-r23.05.1/include -I../src -c /home/yaguangl/mesa-r23.05.1/star/job/run_star.f90
gfortran -Wno-uninitialized -fno-range-check -fmax-errors=7  -fprotect-parens -fno-sign-zero -fbacktrace -ggdb -finit-real=snan -fopenmp -fbounds-check -Wuninitialized -Warray-bounds -ggdb -ffree-form -ffree-line-length-none -x f95-cpp-input -std=f2008 -Wno-error=tabs -I/home/yaguangl/mesa-r23.05.1/include -I../src -c ../src/run.f90
gfortran -fopenmp -o ../star  run_star_extras.o run_star.o  run.o  -L/home/yaguangl/mesa-r23.05.1/lib -lstar -lgyre -latm -lcolors -lturb -lstar_data -lnet -leos -lkap -lrates -lneu -lchem -linterp_2d -linterp_1d -lnum -lauto_diff -lhdf5io -lmtx -lconst -lmath -lutils `mesasdk_crmath_link` `mesasdk_lapack95_link` `mesasdk_lapack_link` `mesasdk_blas_link` `mesasdk_hdf5_link`  `mesasdk_pgplot_link` -lz  -lgyre
Now calculating  index000000m1.000a1.9Y0.25Z0.013.mod
------ MESA start ------
------ MESA done ------
Wed Feb  7 08:48:35 UTC 2024

and the following error output in test-1110897_4294967294.err:

/home/yaguangl/mesasdk/bin/../lib/gcc/x86_64-pc-linux-gnu/13.1.0/../../../../x86_64-pc-linux-gnu/bin/ld: warning: atm_support.o: requires executable stack (because the .note.GNU-stack section is executable)
double free or corruption (!prev)

Program received signal SIGABRT: Process abort signal.

Backtrace for this error:
#0  0x1552585447f2 in ???
#1  0x155258543985 in ???
#2  0x1552581a7daf in ???
#3  0x1552581f454c in ???
#4  0x1552581a7d05 in ???
#5  0x15525817b7f2 in ???
#6  0x15525817c12f in ???
#7  0x1552581fe616 in ???
#8  0x15525820030b in ???
#9  0x155258202954 in ???
#10  0x4b99bc in __alloc_MOD_do2d
    at ../private/alloc.f90:1816
#11  0x4ba11f in do2
    at ../private/alloc.f90:1559
#12  0x4bba17 in __alloc_MOD_star_info_arrays
    at ../private/alloc.f90:512
#13  0x4c2510 in __alloc_MOD_free_arrays
    at ../private/alloc.f90:281
#14  0x4c25a4 in __alloc_MOD_free_star_data
    at ../private/alloc.f90:235
#15  0x40f9fd in __star_lib_MOD_free_star
    at ../public/star_lib.f90:788
#16  0x4223ca in __run_star_support_MOD_after_evolve_loop
    at ../job/run_star_support.f90:903
#17  0x4269a6 in __run_star_support_MOD_run1_star
    at ../job/run_star_support.f90:66
#18  0x407906 in __run_star_MOD_do_run_star
    at /home/yaguangl/mesa-r23.05.1/star/job/run_star.f90:26
#19  0x40799e in run
    at ../src/run.f90:16
#20  0x4079e0 in main
    at ../src/run.f90:2
rn: line 6: 3018003 Aborted                 (core dumped) ./star

In all cases, the mesh-point message that @orlox sees does not appear on my system. Any ideas? Thanks again for looking into this!

warrickball commented 6 months ago

I just added max_dq = 1d-10 to a standard star/work folder with the latest development version of MESA and SDK 23.7.3 on Linux (Fedora 39) and didn't get the backtrace. I do get the backtrace if I also start from a main-sequence model by commenting out the lines with create_pre_main_sequence_model, Lnuc_div_L_zams_limit and stop_near_zams. I'll use this to start investigating when I have another chance to have a look.

Regarding the message not showing up for @parallelpro, that sounds like there might be some buffering in the output stream, and MESA crashes before everything is written out.
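
As an aside, here is a minimal, self-contained Fortran sketch of that buffering point (illustration only, not MESA code; the message text is just copied from the output above): if the unit is explicitly flushed right after the write, the message survives even if the process aborts immediately afterwards.

program flush_demo
   ! illustration only: flush stdout so a buffered message is not lost
   ! if the process aborts right after writing it
   use iso_fortran_env, only: output_unit
   implicit none
   integer, parameter :: max_allowed_nz = 8000
   write(*,*) 'tried to increase number of mesh points beyond max allowed nz', max_allowed_nz
   flush(output_unit)   ! force the buffered line out before any subsequent crash
end program flush_demo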

A simple workaround, incidentally, is to just set max_allowed_nz to some crazy large value, like 100000. I don't mind unleashing it completely with e.g. -1 but I'm not sure how a system will behave if MESA tries to create too large a mesh. My hunch is that the OS will just kill the process, which is fine, but I'd like to rule out bringing anyone's system to an unusable halt first.
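
For concreteness, a minimal sketch of that workaround in an inlist's &controls namelist (100000 is just the illustrative value above, not a recommendation):

&controls
   ! raise the mesh-size cap well above the default of 8000
   max_allowed_nz = 100000
/ ! end of controls namelist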

warrickball commented 6 months ago

I've opened PR #630 to try to fix the core dump. I'll investigate unleashing max_allowed_nz when I next have some MESA time.

warrickball commented 5 months ago

I experimented very briefly with absurdly large values of max_allowed_nz and I don't think there's a fundamental reason that it has to be limited. The OS should kill the job if it takes too much memory but, with the basic net, MESA didn't object to briefly creating a model with over a million mesh points, and it didn't crash the computer I tried it on.

MESA did crash during the remeshing, however, for some other reason that produced the same segfault as in this issue. There's still some pointer that we try to free before it's allocated if the remeshing fails, but the failure raised in this issue should now be a simple stop rather than a segfault, following my changes in #630. If @parallelpro can confirm that the fix I've made (which can be backported) turns the current segfault into an error, this can be closed. (MESA should still crash if you don't also increase max_allowed_nz.)

Looking further ahead, I propose that:

  1. We implement max_allowed_nz = -1 as an option that means the size of the mesh is unlimited (a rough sketch of the intended check is included below).
  2. We retain a finite default, though 8000 is probably too low. (IMO the question is: how big a mesh indicates that something is going wrong in a calculation?)

I'll have a look at implementing -1 when I next get some MESA time™ and will start discussing 2 among the devs.
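
To make point 1 concrete, here is a rough, hypothetical sketch of what such a check could look like; the routine and variable names (check_nz_limit, nz_new, ierr) are placeholders, not the actual mesh_plan.f90 code:

subroutine check_nz_limit(nz_new, max_allowed_nz, ierr)
   ! hypothetical sketch of proposal 1: a non-positive max_allowed_nz means "no limit"
   implicit none
   integer, intent(in) :: nz_new, max_allowed_nz
   integer, intent(out) :: ierr
   ierr = 0
   if (max_allowed_nz > 0 .and. nz_new > max_allowed_nz) then
      write(*,*) 'tried to increase number of mesh points beyond max allowed nz', max_allowed_nz
      ierr = -1   ! report failure so the run can end with a termination code instead of aborting
   end if
end subroutine check_nz_limit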

warrickball commented 5 months ago

@parallelpro What choice of mesh parameters is leading to more than 8000 zones, and in what stage of evolution?

parallelpro commented 5 months ago

Thanks, @warrickball! The configuration that leads to more than 8000 zones is setting mesh_delta_coeff = 0.1 for a resolution study. For a typical 1 Msun track, the subgiant phase can easily reach more than 8000 zones. I can test your MESA fix in the next few days.
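
For reference, the kind of &controls combination I mean looks roughly like this (the raised max_allowed_nz value is only illustrative, echoing the workaround discussed above):

&controls
   mesh_delta_coeff = 0.1    ! finer spatial resolution for the resolution study
   max_allowed_nz = 100000   ! let the subgiant phase exceed the default cap of 8000 zones
/ ! end of controls namelist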

Debraheem commented 3 months ago

The GitHub branch tied to this issue has been successfully merged. Is it alright if I close this issue?

mjoyceGR commented 3 months ago

Yup, cheers