RickKessler / SNANA

Supernova Analysis package

Seg Fault when generating light curves #1248

Closed benjaminrose closed 9 months ago

benjaminrose commented 11 months ago

Three out of ten simulations end up with segfaults on light curve generation:

******************************************************************
    Begin Generating Lightcurves.
  Found Max dN/dz * wgt = 9.355498e+05 at z =    1.290

 *** Break *** segmentation violation

===========================================================
There was a crash.
This is the entire stack trace of all threads:
===========================================================
#0  0x00007f79eeda060c in waitpid () from /lib64/libc.so.6
#1  0x00007f79eed1df62 in do_system () from /lib64/libc.so.6
#2  0x00007f79f3489e59 in TUnixSystem::StackTrace() () from /software/ROOT-5.34.14-el7-x86_64/lib/libCore.so
#3  0x00007f79f348ba5c in TUnixSystem::DispatchSignals(ESignals) () from /software/ROOT-5.34.14-el7-x86_64/lib/libCore.so
#4  <signal handler called>
#5  0x00000000005220cd in INTEG_zSED_SALT2 ()
#6  0x00000000005237e7 in genSpec_SALT2 ()
#7  0x000000000041532f in GENSPEC_TRUE ()
#8  0x0000000000422b8e in GENSPEC_DRIVER ()
#9  0x00000000004456b6 in main ()
===========================================================

The lines below might hint at the cause of the crash.
If they do not help you then please submit a bug report at
http://root.cern.ch/bugs. Please post the ENTIRE stack trace
from above as an attachment in addition to anything else
that might help us fixing this issue.
===========================================================
#5  0x00000000005220cd in INTEG_zSED_SALT2 ()
#6  0x00000000005237e7 in genSpec_SALT2 ()
#7  0x000000000041532f in GENSPEC_TRUE ()
#8  0x0000000000422b8e in GENSPEC_DRIVER ()
#9  0x00000000004456b6 in main ()
===========================================================

The three job logs and fail repeat scripts are available at $SNANA_DEBUG/lightcurve_generation_segfault.

I am running them as part of a Pippin job (input NGRST_USERS/brose3/roman/PIPPIN_ROMAN_TRANS.yml, output $PIPPIN_OUTPUT/PIPPIN_ROMAN_TRANS/1_SIM/SN/).


The jobs that did succeed were able to create ~7000 light curves.

Done generating 6834 SN lightcurves from RANDOM source.
         (6834 lightcurves requested => 4951 were written)
         CPUTIME_PROCESS_ALL  = 19.817 minute
         CPUTIME_PROCESS_RATE(GEN) = 5.748 evt/second
         CPUTIME_PROCESS_RATE(ACC) = 4.164 evt/second

benjaminrose commented 11 months ago

Running the FAIL_REPEAT command (SNIaMODEL0-0002) on the login node works just fine.

RickKessler commented 11 months ago

Please make sure that FAIL_REPEAT jobs run in the $SNANA_DEBUG directory:

FATAL ERROR ABORT called by read_input_file Cannot open input file : 'simgen_SNIa.input'

Either copy this file locally or check permissions

benjaminrose commented 11 months ago

Sorry. I copied the input files into $SNANA_DEBUG. Both RICK.CMD and FAIL_REPEAT*.CMD now run.

RickKessler commented 11 months ago

My test job works interactively, so my suspicion is memory consumption for the rather large HOSTLIB. I don't know where your submit_batch inputs are located, but check memory and whether it needs to be increased.

benjaminrose commented 10 months ago

This is still failing, even with 40GB of memory. I have tried different size sims as well.

RickKessler commented 10 months ago

I have previously run a HOSTLIB with twice as many rows and a VARNAMES list twice as big as yours (on Perlmutter, not Midway), so I am puzzled and wondering if perhaps the problem is not HOSTLIB related. To check if the crash is indeed a HOSTLIB problem, can you rerun with HOSTLIB_FILE NONE?

benjaminrose commented 10 months ago

I just got a segfault on the login node. Look at FAIL_REPEAT_PIP_PIPPIN_ROMAN_TRANS_SN_SNIaMODEL0-0002_35061.CMD in $PIPPIN_OUTPUT/PIPPIN_ROMAN_TRANS/1_SIM/SN/LOGS. I won't touch anything for a while so you can have time to investigate.

benjaminrose commented 10 months ago

There are also DREaMing-small.hostlib.gz and DREaMing-medium.hostlib.gz in the same folder as the main HOSTLIB, in case they help speed up debugging.

RickKessler commented 10 months ago

Not sure how your sim ran, because the HOSTLIB_WGTMAP_FILE is not a wgtmap ... I fixed the sim to abort on an invalid WGTMAP file.

benjaminrose commented 10 months ago

I have removed the wgtmap issue and made a fast (6 s) reproducible command. I am still getting a seg fault at

******************************************************************
   Begin Generating Lightcurves.
  Found Max dN/dz * wgt = 9.355497e+05 at z =    1.292

I have put an example in $SNANA_DEBUG/segfault_bmr

benjaminrose commented 9 months ago

The segfault still exists, even when using $NGRST_ROOT/starterKits/sim_host_redshift/3dhst_sim_input_cat_v1.7.hostlib. See good_hostlib.cmd in $SNANA_DEBUG/segfault_bmr.

RickKessler commented 9 months ago

fill_TABLE_MWXT_SEDMODEL() is called from genmag_SALT2, but here the first event (z=5.1) has no bands within the SALT2 model range, and thus fill_TABLE_MWXT_SEDMODEL() is not called. genSpec_SALT2 is then called and tries to use the MWXT table, which has not been allocated or filled.

The fix is to call fill_TABLE_MWXT_SEDMODEL() from genSpec_SALT2; it does nothing if already called from genmag_SALT2.
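
For readers following along, here is a minimal C sketch of the guard pattern described above. It is not the SNANA source: every name ending in `_sketch`, the table layout, the `is_filled` flag, and the placeholder fill values are assumptions made for illustration. It only shows why calling an idempotent fill routine from the spectrum path avoids reading an unallocated table when the magnitude path was skipped.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>

#define NLAM_SKETCH 4  /* hypothetical number of wavelength bins */

/* Hypothetical stand-in for the MWXT table; not the SNANA struct. */
typedef struct {
    double *xt;        /* one factor per wavelength bin */
    int     is_filled; /* 0 until the fill routine has run */
    int     n_fill;    /* counts real fills, for illustration only */
} MwxtTableSketch;

MwxtTableSketch TABLE_SKETCH = { NULL, 0, 0 };

/* Allocate and fill the table once; later calls return immediately. */
void fill_table_sketch(double mwebv) {
    if (TABLE_SKETCH.is_filled) { return; }
    TABLE_SKETCH.xt = malloc(NLAM_SKETCH * sizeof(double));
    if (TABLE_SKETCH.xt == NULL) {
        fprintf(stderr, "fill_table_sketch: malloc failed\n");
        exit(EXIT_FAILURE);
    }
    for (int i = 0; i < NLAM_SKETCH; i++) {
        /* placeholder values, not a real dust law */
        TABLE_SKETCH.xt[i] = 1.0 - 0.1 * mwebv * (i + 1);
    }
    TABLE_SKETCH.is_filled = 1;
    TABLE_SKETCH.n_fill++;
}

/* Magnitude path: returns early, without filling the table, when no
   band overlaps the model range (the z=5.1 case in this thread). */
int genmag_sketch(int n_band_in_range, double mwebv) {
    if (n_band_in_range == 0) { return 0; }
    fill_table_sketch(mwebv);
    return n_band_in_range;
}

/* Spectrum path: the explicit fill call is the fix. Without it, the
   loop below would dereference a NULL pointer whenever the magnitude
   path had returned early. */
double genspec_sketch(double mwebv) {
    fill_table_sketch(mwebv);  /* no-op if already filled */
    assert(TABLE_SKETCH.xt != NULL);
    double sum = 0.0;
    for (int i = 0; i < NLAM_SKETCH; i++) {
        sum += TABLE_SKETCH.xt[i];
    }
    return sum;
}
```

Calling `genmag_sketch(0, mwebv)` followed by `genspec_sketch(mwebv)` reproduces the failing order of calls from this issue; with the guard in place the table is filled exactly once, on whichever path reaches it first.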