Closed benjaminrose closed 9 months ago
Running the FAIL_REPEAT command (SNIaMODEL0-0002) on the login node works just fine.
Please make sure that FAIL_REPEAT jobs run in the $SNANA_DEBUG directory:
FATAL ERROR ABORT called by read_input_file Cannot open input file : 'simgen_SNIa.input'
Either copy this file locally or check permissions
Sorry. I copied the input files into $SNANA_DEBUG. Both RICK.CMD and FAIL_REPEAT*.CMD now run.
My test job works interactively, so my suspicion is memory consumption for the rather large HOSTLIB. I don't know where your submit_batch inputs are located, but check memory and whether it needs to be increased.
This is still failing, even with 40GB of memory. I have tried different size sims as well.
I have previously run a HOSTLIB with twice as many rows and a VARNAMES list twice as big as yours (on Perlmutter, not Midway), so I am puzzled and wondering if perhaps the problem is not HOSTLIB related. To check if the crash is indeed a HOSTLIB problem, can you rerun with HOSTLIB_FILE NONE?
I just got a segfault on the login node. Look at FAIL_REPEAT_PIP_PIPPIN_ROMAN_TRANS_SN_SNIaMODEL0-0002_35061.CMD in $PIPPIN_OUTPUT/PIPPIN_ROMAN_TRANS/1_SIM/SN/LOGS. I won't touch anything for a while so you can have time to investigate.
There are also DREaMing-small.hostlib.gz and DREaMing-medium.hostlib.gz in the same folder as the main HOSTLIB, if they can help speed up debugging.
Not sure how your sim ran because the HOSTLIB_WGTMAP_FILE is not a wgtmap ... I fixed the sim to abort on invalid WGTMAP file.
I have removed the wgtmap issue and made a fast (6 s) reproducible command. I am still getting a segfault at:
******************************************************************
Begin Generating Lightcurves.
Found Max dN/dz * wgt = 9.355497e+05 at z = 1.292
I have put an example in $SNANA_DEBUG/segfault_bmr
The segfault still exists, even when using $NGRST_ROOT/starterKits/sim_host_redshift/3dhst_sim_input_cat_v1.7.hostlib. See good_hostlib.cmd in $SNANA_DEBUG/segfault_bmr.
fill_TABLE_MWXT_SEDMODEL() is called from genmag_SALT2, but here the first event (z=5.1) has no bands within SALT2 model range and thus fill_TABLE_MWXT_SEDMODEL() is not called. But then genSpec_SALT2 is called and tries using the MWXT table that has not been allocated or filled.
Fix is to call fill_TABLE_MWXT_SEDMODEL() from genSpec_SALT2 ... it does nothing if already called from genmag_SALT2.
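For readers following the diagnosis, here is a minimal C sketch of the pattern described above: a table fill that is safe to call from both code paths because it returns immediately once the table exists. This is not the SNANA source. The struct `MwxtTable`, the functions `fill_table_mwxt`, `genmag_sketch` and `genspec_sketch`, the bin count of 4, and the placeholder values are all hypothetical and only illustrate the idea.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical stand-in for the MWXT table; not the SNANA struct. */
typedef struct {
  int     n_lam;     /* number of wavelength bins */
  double *mwxt;      /* extinction factor per bin; NULL until filled */
  int     is_filled; /* guard flag: set to 1 after the first fill */
} MwxtTable;

static int n_fill_calls = 0; /* counts real fills, for demonstration only */

/* Idempotent fill: allocate and fill once, do nothing on later calls. */
void fill_table_mwxt(MwxtTable *t, int n_lam) {
  int i;
  if (t->is_filled) { return; }
  t->mwxt = malloc((size_t)n_lam * sizeof(double));
  if (t->mwxt == NULL) {
    fprintf(stderr, "could not allocate MWXT table\n");
    exit(EXIT_FAILURE);
  }
  for (i = 0; i < n_lam; i++) { t->mwxt[i] = 1.0; } /* placeholder values */
  t->n_lam     = n_lam;
  t->is_filled = 1;
  n_fill_calls++;
}

/* Magnitude path: skipped entirely when no band overlaps the model range,
   so the table may still be unallocated afterwards. */
void genmag_sketch(MwxtTable *t, int n_band_in_range) {
  if (n_band_in_range == 0) { return; }
  fill_table_mwxt(t, 4);
}

/* Spectrum path: calls the fill itself, so it never reads a NULL table. */
double genspec_sketch(MwxtTable *t) {
  int i;
  double sum = 0.0;
  fill_table_mwxt(t, 4); /* no-op if the magnitude path already filled it */
  assert(t->mwxt != NULL);
  for (i = 0; i < t->n_lam; i++) { sum += t->mwxt[i]; }
  return sum;
}
```

In this sketch, an event with zero in-range bands leaves the table unallocated after the magnitude step, and the spectrum step then fills it on demand instead of dereferencing a NULL pointer. When the magnitude step has already filled the table, the second call returns at the guard.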
Three out of ten simulations end up with segfaults on light curve generation.
The three job logs and fail repeat scripts are available at $SNANA_DEBUG/lightcurve_generation_segfault. I am running them as part of a Pippin job (input NGRST_USERS/brose3/roman/PIPPIN_ROMAN_TRANS.yml, output $PIPPIN_OUTPUT/PIPPIN_ROMAN_TRANS/1_SIM/SN/). The jobs that did succeed were able to create ~7000 light curves.