21cmfast / 21cmFAST

Official repository for 21cmFAST: a code for generating fast simulations of the cosmological 21cm signal
MIT License
58 stars 37 forks source link

Segmentation Fault when using photon conservation #311

Closed IvanNikolic21 closed 1 year ago

IvanNikolic21 commented 1 year ago

Describe the bug: 21cmFAST produces Segmentation Fault errors for certain values of parameters. This happens while running multinest within 21CMMC and only when using photon-conservation flag 'PHOTO_CONS=True'. Unfortunately, segmentation fault crashes the whole run when multinest is ran with mpirun.

To Reproduce: Using the following code with mpirun --map-by ppr:nproc:node run.py:

from astropy import cosmology
from py21cmmc import mcmc
from py21cmmc import LikelihoodNeutralFraction
from py21cmmc import CoreLightConeModule
from py21cmmc import LikelihoodLuminosityFunction, CoreLuminosityFunction
from py21cmmc import LikelihoodForest, CoreForest
from py21cmmc import LikelihoodPlanck
from py21cmmc import LikelihoodBase

user_params = {'HII_DIM':128, 'BOX_LEN':250.0, 'USE_INTERPOLATION_TABLES': True, 'USE_FFTW_WISDOM': True, "PERTURB_ON_HIGH_RES": True, "N_THREADS": 4, "OUTPUT_ALL_VEL": True}
cosmo_params = {'SIGMA_8': 0.8118, 'hlittle': 0.6688, 'OMm': 0.321, 'OMb':0.04952, 'POWER_INDEX':0.9626}
flag_options = {'USE_MASS_DEPENDENT_ZETA': True, "INHOMO_RECO": True, "PHOTON_CONS": True,}
global_params = {'Z_HEAT_MAX': 15.0, 'T_RE': 1e4, 'ALPHA_UVB': 2.0, 'PhotonConsEndCalibz':3.5}

model_name = "forest"

cosmo = cosmology.FlatLambdaCDM(H0=cosmo_params['hlittle']*100, Om0=cosmo_params['OMm'],Tcmb0=2.725,Ob0=cosmo_params['OMb'])

class Prior(LikelihoodBase):
    def reduce_data(self, ctx):
        params = ctx.getParams()
        return  [params.log10_f_rescale, params.f_rescale_slope]

    def computeLikelihood(self, model):
        return -0.5 * ((model[0] / 5)**2 + (model[1] / 2.5)**2)

core = [ CoreLightConeModule(redshift=4.9,max_redshift=15, debug_mode=True, user_params=user_params, cosmo_params=cosmo_params, flag_options=flag_options, global_params=global_params, regenerate=False,initial_conditions_seed = 1993,cache_dir=my_cache, cache_mcmc=False),
         CoreLuminosityFunction( redshift=6, sigma=0, name='lfz6', user_params=user_params, cosmo_params=cosmo_params, flag_options=flag_options, global_params=global_params, regenerate=False,initial_conditions_seed = 1993,cache_dir=my_cache, cache_mcmc=False),
         CoreLuminosityFunction( redshift=7, sigma=0, name='lfz7', user_params=user_params, cosmo_params=cosmo_params, flag_options=flag_options, global_params=global_params, regenerate=False,initial_conditions_seed = 1993,cache_dir=my_cache, cache_mcmc=False),
         CoreLuminosityFunction( redshift=8, sigma=0, name='lfz8', user_params=user_params, cosmo_params=cosmo_params, flag_options=flag_options, global_params=global_params, regenerate=False,initial_conditions_seed = 1993,cache_dir=my_cache, cache_mcmc=False),
         CoreLuminosityFunction( redshift=10, sigma=0, name='lfz10', user_params=user_params, cosmo_params=cosmo_params, flag_options=flag_options, global_params=global_params, regenerate=False,initial_conditions_seed = 1993,cache_dir=my_cache, cache_mcmc=False),
         CoreForest( redshift=5.0, name='bosman5pt0', n_realization=150, user_params=user_params, cosmo_params=cosmo_params, flag_options=flag_options, global_params=global_params, regenerate=False,initial_conditions_seed = 1993,cache_dir=my_cache, cache_mcmc=False),
         CoreForest( redshift=5.2, name='bosman5pt2', n_realization=150, user_params=user_params, cosmo_params=cosmo_params, flag_options=flag_options, global_params=global_params, regenerate=False,initial_conditions_seed = 1993,cache_dir=my_cache, cache_mcmc=False),
         CoreForest( redshift=5.4, name='bosman5pt4', n_realization=150, user_params=user_params, cosmo_params=cosmo_params, flag_options=flag_options, global_params=global_params, regenerate=False,initial_conditions_seed = 1993,cache_dir=my_cache, cache_mcmc=False),
         CoreForest( redshift=5.6, name='bosman5pt6', n_realization=150, user_params=user_params, cosmo_params=cosmo_params, flag_options=flag_options, global_params=global_params, regenerate=False,initial_conditions_seed = 1993,cache_dir=my_cache, cache_mcmc=False),
         CoreForest( redshift=5.8, name='bosman5pt8', n_realization=150, user_params=user_params, cosmo_params=cosmo_params, flag_options=flag_options, global_params=global_params, regenerate=False,initial_conditions_seed = 1993,cache_dir=my_cache, cache_mcmc=False),
         CoreForest( redshift=6.0, name='bosman6pt0', n_realization=150, user_params=user_params, cosmo_params=cosmo_params, flag_options=flag_options, global_params=global_params, regenerate=False,initial_conditions_seed = 1993,cache_dir=my_cache, cache_mcmc=False)]

likelihood = [ LikelihoodPlanck(), Prior(),
               LikelihoodNeutralFraction(),
               LikelihoodLuminosityFunction(name='lfz6', simulate = False, ),
               LikelihoodLuminosityFunction(name='lfz7', simulate = False, ),
               LikelihoodLuminosityFunction(name='lfz8', simulate = False, ),
               LikelihoodLuminosityFunction(name='lfz10', simulate = False, ),
               LikelihoodForest(name='bosman5pt0'),
               LikelihoodForest(name='bosman5pt2'),
               LikelihoodForest(name='bosman5pt4'),
               LikelihoodForest(name='bosman5pt6'),
               LikelihoodForest(name='bosman5pt8'),
               LikelihoodForest(name='bosman6pt0'),
               ]

mcmc_options = {'n_live_points': 1000,
        'importance_nested_sampling': False,
        'sampling_efficiency': 0.8,
        'evidence_tolerance': 0.5,
        'multimodal': True,
        'max_iter': 0,
        'n_iter_before_update': 1,
        'write_output': True}

chain = mcmc.run_mcmc(
    core, likelihood,    
    datadir='./',     
    model_name=model_name,   
    params=dict(  
        F_STAR10 = [-1.3, -2, -0.5, 0.5],
        ALPHA_STAR = [0.5, 0.0, 1.0, 0.5],
        M_TURN = [8.69897, 8, 10, 0.1],
        t_STAR = [0.5, 0.01, 1, 0.01],
        F_ESC10 = [-1, -3, 0, 1.0],
        ALPHA_ESC = [-0.5, -1.0, 1.0, 1.0],
        log10_f_rescale = [0, -5.0, 5.0, 5],
        f_rescale_slope = [0, -2.5, 2.5, 2.5],
    ),
    continue_sampling=True, 
    use_multinest=True,
    **mcmc_options
)

procudes the following error:


[r298:98568:0:98568] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x10)
==== backtrace (tid:  98568) ====
 0  /jet/packages/spack/opt/spack/linux-centos8-zen2/gcc-10.2.0/ucx-1.9.0-hcxgpzalplefzn5amy3nzwdix4cpbift/lib/libucs.so.0(ucs_handle_error+0x254) [0x145ea0e9e5c4]
 1  /jet/packages/spack/opt/spack/linux-centos8-zen2/gcc-10.2.0/ucx-1.9.0-hcxgpzalplefzn5amy3nzwdix4cpbift/lib/libucs.so.0(+0x257a7) [0x145ea0e9e7a7]
 2  /jet/packages/spack/opt/spack/linux-centos8-zen2/gcc-10.2.0/ucx-1.9.0-hcxgpzalplefzn5amy3nzwdix4cpbift/lib/libucs.so.0(+0x25a7e) [0x145ea0e9ea7e] 
 3  /lib64/libpthread.so.0(+0x12dd0) [0x1460c1683dd0]
 4  /jet/home/nikoli/miniconda3/envs/21cmmc/lib/libgsl.so.25(gsl_spline_eval+0) [0x145ea437ccc0]
 5  /jet/home/nikoli/programs/21cmFAST/21cmFAST/src/py21cmfast/c_21cmfast.abi3.so(adjust_redshifts_for_photoncons+0x353) [0x145ea483f5b3]
 6  /jet/home/nikoli/programs/21cmFAST/21cmFAST/src/py21cmfast/c_21cmfast.abi3.so(ComputeIonizedBox+0x260e) [0x145ea4851abe]
 7  /jet/home/nikoli/programs/21cmFAST/21cmFAST/src/py21cmfast/c_21cmfast.abi3.so(+0x48747) [0x145ea4853747]
 8  python(PyCFunction_Call+0x119) [0x4d5a09]
 9  python(_PyEval_EvalFrameDefault+0x558e) [0x4b882e]
10  python(_PyEval_EvalCodeWithName+0x201) [0x4b2041]
11  python(_PyFunction_FastCallKeywords+0x29c) [0x4c638c]
12  python() [0x4bae2f]
13  python(_PyEval_EvalFrameDefault+0x15d2) [0x4b4872]
14  python(_PyEval_EvalCodeWithName+0x201) [0x4b2041]
15  python(_PyFunction_FastCallKeywords+0x29c) [0x4c638c]
16  python() [0x4bae2f]
17  python(_PyEval_EvalFrameDefault+0x15d2) [0x4b4872]
18  python(_PyEval_EvalCodeWithName+0x201) [0x4b2041]
19  python(_PyFunction_FastCallKeywords+0x29c) [0x4c638c]
20  python() [0x4bae2f]
21  python(_PyEval_EvalFrameDefault+0x15d2) [0x4b4872]
22  python(_PyEval_EvalCodeWithName+0x201) [0x4b2041]
23  python(_PyFunction_FastCallDict+0x2d6) [0x4cd006]
24  python(_PyEval_EvalFrameDefault+0x1ea8) [0x4b5148]
25  python(_PyFunction_FastCallKeywords+0x106) [0x4c61f6]
26  python() [0x4bae2f]
27  python(_PyEval_EvalFrameDefault+0xa9e) [0x4b3d3e]
28  python(_PyEval_EvalCodeWithName+0x201) [0x4b2041]
29  python(_PyFunction_FastCallKeywords+0x29c) [0x4c638c]
30  python() [0x4bae2f]
31  python(_PyEval_EvalFrameDefault+0xa9e) [0x4b3d3e]
32  python(_PyEval_EvalCodeWithName+0x201) [0x4b2041]
33  python(_PyFunction_FastCallKeywords+0x29c) [0x4c638c]
34  python() [0x4bae2f]
35  python(_PyEval_EvalFrameDefault+0x971) [0x4b3c11]
36  python(_PyEval_EvalCodeWithName+0x201) [0x4b2041]
37  python(_PyFunction_FastCallDict+0x2d6) [0x4cd006]
38  /jet/home/nikoli/miniconda3/envs/21cmmc/lib/python3.7/lib-dynload/_ctypes.cpython-37m-x86_64-linux-gnu.so(+0x10698) [0x145eb0216698]
39  /jet/home/nikoli/miniconda3/envs/21cmmc/lib/python3.7/lib-dynload/../../libffi.so.7(+0x67e1) [0x145eb02007e1]
40  /jet/home/nikoli/miniconda3/envs/21cmmc/lib/python3.7/lib-dynload/../../libffi.so.7(+0x6b70) [0x145eb0200b70]
41  /jet/home/nikoli/local/lib/libmultinest_mpi.so(__cnested_MOD_loglike_f+0x69) [0x145e9c9c8b16]
42  /jet/home/nikoli/local/lib/libmultinest_mpi.so(__nested_MOD_gen_initial_live+0x206b) [0x145e9ca586a7]
43  /jet/home/nikoli/local/lib/libmultinest_mpi.so(__nested_MOD_nestsample+0xfa4) [0x145e9ca5a8cb]
44  /jet/home/nikoli/local/lib/libmultinest_mpi.so(__nested_MOD_nestrun+0x17b4) [0x145e9ca5c56f]
45  /jet/home/nikoli/local/lib/libmultinest_mpi.so(run+0x252) [0x145e9c9c89ab]
46  /jet/home/nikoli/miniconda3/envs/21cmmc/lib/python3.7/lib-dynload/../../libffi.so.7(+0x69dd) [0x145eb02009dd]
47  /jet/home/nikoli/miniconda3/envs/21cmmc/lib/python3.7/lib-dynload/../../libffi.so.7(+0x6067) [0x145eb0200067]
48  /jet/home/nikoli/miniconda3/envs/21cmmc/lib/python3.7/lib-dynload/_ctypes.cpython-37m-x86_64-linux-gnu.so(_ctypes_callproc+0x2f7) [0x145eb02153c7]
49  /jet/home/nikoli/miniconda3/envs/21cmmc/lib/python3.7/lib-dynload/_ctypes.cpython-37m-x86_64-linux-gnu.so(+0xed99) [0x145eb0214d99]
50  python(PyObject_Call+0x51) [0x4d35a1]
51  python(_PyEval_EvalFrameDefault+0x1ea8) [0x4b5148]
52  python(_PyEval_EvalCodeWithName+0x201) [0x4b2041]
53  python(_PyFunction_FastCallKeywords+0x29c) [0x4c638c]
54  python() [0x4bae2f]
55  python(_PyEval_EvalFrameDefault+0x15d2) [0x4b4872]
56  python(_PyEval_EvalCodeWithName+0x201) [0x4b2041]
57  python(_PyFunction_FastCallDict+0x2d6) [0x4cd006]
58  python(_PyEval_EvalFrameDefault+0x1ea8) [0x4b5148]
59  python(_PyEval_EvalCodeWithName+0x201) [0x4b2041]
60  python(PyEval_EvalCodeEx+0x39) [0x4b1e39]
61  python(PyEval_EvalCode+0x1b) [0x5537fb]
=================================
--------------------------------------------------------------------------
Primary job  terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpirun noticed that process rank 106 with PID 98568 on node r298 exited on signal 11 (Segmentation fault).
--------------------------------------------------------------------------

I investigated which parameters crash the code, and in this instance it's this one:

[({'ALPHA_ESC': -0.9482376575469971, 'ALPHA_STAR': 0.1368311643600464, 'ALPHA_STAR_MINI': 0.5, 'A_LW': 2.0, 'A_VCB': 1.0, 'BETA_LW': 0.6, 'BETA_VCB': 1.8, 'F_ESC10': -0.16299712657928467, 'F_ESC7_MINI': -2.0, 'F_H2_SHIELD': 0.0, 'F_STAR10': -0.5723588466644287, 'F_STAR7_MINI': -2.0, 'HII_EFF_FACTOR': 30.0, 'INHOMO_RECO': False, 'ION_Tvir_MIN': 4.69897, 'L_X': 40.0, 'L_X_MINI': 40.0, 'M_TURN': 8.044878363609314, 'N_RSD_STEPS': 20, 'X_RAY_SPEC_INDEX': 1.0, '_NU_X_THRESH': 500.0, '_name': 'AstroParams', '_t_STAR': 0.20997181951999666}, array(98568.))]

To see whether the same thing happens with 'PHOTO_CONS'=False, I launched a lightcone with the same parameters (only with alpha_esc scaled by -0.2, according to Park+22) and found no segmentation fault, while 'PHOTO_CONS'=True again reproduced the segmentation fault.

Expected behavior:

The parameters that create this problem are in the corners of the prior space. Therefore their EoR histories are somewhat exotic, in this case, escape fraction is 1.0 for almost all halos, and SFRD is quite high, with very low feedback. Therefore, this is only a problem in 21cmmc samplers, and only in the first few steps. Therefore, a solution could be to disregard the 'PHOTO_CONS' flag for certain parameter values if they are expected to crash, possibly scaling 'ALPHA_ESC' as well.

andreimesinger commented 1 year ago

It seems to me that a straightforward, short term "fix" could be to do a check using the analytic EoR history (dQHII/dt = dnion/dt) if the results are reasonable and fall within the interpolation tables. If not, rather than having the code crash, it could just output a warning message and run without photon conservation, but with an alpha_esc that is decreased (just internally) by 0.2 (according to Park, Greig, Mesinger 2022). What do you think @BradGreig?

BradGreig commented 1 year ago

Hey @IvanNikolic21 @andreimesinger do you have the reionisation history available for this specific parameter set? (analytic, calibration or with Photon Cons turned off). I haven't found the time to run it myself. Just want to see which class it is, as there are a few different corners of parameter space that lead to different possible causes of failure.

I'm wondering if there might be an easier way to decide simply off the analytic and/or calibration curve. If the history is too extreme, one can then simply rule out the model rather than trying to sample it (since it shouldn't be allowed by most other priors).

Note, technically speaking just modifying alpha_esc by 0.2 would not necessarily be correct. It is 0.2 for the model considered in the paper, but it doesn't need to be uniformly 0.2 for all possible models. However, for this it is probably a reasonable approximation.

andreimesinger commented 1 year ago

Thanks @BradGreig , I think @IvanNikolic21 mentioned that it had a super early EoR history, but he can comment more in detail. But indeed deciding based on the analytic curve is exactly what I am suggesting above. I don't think it is sufficient though to return a likelihood of zero in 21cmMC, since the code would in any case crash if the user is just running 21cmFAST. A more stable solution, imo is to do as I suggested above inside 21cmFAST: (i) using the analytic curve, check if tau_e is within, say 3 sigma of Planck, or if EoR ends before/after z= whatever; (ii) if it does, turn off photon conservation for that one run, and run without PC but with an internally decreased alpha_esc, printing out a corresponding warning. In the end these runs will not matter as they are already ruled out, but i think it is the most stable solution.

steven-murray commented 1 year ago

I'm open to either way, or even having a GLOBAL option that switches between the two behaviours. I.e., by default, raise an exception when it's going bananas (this exception can be handled in 21cmMC as a "non-fatal" exception, and a loglike of -inf returned, and within 21cmFAST it would just crash, but with a useful exception message). But you could switch on global_params.ignore_photoncons_errors = True or something, and in that case it would only warn you, and do what @andreimesinger suggests.

IvanNikolic21 commented 1 year ago

Just to follow up on the EoR history for this particular parameter set. Here is the EoR history, and as I said, it ends very early. This curve is obtained w/o photo conservation, with alpha_esc scaled by 0.2 xh_evolution

BradGreig commented 1 year ago

Hey @IvanNikolic21 @andreimesinger I did some further digging into this, mostly because an early EoR should be able to be handled by the Photon Conservation. Most of the problems I am familiar with occur for models with little to no star-formation, that is reionisation stagnates or never happens.

The cause of the segmentation fault in the above case is simply because it tries to access a function that hasn't yet been initialised. This has happened because you have a Z_HEAT_MAX = 15, and the first point just so happens to fall outside the conditions where the requisite function would have been initialised. By adding in this initialisation step, the model completes without segmentation fault.

I've created a PR with this change (#312), which should be able to handle this specific failure.

However, I think a broader discussion is required for how better to treat and handle failures of the photon conservation correction in the corners of parameter space.

qyx268 commented 1 year ago

brad's PR should solve the problem with PC when running py21cmmc and close this issue. A new issue should be raised for how to better deal with PC with very late EoR, which imo is a low-priority...

andreimesinger commented 1 year ago

many thanks brad!

On Fri, 11 Nov 2022 at 08:21, Yuxiang Qin @.***> wrote:

brad's PR should solve the problem with PC when running py21cmmc and close this issue. A new issue should be raised for how to better deal with PC with very late EoR, which imo is a low-priority...

— Reply to this email directly, view it on GitHub https://github.com/21cmfast/21cmFAST/issues/311#issuecomment-1311320008, or unsubscribe https://github.com/notifications/unsubscribe-auth/ADH2UAQDF4QWZDYFXVLN7GDWHXXXLANCNFSM6AAAAAAR2LIZKE . You are receiving this because you were mentioned.Message ID: @.***>