21cmfast / 21cmFAST

Official repository for 21cmFAST: a code for generating fast simulations of the cosmological 21cm signal

feat: add ability to partially read datasets, and partially specify #220

Closed: steven-murray closed this PR 3 years ago

steven-murray commented 3 years ago

Addresses #63

This PR does a few things:

  1. Adds the ability to partially read datasets, e.g. init_box.read(keys=['hires_density', 'hires_vx']).
  2. Adds the ability to specify which arrays should get initialized given the input parameters, via a _get_box_structures() method. The code then only checks that those boxes are initialized, rather than everything. This is super important for initial conditions -- it means the 2LPT and relative-velocity boxes don't need to be initialized.
  3. It tries to add knowledge of which particular boxes in each struct are required for the computation of other structs (via the get_requirements method). The idea was that, when calling e.g. ComputeTsBox, we'd know we don't need all of the bits of InitialConditions and could delete some of them from memory. However, this comes with a lot of problems -- there are instances where one step might delete some memory that the next step needs back. If you are caching, this is "fine" in the sense that it can be auto-read back in (though that's still not optimal), but if caching is off, that information is just gone forever. So this is perhaps a road we shouldn't go down.
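For illustration, the partial-read pattern of point 1 can be sketched with a toy stand-in (the real `read` lives on the HDF5-backed output structs in py21cmfast; everything here besides the `read(keys=...)` call signature is hypothetical):

```python
# Minimal sketch of partial dataset reading. `FakeBox` is a hypothetical
# stand-in for the HDF5-backed output struct; only the read(keys=...) API
# mirrors the PR.
class FakeBox:
    def __init__(self, stored):
        self._stored = stored  # stand-in for the on-disk datasets
        self.loaded = {}       # arrays actually held in memory

    def read(self, keys=None):
        """Load only the requested datasets, or everything if keys is None."""
        for key in (keys if keys is not None else self._stored):
            self.loaded[key] = self._stored[key]

box = FakeBox({"hires_density": [0.1, 0.2], "hires_vx": [1.0], "lowres_vy": [2.0]})
box.read(keys=["hires_density", "hires_vx"])
print(sorted(box.loaded))  # ['hires_density', 'hires_vx']
```

Only the two requested datasets end up in memory; the third stays on "disk".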
steven-murray commented 3 years ago

@BradGreig I'd love your opinions here. I think points 1 and 2 above are solid; I'm just trying to figure out point 3. In my mind, the real benefit here is to be able to (1) get rid of half the initial conditions almost straight away (we only need either high or low res in PerturbField), and (2) completely remove the ICs from memory after the PerturbField is calculated.

Problematically, the perturb field is calculated for lots of redshifts. If things are calculated with an outer loop over redshift, and at each redshift the perturb field, then spin temp, then ionize box are calculated, then if we remove the ICs automatically when we calculate the spin temp, we'd need to read them back in again for the next redshift's perturb field. Even worse if we're not caching.

So that's clearly not an option. Perhaps, with just the ability to do partial loads, we can manually build the dropping of some IC memory into the higher-level functions (`run_coeval`/`run_lightcone`), by doing the loop over perturb fields for all redshifts first, then dropping what isn't needed for the rest of the boxes?

Even then, I'm worried about the case where a user passes a pre-calculated IC struct to run_lightcone, which they never cached to disk, and we then just auto-empty it within the function. We need to avoid that happening.
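One way to guard against that case is to purge only arrays that are recoverable from cache. A hedged sketch (all names here are hypothetical, not the actual py21cmfast API):

```python
# Sketch: free large IC arrays only when they can be re-read from cache later.
# `FakeICs`, `maybe_purge`, and `cached_to_disk` are illustrative names.
def maybe_purge(ics, cached_to_disk):
    """Return True and free the arrays only if they are recoverable from disk."""
    if not cached_to_disk:
        # User-supplied, never-cached ICs: purging would lose them forever.
        return False
    ics.arrays.clear()  # stand-in for freeing the underlying memory
    return True

class FakeICs:
    def __init__(self):
        self.arrays = {"hires_density": [0.0] * 8}

print(maybe_purge(FakeICs(), cached_to_disk=False))  # False: left untouched
print(maybe_purge(FakeICs(), cached_to_disk=True))   # True: arrays freed
```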

Was hoping you could offer some more insight :-)

BradGreig commented 3 years ago

Hey @steven-murray, yep, I agree with you about points 1 and 2. These cover the majority of the issues leading to this PR.

I also agree about point 3. This is certainly non-trivial, and I worry it might get a bit ugly trying to guard against all possible permutations of misuses.

There are a couple of possible high-level quick fixes that might deal with this, though certainly these need to be discussed with respect to their pros and cons.

  1. One could enforce that, if run_coeval or run_lightcone is called, the ICs must be saved to disk. The benefit is that they would always be accessible (readable from cache) when required, avoiding any issues with removing data that is later needed. The only real downside I can see is the file size that may need to be kept (i.e. enforcing that a large file be stored, which might take up some or all of a user's disk quota). I don't think there is a big downside to this, but I'm also not 100% sure this is the kind of behaviour we would want. I guess this could start to be an issue once 21CMMC is used while varying the random seed (rather than just using one ICs set for all realisations).

  2. Certainly one could separate out all perturb fields and store them to cache in most instances; however, this becomes problematic when the code is called requiring back-filling of data. For example, a call of run_coeval(redshift=8.0, USE_TS_FLUCT=True) checks to see if the data exists and, if it doesn't, goes back in redshift to produce it (and it does this for all boxes, not just perturb). I guess it is possible to make this separable in the same way too, but I worry that it might become more complicated (or not foolproof, given other files may also be required). I think the point of having the perturb field embedded in the main calculation was to minimise I/O as much as possible. But I guess that only works for small-ish boxes (e.g. MCMCable boxes). Perhaps separating the perturb field will be needed anyway, regardless of the adopted approach, or we should be able to flip between one and the other (i.e. do all perturbs at once, or on the fly).

  3. Somewhat extending your idea, I wonder if it is possible to have a more granular caching system. Currently we have write=True/False, i.e. write everything or nothing. Perhaps this can be mitigated with a third option: write/cache the bare minimum required. This should ensure that everything necessary for performing any calculation is always accessible from cache. Further, if a user passes their own ICs, this should be catchable by this system, which would store what's necessary from those passed ICs prior to performing the full calculation. I guess the downside is that the onus is on us to correctly identify the minimal set of data to be cached. On the flip side, the positive is that it should be easy to use and manage for the user. I don't see this being too problematic, though, given the built infrastructure of get_requirements etc.

I think I tend to lean towards 3. Although it may be less flexible, I think it minimises the most issues on the user side. In any case, I suspect we may have to review the implementations of the higher-level functions to decide whether they are efficient (e.g. how the perturb field is calculated).
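The three-valued write policy of option 3 could be sketched roughly as follows (a toy model: the "minimal" keyword, the MINIMAL_SETS table, and keys_to_write are all hypothetical stand-ins for what get_requirements would drive):

```python
# Sketch of a granular caching policy: write=True/False/"minimal".
# MINIMAL_SETS stands in for what get_requirements() would report per struct.
MINIMAL_SETS = {
    "InitialConditions": {"hires_density", "lowres_vx", "lowres_vy", "lowres_vz"},
}

def keys_to_write(struct_name, all_keys, write):
    """Decide which datasets of a struct to persist under a given policy."""
    if write is True:
        return set(all_keys)          # cache everything (current behaviour)
    if write is False:
        return set()                  # cache nothing (current behaviour)
    if write == "minimal":
        # cache only what later steps will need
        return set(all_keys) & MINIMAL_SETS.get(struct_name, set(all_keys))
    raise ValueError(f"unknown write policy: {write!r}")

keys = {"hires_density", "hires_vx", "lowres_vx", "lowres_vy", "lowres_vz"}
print(sorted(keys_to_write("InitialConditions", keys, "minimal")))
# ['hires_density', 'lowres_vx', 'lowres_vy', 'lowres_vz']
```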

steven-murray commented 3 years ago

Thanks @BradGreig. I think I also lean towards 3 -- having more granular writing is always useful anyway, and we could exploit it to optimize this issue. I'll have to think more about how exactly to implement this, and what the precise drawbacks are. I don't expect to be able to get to it til Thursday.

steven-murray commented 3 years ago

@BradGreig I've now added in the logic to run_coeval and run_lightcone to actually purge the IC memory as needed. I also moved the computation of the perturb fields out of the big loop, so they're all pre-computed. This is not necessarily perfectly efficient, but it's the best I could think of for now.

I'm sure that within the current structure, there are more memory gains to be made. I figure you'll know where they might be.

Let me know if the code doesn't make sense!

steven-murray commented 3 years ago

@BradGreig thanks for the review. I've fixed those couple of things you mentioned. I'm not sure why integration tests are failing. Nothing should be changed about the results of the computation. And certainly nothing completely wild is happening -- the new results are "closeish" to the old ones, just not close enough. I'll have to investigate.

steven-murray commented 3 years ago

A progress report.

I've modified the produce_integration_test_data script so that the coeval and lightcone tests now output power spectra for lots of intermediate boxes, instead of just the brightness temperature, which should make it easier to identify where future breakages are. I've also added a little plot output to the tests (which can be turned off), so the original vs. new results can be plotted to check the overall features.

After fixing some of the bugs, I have gotten most of the tests to pass, but some are still breaking, and sometimes segfaulting. Unfortunately they seem to be ones that use either Ts or mini-halos, which makes them harder to debug :-(.

Here's a plot of the power spectra for lots of quantities in coeval:

(plot attachment: tests/test_integration_features.py--test_power_spectra_coeval[18-kwargs12])

This is for the test:

18,
        {
            "z_heat_max": 25,
            "USE_MINI_HALOS": True,
            "USE_MASS_DEPENDENT_ZETA": True,
            "INHOMO_RECO": True,
            "USE_TS_FLUCT": True,
            "zprime_step_factor": 1.1,
            "N_THREADS": 4,
            "USE_FFTW_WISDOM": True,
            "NUM_FILTER_STEPS_FOR_Ts": 8,
        },

You can see that it's just in ionize_box that things start breaking, especially dNrec. I don't think I've changed anything in the C code, so it's got to be something with how memory is initialized. Will keep digging...

steven-murray commented 3 years ago

@BradGreig and @qyx268: I've made some further updates that have fixed many of the problems. At the moment, the z_re_box is failing its power spectrum comparison (I know a power spectrum of z_re_box isn't really useful, but it should still give the same answer!).

I'm a bit confused because I added some SUPER_DEBUG statements into the code (and added the same ones into the master branch), and I'm printing out the values of z_re_box in every R loop, and they are exactly the same between this branch and master, but the power spectrum comes out different.

Do you have any idea what might cause this? I'll keep working on it next week.

qyx268 commented 3 years ago

If they are exactly the same, then the power spectra should be the same, shouldn't they? The only two things that are special about z_re_box are that (1) it is initialized to -1 instead of 0, unlike Gamma12_box or MFP_box; and (2) when making the LC of z_re, we use np.max instead of np.mean (check _interpolate_in_redshift). I'm not sure if the new PS or the old one in master has considered this. But in principle, if there is a mismatch in how the z_re LC is made between here and master, only the lightcone PS should fail...
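The mean-vs-max distinction matters a lot for z_re because unionized cells sit at -1. A toy numpy illustration (the slice shapes and values are simplified relative to _interpolate_in_redshift in py21cmfast):

```python
import numpy as np

# Combining two neighbouring coeval z_re slices: np.mean drags ionized cells
# toward the -1 sentinel of their unionized neighbour slice, while np.max
# keeps the actual ionization redshift.
z_re_lo = np.array([-1.0, -1.0, 7.5, 8.0])   # slice at the lower redshift
z_re_hi = np.array([-1.0, 8.5, -1.0, 8.5])   # slice at the higher redshift

mean_combined = np.mean([z_re_lo, z_re_hi], axis=0)
max_combined = np.max([z_re_lo, z_re_hi], axis=0)

print(mean_combined.tolist())  # [-1.0, 3.75, 3.25, 8.25]
print(max_combined.tolist())   # [-1.0, 8.5, 7.5, 8.5]
```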

I can't really tell from the test log if the PS failure is because of z_re. It looks like there are other tests failing too, e.g. in ubuntu-3.8:

tests/test_initial_conditions.py::test_box_shape FAILED                  [ 19%]
tests/test_initial_conditions.py::test_modified_cosmo PASSED             [ 20%]
tests/test_initial_conditions.py::test_transfer_function FAILED          [ 20%]
steven-murray commented 3 years ago

Thanks @qyx268 and sorry for my previous very brief report. I'll fill out some details here:

So, what I have going at the moment locally is that I have a branch directly off of master where I'm doing the production of the coevals/lightcones (the only things changed in that branch are logging statements essentially). Then I have this branch.

I have also added the ability to plot within the tests, which has helped considerably to identify where things are going wrong (this is turned off on Github Actions, but you can turn it on locally with --plots when calling pytest). I've also added (to both branches) that many boxes are saved (both for coeval and lightcone), and we test on all of them (and make power spectra plots for all of them, if that feature is turned on). This has helped considerably. Also, I've added a bunch of logging statements to both branches, printing out summaries of different boxes at different places in the code. I then can run a particular test case with both branches, and check the differences.

All the coeval boxes are passing fine, it's just the lightcones that are failing. Interestingly, it seems that (some) tests that are failing when I run locally are passing on Github.

Like I said, z_re_box is failing locally for the very first integration test (i.e. kwargs0). It seems to be passing on Github, so that's even more confusing. But the debug statements show perfect agreement between both branches, even locally. So it's very confusing... I'll look into this max vs mean thing!

(plot attachment: tests.test_integration_features.py--test_power_spectra_lightcone[12-kwargs0].pdf)

qyx268 commented 3 years ago

Thanks for the explanation @steven-murray. Looking at the small difference, I think the two branches might be using the same method (possibly both using the mean, which also needs to be changed to max if people really require a LC of z_re).

At such a high redshift, I'm actually expecting 99% of the cells to have z_re = -1 because they have never been ionized. The unit in the PS is not very useful in this case. Could you maybe compare the two z_re LCs directly (or have you already)? I'm wondering if powerbox (or whatever power spectrum tool you are using) can handle an input mostly full of -1...
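A field that is mostly -1 shouldn't break an FFT-based estimator in itself: a quick numpy stand-in (not powerbox) shows the constant part only feeds the zero mode, and all modes stay finite:

```python
import numpy as np

# Sanity check: a box that is ~99% -1 (unionized) with a few ionized cells.
rng = np.random.default_rng(0)
field = np.full((16, 16, 16), -1.0)
ionized = rng.random(field.shape) < 0.01   # ~1% of cells reionized
field[ionized] = 8.0

fk = np.fft.fftn(field)
power = np.abs(fk) ** 2
# The zero mode carries the (large) mean; every other mode is well-defined.
print(bool(np.isfinite(power).all()))  # True
```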

steven-murray commented 3 years ago

Hey @qyx268, yeah, I'll check whether it's using mean/max. I also had the thought that having mostly -1's might be a problem for powerbox, but I doubt it would. Unfortunately, I don't have time to look further into this til Thursday.

BradGreig commented 3 years ago

There are a couple of different facets to this that need considering, and this all depends on desired behaviour.

But first, even running steps 2 and 3 independently is producing different results for the brightness temperature off the master branch (with an empty cache each time). Though the difference is minor, these should be exact given it's the same computation. I don't know why that is happening, but it leads me to believe there must be something changing (possibly a memory leak or something somewhere). Ionisation fraction seems fine; it's just the brightness temperature.

Now, back to the question at hand.

Firstly, as far as I can tell this has only appeared because the integration tests now check all quantities rather than just the brightness temperature. Meaning, the brightness temperature will be passing all checks. So the question is whether it even makes sense to check all quantities. I can see the value of it, but it only makes sense if the quantity is always stored.

Now, there is no harm in keeping the z_re_box even if INHOMO_RECO=False or USE_MINI_HALOS=False. What z_re_box does is store a cell's ionisation redshift. This is perfectly reasonable to keep for any light-cone options. Just because it's not used in a calculation doesn't mean it isn't a useful quantity to have access to. In that regard I'd vote against anything that just sets previous_ionize_box = False.

Importantly, z_re_box should not be compared between coeval and lightcone quantities (for the obvious reason that the ionisation redshifts will differ).

Now, given I am noticing differences in the outputs from various runs, I think that is going to make it hard to check what's happening with z_re_box.

I wonder if there is a way we can store some additional metadata into the stored files, so it knows when it finds an existing snapshot whether that was in a lightcone or co-eval context (so that it knows whether or not to trust these evolving quantities). Not sure if it already does this in the hash.

qyx268 commented 3 years ago

> There are a couple of different facets to this that need considering, and this all depends on desired behaviour.
>
> But first, even running steps 2 and 3 independently is producing different results for the brightness temperature off the master branch (with an empty cache each time). Though the difference is minor, these should be exact given it's the same computation. I don't know why that is happening, but it leads me to believe there must be something changing (possibly a memory leak or something somewhere). Ionisation fraction seems fine; it's just the brightness temperature.
>
> Now, back to the question at hand.
>
> Firstly, as far as I can tell this has only appeared because the integration tests now check all quantities rather than just the brightness temperature. Meaning, the brightness temperature will be passing all checks. So the question is whether it even makes sense to check all quantities. I can see the value of it, but it only makes sense if the quantity is always stored.
>
> Now, there is no harm in keeping the z_re_box even if INHOMO_RECO=False or USE_MINI_HALOS=False. What z_re_box does is store a cell's ionisation redshift. This is perfectly reasonable to keep for any light-cone options. Just because it's not used in a calculation doesn't mean it isn't a useful quantity to have access to. In that regard I'd vote against anything that just sets previous_ionize_box = False.
>
> Importantly, z_re_box should not be compared between coeval and lightcone quantities (for the obvious reason that the ionisation redshifts will differ).
>
> Now, given I am noticing differences in the outputs from various runs, I think that is going to make it hard to check what's happening with z_re_box.
>
> I wonder if there is a way we can store some additional metadata into the stored files, so it knows when it finds an existing snapshot whether that was in a lightcone or co-eval context (so that it knows whether or not to trust these evolving quantities). Not sure if it already does this in the hash.

This is supposed to be in #234.

BradGreig commented 3 years ago

Haha, yes, that is correct @qyx268. I consider them intertwined.

steven-murray commented 3 years ago

@BradGreig and @qyx268. There appears to be a segfault in at least one of the halo field integration tests.

For reference the test that is segfaulting (on Github actions) is:

8.5,
        {
            "USE_MASS_DEPENDENT_ZETA": True,
            "USE_HALO_FIELD": True,
            "USE_TS_FLUCT": True,
            "z_heat_max": 25,
            "zprime_step_factor": 1.1,
        },

I ran it through valgrind (takes a looooong time) which output the following two errors:

==592184== Use of uninitialised value of size 8
==592184==    at 0x48E9D31: __printf_fp_l (in /usr/lib/libc-2.33.so)
==592184==    by 0x48FEE03: printf_positional (in /usr/lib/libc-2.33.so)
==592184==    by 0x490077C: __vfprintf_internal (in /usr/lib/libc-2.33.so)
==592184==    by 0x4902BB3: buffered_vfprintf (in /usr/lib/libc-2.33.so)
==592184==    by 0x48ED569: fprintf (in /usr/lib/libc-2.33.so)
==592184==    by 0x51B7E019: ComputeInitialConditions (GenerateICs.c:256)
==592184==    by 0x51B80A6F: _cffi_f_ComputeInitialConditions (py21cmfast.c_21cmfast.c:1213)
==592184==    by 0x23FE86: cfunction_call_varargs (call.c:772)
==592184==    by 0x23FE86: PyCFunction_Call (call.c:788)
==592184==    by 0x2D92E2: do_call_core (ceval.c:4641)
==592184==    by 0x2D92E2: _PyEval_EvalFrameDefault (ceval.c:3191)
==592184==    by 0x21DB47: _PyEval_EvalCodeWithName (ceval.c:3930)
==592184==    by 0x2712B6: _PyFunction_FastCallKeywords (call.c:433)
==592184==    by 0x2D4B83: call_function (ceval.c:4616)
==592184==    by 0x2D4B83: _PyEval_EvalFrameDefault (ceval.c:3139)
==592184==  Uninitialised value was created by a stack allocation
==592184==    at 0x51B7D82E: ComputeInitialConditions (GenerateICs.c:99)

==592184== Conditional jump or move depends on uninitialised value(s)
==592184==    at 0x48E9D35: __printf_fp_l (in /usr/lib/libc-2.33.so)
==592184==    by 0x48FEE03: printf_positional (in /usr/lib/libc-2.33.so)
==592184==    by 0x490077C: __vfprintf_internal (in /usr/lib/libc-2.33.so)
==592184==    by 0x4902BB3: buffered_vfprintf (in /usr/lib/libc-2.33.so)
==592184==    by 0x48ED569: fprintf (in /usr/lib/libc-2.33.so)
==592184==    by 0x51B7E019: ComputeInitialConditions (GenerateICs.c:256)
==592184==    by 0x51B80A6F: _cffi_f_ComputeInitialConditions (py21cmfast.c_21cmfast.c:1213)
==592184==    by 0x23FE86: cfunction_call_varargs (call.c:772)
==592184==    by 0x23FE86: PyCFunction_Call (call.c:788)
==592184==    by 0x2D92E2: do_call_core (ceval.c:4641)
==592184==    by 0x2D92E2: _PyEval_EvalFrameDefault (ceval.c:3191)
==592184==    by 0x21DB47: _PyEval_EvalCodeWithName (ceval.c:3930)
==592184==    by 0x2712B6: _PyFunction_FastCallKeywords (call.c:433)
==592184==    by 0x2D4B83: call_function (ceval.c:4616)
==592184==    by 0x2D4B83: _PyEval_EvalFrameDefault (ceval.c:3139)
==592184==  Uninitialised value was created by a stack allocation
==592184==    at 0x51B7D82E: ComputeInitialConditions (GenerateICs.c:99)

==592184== Invalid read of size 4
==592184==    at 0x205D1A: address_in_range (obmalloc.c:1363)
==592184==    by 0x205D1A: pymalloc_free.isra.0 (obmalloc.c:1633)
==592184==    by 0x205D1A: _PyObject_Free (obmalloc.c:1838)
==592184==    by 0x205D1A: PyMem_Free (obmalloc.c:577)
==592184==    by 0x21BE97: state_fini (_sre.c:478)
==592184==    by 0x2ECC5C: scanner_dealloc (_sre.c:2399)
==592184==    by 0x289397: meth_dealloc (methodobject.c:91)
==592184==    by 0x2F2C39: calliter_iternext (iterobject.c:230)
==592184==    by 0x2BF6B3: list_extend (listobject.c:901)
==592184==    by 0x2BF6B3: list___init___impl (listobject.c:2688)
==592184==    by 0x2BF6B3: list___init__ (listobject.c.h:247)
==592184==    by 0x285F77: type_call (typeobject.c:971)
==592184==    by 0x285F77: _PyObject_FastCallKeywords (call.c:199)
==592184==    by 0x2D8145: call_function (ceval.c:4619)
==592184==    by 0x2D8145: _PyEval_EvalFrameDefault (ceval.c:3124)
==592184==    by 0x27102A: function_code_fastcall (call.c:283)
==592184==    by 0x27102A: _PyFunction_FastCallKeywords (call.c:408)
==592184==    by 0x2D3AC5: call_function (ceval.c:4616)
==592184==    by 0x2D3AC5: _PyEval_EvalFrameDefault (ceval.c:3124)
==592184==    by 0x27102A: function_code_fastcall (call.c:283)
==592184==    by 0x27102A: _PyFunction_FastCallKeywords (call.c:408)
==592184==    by 0x2D3D3F: call_function (ceval.c:4616)
==592184==    by 0x2D3D3F: _PyEval_EvalFrameDefault (ceval.c:3110)
==592184==  Address 0x727b020 is 608 bytes inside a block of size 720 free'd
==592184==    at 0x483F9AB: free (vg_replace_malloc.c:538)
==592184==    by 0x51FD25BF: ??? (in /usr/lib/libfftw3f.so.3.6.9)
==592184==    by 0x52023066: ??? (in /usr/lib/libfftw3f.so.3.6.9)
==592184==    by 0x5202145E: ??? (in /usr/lib/libfftw3f.so.3.6.9)
==592184==    by 0x5201FE06: ??? (in /usr/lib/libfftw3f.so.3.6.9)
==592184==    by 0x5201FE06: ??? (in /usr/lib/libfftw3f.so.3.6.9)
==592184==    by 0x51B58136: dft_r2c_cube (dft.c:48)
==592184==    by 0x51B7A7E5: ComputeHaloField (FindHaloes.c:108)
==592184==    by 0x51B80712: _cffi_f_ComputeHaloField (py21cmfast.c_21cmfast.c:1143)
==592184==    by 0x23FE86: cfunction_call_varargs (call.c:772)
==592184==    by 0x23FE86: PyCFunction_Call (call.c:788)
==592184==    by 0x2D92E2: do_call_core (ceval.c:4641)
==592184==    by 0x2D92E2: _PyEval_EvalFrameDefault (ceval.c:3191)
==592184==    by 0x21DB47: _PyEval_EvalCodeWithName (ceval.c:3930)
==592184==  Block was alloc'd at
==592184==    at 0x4841118: memalign (vg_replace_malloc.c:907)
==592184==    by 0x51FC81C5: fftwf_malloc_plain (in /usr/lib/libfftw3f.so.3.6.9)
==592184==    by 0x51FD24EC: ??? (in /usr/lib/libfftw3f.so.3.6.9)
==592184==    by 0x52023066: ??? (in /usr/lib/libfftw3f.so.3.6.9)
==592184==    by 0x5202145E: ??? (in /usr/lib/libfftw3f.so.3.6.9)
==592184==    by 0x5201FE06: ??? (in /usr/lib/libfftw3f.so.3.6.9)
==592184==    by 0x5201FE06: ??? (in /usr/lib/libfftw3f.so.3.6.9)
==592184==    by 0x51B58136: dft_r2c_cube (dft.c:48)
==592184==    by 0x51B7A7E5: ComputeHaloField (FindHaloes.c:108)
==592184==    by 0x51B80712: _cffi_f_ComputeHaloField (py21cmfast.c_21cmfast.c:1143)
==592184==    by 0x23FE86: cfunction_call_varargs (call.c:772)
==592184==    by 0x23FE86: PyCFunction_Call (call.c:788)
==592184==    by 0x2D92E2: do_call_core (ceval.c:4641)
==592184==    by 0x2D92E2: _PyEval_EvalFrameDefault (ceval.c:3191)

The top two sections here point to the same debug print statement in initial conditions where we print out the RNG numbers a and b. I can't for the life of me think of why these numbers would not be initialized -- the code seems to say they definitely are. But anyway, I don't think that's the real problem here. The last block points to the following bit of code:

        // allocate array for the k-space box
        density_field = (fftwf_complex *) fftwf_malloc(sizeof(fftwf_complex)*KSPACE_NUM_PIXELS);
        density_field_saved = (fftwf_complex *) fftwf_malloc(sizeof(fftwf_complex)*KSPACE_NUM_PIXELS);

        // allocate memory for the boolean in_halo box
        in_halo = (char *) malloc(sizeof(char)*TOT_NUM_PIXELS);

        // initialize
        memset(in_halo, 0, sizeof(char)*TOT_NUM_PIXELS);

        if(global_params.OPTIMIZE) {
            forbidden = (char *) malloc(sizeof(char)*TOT_NUM_PIXELS);
        }

#pragma omp parallel shared(boxes,density_field) private(i,j,k) num_threads(user_params->N_THREADS)
        {
#pragma omp for
            for (i=0; i<user_params->DIM; i++){
                for (j=0; j<user_params->DIM; j++){
                    for (k=0; k<user_params->DIM; k++){
                        *((float *)density_field + R_FFT_INDEX(i,j,k)) = *((float *)boxes->hires_density + R_INDEX(i,j,k));
                    }
                }
            }
        }

        dft_r2c_cube(user_params->USE_FFTW_WISDOM, user_params->DIM, user_params->N_THREADS, density_field);

It's saying that something in dft_r2c_cube is not initialized. But the density field is definitely initialized above, and I make sure that the hires_density box is definitely computed. Any ideas? I'm kinda stumped.
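One thing worth remembering about the copy loop above: the in-place r2c layout that R_FFT_INDEX compensates for pads each row, so the pad elements are never written by the loop and stay uninitialised until the FFT. A numpy toy model of that indexing (not the 21cmFAST macros themselves; DIM and the index helpers are simplified):

```python
import numpy as np

# Toy model of the padded in-place r2c layout: each row of DIM real values
# lives in a row of 2*(DIM//2 + 1) floats, so (PAD - DIM) floats per row are
# never touched by the copy loop (NaN here; uninitialised memory in C).
DIM = 4
PAD = 2 * (DIM // 2 + 1)  # 6 floats per row for DIM = 4

def r_index(i, j, k):      # index into the unpadded real box
    return (i * DIM + j) * DIM + k

def r_fft_index(i, j, k):  # index into the padded FFT box
    return (i * DIM + j) * PAD + k

hires = np.arange(DIM ** 3, dtype=np.float32)
padded = np.full(DIM * DIM * PAD, np.nan, dtype=np.float32)
for i in range(DIM):
    for j in range(DIM):
        for k in range(DIM):
            padded[r_fft_index(i, j, k)] = hires[r_index(i, j, k)]

print(int(np.isnan(padded).sum()))  # 32, i.e. (PAD - DIM) pad floats per row
```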

The other thing is if I just run this test locally it does not segfault (but valgrind does report that error nonetheless).

BradGreig commented 3 years ago

Hey @steven-murray, yeah, that is certainly very odd behaviour. Perhaps something else is the problem, which doesn't cause an issue until here (but occurs earlier). Do you have a log file for the segfault location (or any other information that comes with the segfault)? Also, is it only this test, or if you suppress others does one of the other USE_HALO_FIELD tests fail? The test previous to this one also uses USE_HALO_FIELD=True, so I don't know if it is specific to USE_HALO_FIELD. Perhaps it has to do with the order in which the calls happen?

Also, you should be able to construct a simple MWE for valgrind, with just an initial conditions and then call to compute halos. Especially if the problem is specific to the reported location (which I cannot see why).

So the RNG values a and b have been behaving oddly for me too. When multi-threading, they actually print as zeros rather than actual numbers (only in the log output; they work properly in the loop). I'm not sure of the source of this, whether it has to do with these being threaded and private and thus the compiler not actually materialising the quantities after the #pragma. But yeah, the log always prints zeros.

A couple of things I would suggest for the dft_r2c_cube issue: maybe try removing the multi-threading (#pragma) of the density-filling loop (i.e. just do it on a single core), or try doing some operation with hires_density (maybe just compute the mean) earlier in the function (in ComputeHaloField) to see if those quantities are correctly defined.

Unfortunately these are only random suggestions as nothing definitively stands out to me.

Let me know how you get along, if you are still having issues, maybe I can try valgrind on my end on Monday to see if I can narrow down the source.

steven-murray commented 3 years ago

Thanks @BradGreig, these are all useful suggestions. I'll try all of them today and let you know. Unfortunately it takes a while to run. I'll try making a nice MWE, but since I don't exactly know the cause, I'm afraid that changing anything might "fix" it.

As far as the log is concerned that should go with the valgrind output... for some reason the way I call it doesn't output the normal 21cmFAST log. I'll try to fix that when I run it again.

steven-murray commented 3 years ago

Report on tests that I ran:

  1. Running valgrind over the exact same test on the master branch* gave absolutely no output, so it definitely seems like a problem with this branch.
  2. On this branch (granular-io) I temporarily updated run_lightcone to be able to continue in its loop without computing anything else but the perturb_halo_field and made a small MWE that runs this function with the input parameters I quoted above. I do get the same output from valgrind as for the full test (and in much less time): valgrind_halo_field_mdz_ptonly_filtered.txt. This also has the DEBUG logging turned on. I haven't gone through it in more detail yet.
  3. I took away the LOG_DEBUG of the random parameters in GenerateICs.c because it generates too much noise. I also added what you mention -- taking the mean of the hires_density just above the dft call in FindHaloes.c. This is the valgrind output: valgrind_halo_field_mdz_ptonly_compute_mean_filtered.txt. Not sure what to make of it, but it's not reporting anything when I compute the mean, so I don't think it's to do with the hires density. It seems to be something about the way the DFT is happening, but I don't understand why.

*It's not really the master branch but my branch off of master, where I've added some extra debug statements and prints etc., but nothing that should change results.

qyx268 commented 3 years ago

@steven-murray, have you updated this branch here? Or were these tests only done locally, where this branch has more updates? I'm asking because the Actions tests are not failing on halo_field.

BradGreig commented 3 years ago

Hey @steven-murray, yeah, given the source and location of the error (and the length of time valgrind takes), I thought a MWE would be extremely useful, especially since the error happens so early.

My logic in having you compute the mean value was to have the code try to access the data from memory, and thus produce an error if there was an initialisation problem with that data.

I would suggest extending this to do the same for density_field below its original assignment, and also within the dft_r2c_cube function prior to the FFT. I would also suggest outputting (each with its own print statement) the other arguments to dft_r2c_cube, to check if they have been corrupted in some way.

At this point I'm just suggesting trialling random things to see what aspect of the code is causing the problems. I'll see if I can get to looking at it today, but no guarantees.

steven-murray commented 3 years ago

@qyx268 it's consistently segfaulting at the halo_field_mdz test for which I gave the parameters above.

@BradGreig yeah, that's good logic. It looks like it wasn't that hires_density is uninitialized, so it must be something else. I'll give your suggestions a try.

steven-murray commented 3 years ago

Trying to figure out when it segfaults and doesn't:

  1. Segfaults at halo_field_mdz if all tests are run.
  2. If all tests except test_power_spectra_coeval are run: no segfault (but 7 tests do fail).
  3. If just the integration tests are run (coeval and lightcone), it segfaults in the same spot.
  4. If I just run the integration tests with "halo_field" in the name (coeval and lightcone), I get a segfault at the same spot.
  5. If I just run the coeval "halo_field_mdz" test and then the lightcone one, I get the segfault.
BradGreig commented 3 years ago

Hey @steven-murray, well, the good thing is it seems fairly consistent in its failing. The bad thing is we still have no idea why. I haven't had the time yet to look at this myself to give you a hand, but hopefully by the end of the week.

steven-murray commented 3 years ago

@BradGreig yeah, it is good that it's consistent. I ran just the two tests (coeval then lightcone) through valgrind, but it gave me valgrind's "The 'impossible' has happened!" error without a detailed report about where things went wrong :-(

However, given that the halo field really only uses the initial conditions, and the initial conditions are stored in the coeval, I'm certain it has something to do with this.

steven-murray commented 3 years ago

@BradGreig and @qyx268 I have fixed the error... well, kind of.

The main thing wrong turned out to be that the determine_halo_list function was not being passed the same direc and hooks that all the other functions were. In effect, this meant that by default it was ALWAYS caching to the default directory, which caused problems when running both coeval and then lightcones. This is now fixed, and the tests now only cache the initial conditions (which is what they're supposed to do).
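To illustrate the shape of the fix (a toy sketch with invented helper names, not the real py21cmfast internals): every compute function should resolve its cache location from the same `direc` argument that is threaded through the call chain, rather than any one of them silently falling back to a global default directory.

```python
import os
import tempfile

def _cache_path(direc, name):
    # Every function resolves its cache path from the *same* direc argument.
    return os.path.join(direc, name + ".h5")

def determine_halo_list(direc, hooks=None):
    # Before the fix, this function effectively ignored `direc` and always
    # wrote to a default directory; now it caches alongside everything else.
    path = _cache_path(direc, "halo_field")
    with open(path, "w") as f:
        f.write("halo data")
    return path

with tempfile.TemporaryDirectory() as direc:
    path = determine_halo_list(direc)
    cached_in_direc = os.path.dirname(path) == direc
```

The point of the sketch is that running a coeval and then a lightcone with different `direc` settings can no longer cross-contaminate a shared default cache.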

However, it also means that there's something wrong with the caching of the HaloField, because it segfaults if it tries to read it from cache and compute again. I have not figured out WHY this is an issue. However, it should only show up very occasionally, and I think we can try to deal with it later.

BradGreig commented 3 years ago

Hey @steven-murray, excellent progress!

Hmm, I'm not sure I'm comfortable with leaving it as is with it occasionally crashing. Though I can understand wanting a break from it given the work you have put into it. Could you give a brief description of the cause of the error and a MWE? Unless of course it is exactly the same as it was before. I can maybe try and set aside some time to have a look at this in greater detail. Maybe a different set of eyes might illuminate something.

Could it have to do with how the memory for the halos is allocated? That on the first pass we don't know the size, but when reading from cache it gets confused?

steven-murray commented 3 years ago

@BradGreig looks like it's still segfaulting on halo field stuff, so you're gonna get your wish of fixing this properly :-P

Yeah, the weird thing about the halo field class is that it has some arrays that are allocated in C rather than Python. This seems like the obvious starting point for finding what's going wrong. The tests that segfaulted on GH did not segfault on my own system, but perhaps it's again due to some combination of tests. I'll look into it. By all means, if you have some time it'd be great if you could run some tests on your system!

steven-murray commented 3 years ago

@BradGreig update after I looked more closely at the GH logs: it's actually the perturb_field_data tests that are segfaulting, and ONLY on Ubuntu 🤷🏻 .

Running just the perturb data tests through valgrind locally doesn't yield much -- I get that same pesky "Invalid read" from the dft, but no segfault or anything useful. Also, that invalid read is actually occurring in Python, from memory that was supposedly deallocated within the DFT in C. I have no idea how to track that. I'm going to try running the tests on my local Ubuntu cluster to see if I get anything.
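As an aside, one general recipe for getting valgrind to attribute allocations correctly in a CPython extension is to disable pymalloc first, so valgrind sees individual malloc/free calls instead of pymalloc's arenas (this assumes valgrind is installed; the `-k` test selection below is illustrative, not the exact test name):

```shell
# Disable CPython's small-object allocator so valgrind tracks every
# allocation individually rather than pymalloc's pooled arenas.
export PYTHONMALLOC=malloc

# Run only the failing tests under valgrind (guarded so the snippet is a
# no-op on machines without valgrind).
if command -v valgrind >/dev/null 2>&1; then
    valgrind --leak-check=full --track-origins=yes \
        python -m pytest -k "perturb_field_data"
fi
```

Without `PYTHONMALLOC=malloc`, valgrind tends to report many false positives (and can miss real errors) inside pymalloc's arena bookkeeping.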

codecov[bot] commented 3 years ago

Codecov Report

Merging #220 (1705d35) into master (5544c45) will decrease coverage by 1.13%. The diff coverage is 85.44%.

:exclamation: Current head 1705d35 differs from pull request most recent head 8268e0f. Consider uploading reports for the commit 8268e0f to get more accurate results.

@@            Coverage Diff             @@
##           master     #220      +/-   ##
==========================================
- Coverage   86.53%   85.40%   -1.14%     
==========================================
  Files          12       12              
  Lines        2444     2733     +289     
==========================================
+ Hits         2115     2334     +219     
- Misses        329      399      +70     
Impacted Files Coverage Δ
src/py21cmfast/inputs.py 89.50% <ø> (-1.11%) :arrow_down:
src/py21cmfast/_utils.py 86.58% <78.28%> (-5.22%) :arrow_down:
src/py21cmfast/wrapper.py 87.41% <87.93%> (-1.24%) :arrow_down:
src/py21cmfast/outputs.py 92.16% <91.53%> (+0.26%) :arrow_up:

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data. Last update e9b044c...8268e0f.

steven-murray commented 3 years ago

@BradGreig could you give this a quick once-over again? Everything is working and I'll add another test tomorrow to increase coverage, but I'd really like to get it merged in the next ~24 hours!

steven-murray commented 3 years ago

@BradGreig Actually, it turned out I hadn't fixed the crashing problem, but I have now. The problem was that the C-based memory fields, if read from a file after caching, would NOT be controlled by C, and were being double-freed when the object was deleted (because I had assumed that the memory was controlled by C and used the C-based free_phf function, etc.).

This is now fixed by keeping better track of who owns the memory, and its state. I also updated the code so that we don't need to specify free_phf and co.: we just use the free function on all the C-based pointers from Python.
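The ownership-tracking fix can be sketched roughly like this (a toy model with invented names, not the actual py21cmfast code): each array records whether its memory was allocated by C or by Python, and the C-side free is only ever invoked on C-owned pointers, so nothing is freed twice.

```python
class CArray:
    """Toy stand-in for a pointer held by a C-backed struct."""

    def __init__(self, owned_by_c):
        self.owned_by_c = owned_by_c
        self.freed = False

    def free(self):
        # Simulates calling the C library's free on this pointer.
        if self.freed:
            raise RuntimeError("double free!")
        self.freed = True

class HaloField:
    def __init__(self):
        self.arrays = {}

    def compute(self):
        # Arrays filled during computation are allocated in C.
        self.arrays["halo_masses"] = CArray(owned_by_c=True)

    def read_from_cache(self):
        # Arrays read back from file are allocated by Python, NOT C.
        self.arrays["halo_masses"] = CArray(owned_by_c=False)

    def __del__(self):
        # Only call the C free on memory that C actually owns.
        for arr in self.arrays.values():
            if arr.owned_by_c and not arr.freed:
                arr.free()

# C-owned memory is freed exactly once on deletion...
hf = HaloField()
hf.compute()
c_arr = hf.arrays["halo_masses"]
del hf
c_owned_freed = c_arr.freed

# ...while Python-owned (cache-loaded) memory is left to Python's GC.
hf = HaloField()
hf.read_from_cache()
py_arr = hf.arrays["halo_masses"]
del hf
python_owned_untouched = not py_arr.freed
```

The pre-fix bug corresponds to `__del__` unconditionally calling `free()` on every array, which double-frees the Python-owned ones loaded from cache.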

I added a test specifically to test this behaviour (and it initially segfaulted). It passes now!