21cmfast / 21cmFAST

Official repository for 21cmFAST: a code for generating fast simulations of the cosmological 21cm signal
MIT License
58 stars 38 forks source link

[BUG] Ionization Box changes depending on your cache #234

Open steven-murray opened 3 years ago

steven-murray commented 3 years ago

Describe the bug: I came across this when testing for #220. In that PR I was finding that if I test just the lightcones, they would fail, but if I test both coeval and lightcones, they would all pass. The tests run the coevals first, so obviously the lightcones are having some dependence on the coevals. Now, this is somewhat by design, because I want to share the initial conditions at least between the runs to make things faster. However, doing so should not break the tests in this way.

I tracked it down to the following. The ionize_box routine takes a previous_ionize_box instance. This is not necessary unless doing INHOMO_RECO or USE_TS_FLUCT. However, in a lightcone, regardless of whether using these flags, the previous_ionize_box is passed through (because it's been calculated anyway). It seems that if it is passed through, the C routine uses the previous box regardless of whether INHOMO_RECO or USE_TS_FLUCT are true. This makes a difference between a coeval run directly at a redshift, and a lightcone that's being run up to that same redshift.

To Reproduce: Steps to reproduce the behavior:

  1. Run a coeval: run_coeval(redshift=12, write=True)
  2. Run a lightcone: lc = run_lightcone(redshift=12)
  3. Run another lightcone that doesn't use the cache: lc2= run_lightcone(redshift=12, regenerate=True)
  4. Compare eg. the z_re_box: np.sum(lc.z_re_box), np.sum(lc2.z_re_box)

Expected behavior:

I'm not perfectly sure what to expect, other than that I think we should expect the output to be the same for the same inputs, regardless of what's in the cache. I can see a few ways to fix this:

  1. In ionize_box, always use the dummy previous_ionization_box if not using INHOMO_RECO or USE_TS_FLUCT (even if it's passed in).
  2. Just in run_lightcone ensure that previous_ionization_box=None if the above conditions hold.
  3. Don't worry about it (it's not a big different and only comes up in some corner cases)
  4. Change the C code to not use the previous box if the conditions aren't true.
steven-murray commented 3 years ago

Another possibility is that we should expect the output of ionize_box to be different if you give it a different previous_ionize_box (even if INHOMO_RECO and USE_TS_FLUCT are False), but that the hash in the cache should also then be different.

steven-murray commented 3 years ago

Thinking more about the possibility of specifying the hash differently based on the previous box, it would require some significant re-plumbing. Namely, the instance of the box would have to carry references to all the output structs that form its input (as well as the base input parameters). That is, when instantiating the box, we'd have to pass in all its dependent structs, eg.

ib = IonizationBox(
    redshift=...,
    user_params=....,
    previous_ionization_box=...
)

At the moment, we only pass in the basic input parameters to define the box, and pass the other output structs only when we call compute. The difficulty here is that typically just from the inputs, we know whether such a box exists in cache, and can therefore read it in before we compute any of its necessary dependencies. If we need its dependencies to calculate its hash, then those need to be read in or computed before we can check whether this box exists in cache.

Conversely, we could just depend on a couple of extra input parameters from the boxes it depends on. For example, we could depend only on the previous_ionization_box.redshift. But this becomes a bit hacky -- you have to individually specify these for each struct, and you still need to calculate such things before check the box's existence in cache.

Long story short -- it's a bit tricky. We should decide on how to do this though. Calling on @BradGreig and @qyx268 for ideas.

qyx268 commented 3 years ago

hmm... I think by the definition of if not (INHOMO_RECO or USE_TS_FLUCT), we should go for option 4 -- Change the C code to not use the previous box if the conditions aren't true.

Looking at IonisationBox.c, I think this means we only need to change these lines:

                                if (previous_ionize_box->z_re_box[HII_R_INDEX(x,y,z)] < 0){
                                    box->z_re_box[HII_R_INDEX(x,y,z)] = redshift;
                                } else{
                                    box->z_re_box[HII_R_INDEX(x,y,z)] = previous_ionize_box->z_re_box[HII_R_INDEX(x,y,z)];
                                }

Other lines where previous_ionize_box is used requires either INHOMO_RECO or MINI_HALOS, the latter needs USE_TS_FLUCT.

qyx268 commented 3 years ago

Can I confirm with you @steven-murray that it is only z_re_box that is different between the two runs? If so, to fix it, I think we just change the lines above to

                                if ((flag_options->INHOMO_RECO or flag_options->USE_TS_FLUCT) and (previous_ionize_box->z_re_box[HII_R_INDEX(x,y,z)] > 0)){
                                    box->z_re_box[HII_R_INDEX(x,y,z)] = previous_ionize_box->z_re_box[HII_R_INDEX(x,y,z)];
                                } else{
                                    box->z_re_box[HII_R_INDEX(x,y,z)] = redshift;
                                }

However, this is a downside of this -- without these two flags, z_re_box would be inaccurate, which is always true for run_coeval though...

steven-murray commented 3 years ago

@qyx268 yes, as far as I can tell it's just the z_re_box that is different (and things dependent on it). My fear was that the solution is more accurate if the previous_ionize_box is passed in. In that case, we definitely want to pass it whenever we can, right? We just want to make sure that if it is used, it gets cached with a different hash than if it wasn't used.

qyx268 commented 3 years ago

THIS IS FROM #220 . reposting for easy reference....

There are a couple of different facets to this that need considering, and this all depends on desired behaviour.

But first, even running steps 2 and 3 independently is producing different results for the brightness temperature off the master branch (with an empty cache each time). Though, the difference is minor. However, these should be exact given it's the same computation. I don't know why that is happening, but it leads me to believe there must be something changing (possibly a memory leak or something somewhere). Ionisation fraction seems fine, it's just the brightness temperature.

Now, back to the question at hand.

Firstly, as far as I can tell this has only appeared as now in the integration tests all quantities are checked rather than just the brightness temperature. Meaning, the brightness temperature will be passing all checks. So the question is does it even make sense to be checking all quantities. I can see the value to it, but it only makes sense if the quantity is always stored.

Now, there is no harm in keeping the z_re_box even if INHOMO_RECO=False or USE_MINI_HALOS=False. What z_re_box does is store a cells ionisation redshift. This is perfectly reasonable to keep for any light-cone options. Just because it's not used in a calculation doesn't mean that it isn't a useful quantity to have access to. In that regard I'd vote against anything that just sets previous_ionize_box = False.

Importantly, z_re_box should not be compared between coeval and lightcone quantities (for the obvious reason that the ionisation redshifts will differ).

Now, given I am noticing differences in the outputs from various runs, I think that is going to make it hard to check what's happening with z_re_box.

I wonder if there is a way we can store some additional metadata into the stored files, so it knows when it finds an existing snapshot whether that was in a lightcone or co-eval context (so that it knows whether or not to trust these evolving quantities). Not sure if it already does this in the hash.

qyx268 commented 3 years ago

I think the intention was not to make a comparison between coeval and lightcone quantities. The issue rises when there was alreay a cached file from previous coeval boxes where z_re is different. So do we use or not use the cached files.

I agree if we still want to keep an accurate calculation of z_re in lightcone, we should reuse the previous_ionized boxes. Then we will need to rethink about how to do the hash key as @steven-murray mentioned.

qyx268 commented 3 years ago

But first, even running steps 2 and 3 independently is producing different results for the brightness temperature off the master branch (with an empty cache each time).

@BradGreig how large the difference are we talking about? Did you fix the seed?

steven-murray commented 3 years ago

OK, so for now, in #220, I've modified the produce_integration_test_data.py so that ONLY initial conditions are cached -- everything else is always produced from scratch for each test. This means that we don't get coeval boxes used in the lightcone for these tests, which helps everything to pass. But it doesn't solve the underlying problem (and maybe should be switched back to being cached anyway, because then the tests catch weird stuff like this).

If we think that using the previous_ionize_box is useful if possible at all, then @qyx268 is correct in that we should be adjusting our hash based on that information. This is the most sure way to guard against using a box that doesn't have exactly the same input. However, as I mentioned previously, this will be a little bit of plumbing to ensure that the input information is available at the time we create and require the hash.

If we think that it's only ever going to be z_re_box that would cause such an issue (I know that we require a previous_perturb_field and previous_spin_temp at times as well, so not sure if they'd also create differences like this -- they don't seem to in the tests), then we could just special-case this one box somehow. That would be easier to implement now, but I fear if we add another quantity that has a similar effect that we'd end up having to hack that in as well.

I think the unique thing about z_re_box is that it is used whether or not we evolve the box -- if we're not evolving it, it just gets a default value. I think other boxes just don't get used at all if we're not evolving.

BradGreig commented 3 years ago

Ok, a couple of things. That issue I was referring to isn't really an issue. It was caused by me not appreciating not passing a random seed can cause it to have different seeds (as they are random). Fixing the seed fixed that, but then the ionisation box is different. It's something to the 4-6th significant figure. So there is something causing this difference, but it isn't large enough to cause concern.

As to the matter at hand, fixing the tests to only used cached initial conditions doesn't prevent the user from messing up their results if this happens. So I don't know if this is a sufficient fix.

When you say some plumbing will be required, how much effort are we talking. Why can't we just pass an extra argument to the ionize_box and spin_temperature function that contains the previous redshift. This is a None type by default, but takes an argument containing the redshift of the previous ionize and spin temperature box. This, then should be easily dealt with (contained by) the creation of the unique hash (as this additional redshift argument will distinguish whether the box contains requisite information).

I may be misunderstanding how things work under the hood but that to me seems a fairly straight-forward solution.

steven-murray commented 3 years ago

@BradGreig -- ah OK no worries.

I agree, fixing this in the tests doesn't fix it for users -- we still have to deal with the problem.

Let me try to summarize what I mean by the plumbing. Right now, each output struct (subclass of OutputStruct) is defined as a class, eg:

class IonizationBox:
    def __init__(self, redshift, user_params, astro_params, etc., *, direc, other_high_level_flags_that_arent_needed_for_C):
        self.redshift=redshift
        self.user_params=user_params
        ...
    def hash(self):
        return <some string based on the attrributes of the class>

So, all parameters required to compute the struct are passed in at initialization, and the hash is based off those (notably, the redshift and various parameter structs). However, to actually compute the box requires a different method, compute, which itself takes in all of those parameters again (but typically we just pass the structs attributes back into it) plus a bunch of other boxes (eg. initial conditions, perturb field, previous ionize box). So, the hash itself is defined simply based on the initialization values, but the box is computed using other boxes as well.

Now, this would all be self-consistent if those other boxes were always completely defined by the input parameters to this box. This is almost always true -- the user_params for ionize_box has to be the same as those for perturb_field etc. However, we've found a case in which it's not true -- namely, the previous_ionize_box can be defined at different redshifts based on how you call it, and this is not reflected in any of the input parameters (to the initialization).

Another piece of information that is important in deciding what to do is the general structure of the medium-level functions. They have a structure like the following:

def ionize_box():
    box = IonizedBox(init_params)
    if box.check_if_exists_based_on_hash():
        box.read_from_cache()
        return box

    # Everything else only happens if the box doesn't exist
    init_box = initial_conditions()
    .... # get all the other input boxes

    return box.compute(input_params, input_boxes)

The main point to note here is that we use just the input params to initialize the "box", which we can then directly check prior existence based on its hash. That is, we can compute that hash before we ever have to compute/read any of the other boxes, which is the fastest way to do things. But this also means that the other input boxes can't contribute to the hash.

So, with that, here's a list of potential solutions (that I can currently think of):

  1. Initialize IonizedBox (and other relevant structs) with the redshifts of any of the previous boxes (ionize box, perturb field). Probably this is enough, since all the other parameters of those boxes are the same as the current box. This would mean we'd have to move up the calculation of the previous redshift of those boxes, but this is cheap. The downside here is that it's unclear to me that this will be the only other parameter we'll ever need to capture, at which point we'd have to rethink this process.
  2. initialize IonizedBox with all of the parameters required for its computation (i.e. all the other boxes as well), so we can compute the hash based on that. This is extremely general, and so will Never Break Again(TM), but it's a lot more work, and it's not clear to me how to keep the efficiency in the functions, because we'd need to compute the dependencies BEFORE checking the hash. We could do something like find the dependent boxes but NOT read them in (just their metadata), so we can compute their hashes quickly. Then after computing the hash of the current box (which would now be dependent on those hashes) and checking for its existence, if it doesn't exist, we'd actually read in the data from the other boxes. But again, this is a fair bit of work and a bit tricky.

Given its simplicity, I'm leaning towards (1), but I do find it an attractive option to be more rigorous in our caching and future-proof. Perhaps we should re-consider option (2) in the future (perhaps in v4).

BradGreig commented 3 years ago

Hey @steven-murray, thanks for the detailed explanation.

Yes, option (1) is exactly what I was proposing, owing to it's simplicity. IonizedBox and TsBox are the only two that should have a dependency on previous quantities. Thus, I think implementing a previous redshift into the hash should make it robust to most possible issues. I say "most" here as it is theoretically conceivable to still have problems, but I think it would have to be a lot more contrived to cause problems. It would be robust in the general sense. But, it would clearly distinguish instances where it was generated using previous information or as a stand alone, which is the main problem.

Of course, anything that it fully general and robust is preferable (i.e. (2)), however, likely contains many nuances that would require much time to figure out all the details. So I agree it may not be worth it until later. Or, perhaps (1) ends up being sufficient that we don't need any further modifications.

qyx268 commented 3 years ago

agreed

steven-murray commented 3 years ago

@BradGreig and @qyx268 -- great, thanks for your input! I'll try working on that tomorrow in #220.

steven-murray commented 3 years ago

Just a couple of notes from implementing this:

  1. I found a small extra bug in spin_temperature. The previous spin temp is always evaluated at a maximum of Z_HEAT_MAX, which may actually be less than the current redshift (we never check if the main redshift of evaluation is smaller than Z_HEAT_MAX). Not sure if this will be a problem at all, but I thought I should mention it.
  2. It turns out just passing in the redshift of the previous spin/ionize box is NOT enough to fully define the box, because of second-order dependencies (i.e. what was the previous redshift of that box?). I have in my mind a fairly generic way of dealing with this, but it will have to wait til v4. I think at least that having the first-order dependence captured will deal with most of the situations.
BradGreig commented 3 years ago

Hey @steven-murray, I'm not sure I entirely follow the first point. When is this occurring? And in what piece of code/function? I was trying to follow your argument, but not sure I get when this may occur. Maybe you can provide the conditions for which this occurs?

For the second point, yeah, it was never going to be fool proof. However, I think the likelihood of encountering this should be extremely low. I find it hard to envisage a scenario when it grabs a box from cache (using current and previous redshift) that would have been connected to a previous box that would have a combination of redshifts that is not compatible with the original combination. This comes from the log scrolling, it'll be hard to have weird combinations as a result. Not impossible, but would have to be contrived.

steven-murray commented 3 years ago

Note that the first point above (about Z_HEAT_MAX) has been split out into its own issue: #245