XENON1T / hax

Handy Analysis for XENON (reduce processed data)

Force df type match during minitree load #236

Closed JelleAalbers closed 6 years ago

JelleAalbers commented 6 years ago

This is a workaround for #232: it forces all minitrees into the same format during loading, set by the first dataset given to the load command.

The columns and data types a treemaker produces are determined dynamically from its output. Thus, in rare situations, some minitrees inevitably end up with a different set or type of columns. There is also the mysterious 'index' column, which appears under circumstances unknown (to me) in pandas and can find its way into a minitree.
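One way such a spurious 'index' column can arise (a plausible mechanism, not confirmed as the cause here): calling `reset_index()` on a dataframe with an unnamed index, e.g. after row filtering, materializes the old integer index as a new column literally named 'index':

```python
import pandas as pd

df = pd.DataFrame({"s1": [10.0, 20.0]})

# Filtering and then resetting the index (a common cleanup step)
# turns the old unnamed integer index into an 'index' column:
df2 = df[df["s1"] > 5].reset_index()
print(df2.columns.tolist())  # ['index', 's1']
```

Using `reset_index(drop=True)` instead discards the old index and avoids the extra column.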

When either of these happens, current versions of dask throw the 'metadata mismatch' error. Since the errors started to appear recently, I assume previous versions did something else (perhaps pathological).

After this PR, if you load 1000 minitrees and number 523 has one of the defects above, it will be adapted to fit the format of the first minitree. Missing columns are filled with NaNs (INT_NAN from pax in the case of integer columns), and extra columns are deleted. Type mismatches are resolved as best as we can too.
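The adaptation described above can be sketched roughly as follows. This is a hypothetical illustration, not the actual hax code; `INT_NAN` stands in for pax's integer-NaN sentinel, and `force_format` is an invented name:

```python
import numpy as np
import pandas as pd

INT_NAN = -99999  # placeholder; pax defines its own integer-NaN sentinel

def force_format(df, reference):
    """Coerce df to the columns and dtypes of a reference dataframe.

    Sketch of the loading-time workaround: extra columns are dropped,
    missing columns are filled (NaN for floats, a sentinel for integer
    columns), and dtypes are cast column by column.
    """
    out = pd.DataFrame(index=df.index)
    for col, dtype in reference.dtypes.items():
        if col in df.columns:
            # Resolve type mismatches by casting to the reference dtype
            out[col] = df[col].astype(dtype)
        elif np.issubdtype(dtype, np.integer):
            # Integer columns cannot hold NaN, so fill with the sentinel
            out[col] = np.full(len(df), INT_NAN, dtype=dtype)
        else:
            out[col] = np.nan
    return out
```

Since this operates purely on columns, the number of rows (events) in each minitree is unchanged.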

If the first minitree has e.g. a spurious index column (the most common problem reported in #232), you would now get an extra 'index' column for the entire set of runs you're loading (but for most runs it will be set to NaN).

This is a workaround: a proper fix would be to change the minitree format to require a full specification of the data type of every minitree, and enforce it in hax. This is the approach of strax.

JelleAalbers commented 6 years ago

Yes, you can reproduce the problem (#232) with

import hax
hax.init(minitree_paths=['/project/lgrandi/xenon1t/minitrees/pax_v6.8.0',
                         '/project2/lgrandi/xenon1t/minitrees/pax_v6.8.0'],
         pax_version_policy='6.8.0')
df = hax.minitrees.load([6931, 6932])

and verify it is fixed in this branch. (thanks to Miguel in https://github.com/XENON1T/hax/issues/232#issuecomment-408857357 for listing the problematic run numbers)

feigaodm commented 6 years ago

@JelleAalbers Does it change total number of good events in the whole SR1 dataset? It would be interesting to see if some good events are removed by this. But it looks fine to me.

JelleAalbers commented 6 years ago

The datatype forcing operates column by column, and outputs the same number of rows. So in principle no events get removed.

However, I don't know what dask's behaviour with these mismatching minitrees was before errors started being generated (assuming the mismatching minitrees were there before). Maybe it added zeros where it should have added NaNs, and then through the cuts, that changes something. Or maybe something really weird happened before, it's hard to say. I hope we're still keeping the old hax and dask that were used for SR1 in the frozen 6.8.0 environment, so we can always compare.

feigaodm commented 6 years ago

@JelleAalbers Thanks! From your explanation I'm confident enough that this won't be a big issue, but I agree a backup is wise. I would make a copy of the minitrees to a folder like pax_v6.8.0_hax_v2.4.0. Do you agree @pdeperio ?

JelleAalbers commented 6 years ago

Sorry if I wasn't clear: this won't change any minitrees you already made, or even minitrees you make in the future. It only changes how they're loaded, specifically how minitrees from different runs are combined. You probably do want to back up the cache file / h5 / csv you got for the final dataset, but I assume you already have that.

feigaodm commented 6 years ago

@JelleAalbers Yes, I know this detail. Maybe I was trying to make it more general: in principle we should make the copy anyway. I think we should release hax_v2.5.0 after this PR is merged.

JelleAalbers commented 6 years ago

OK, sounds good!

pdeperio commented 6 years ago

OK I've started the copy to dali-login.rcc.uchicago.edu:

cp -r /project*/lgrandi/xenon1t/minitrees/pax_v6.8.0 /dali/lgrandi/xenon1t/minitrees/.

For reference, the data post-lax is stored here and the README contained therein.