BEAST-Fitting / beast

Bayesian Extinction And Stellar Tool
http://beast.readthedocs.io

merging BEAST output files #466

Open lea-hagen opened 4 years ago

lea-hagen commented 4 years ago

We seem to be missing some merging tools. Here's a summary of what we have; these are called as needed in the `beast.tools.run.merge_files` wrapper:

| File | Merge across subgrids | Merge across source density or background |
| --- | --- | --- |
| Stats | `merge_pdf1d_stats` (`beast/tools/subgridding_tools`) | `merge_stats_files` (`beast/tools/merge_beast_stats`) |
| 1D PDF | `merge_pdf1d_stats` (`beast/tools/subgridding_tools`) | none |
| lnp (sparsely sampled log likelihood) | none | none |

I suspect the latter two aren't fully merged because they were intended to be sorted into spatial regions for the megabeast (reorder_beast_results_spatial and condense_beast_results_spatial). But we now have use cases (e.g., #373) where having them all together may be useful.

@drvdputt, have you thought about what merging the lnp files across subgrids would entail? I don't want to rethink it if you've already done so.

galaxyumi commented 4 years ago

I am confused. Why two separate files - 1D PDF and lnp? Aren't we saving lnp values in 1D PDF files for all parameters?

lea-hagen commented 4 years ago

The 1D PDF files only have the marginalized probabilities for each parameter. The lnp files have 500 values sparsely sampled from the full set of lnp values. We'd ideally save the lnp values for every point in the model grid, except that would take up too much disk space, and at some point we worked out that 500 was a reasonable number.
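
As an illustration of the distinction (stand-in arrays, not the actual BEAST grid or file format): a 1D PDF marginalizes the full-grid probabilities onto one parameter, while an lnp file keeps a sparse sample of the full-grid lnP values.

```python
# Illustrative sketch only: stand-in arrays, not BEAST code.
import numpy as np

rng = np.random.default_rng(0)
lnp_full = rng.normal(size=10000)      # stand-in for lnP over the whole model grid
param = rng.uniform(0, 1, size=10000)  # stand-in for one model parameter

# 1D PDF: full-grid probabilities marginalized onto bins of one parameter
prob = np.exp(lnp_full - lnp_full.max())
pdf_1d, bin_edges = np.histogram(param, bins=50, weights=prob, density=True)

# lnp file: a sparse sample of 500 of the full-grid lnP values (with their grid indices)
keep = rng.choice(lnp_full.size, size=500, replace=False)
sparse_indices, sparse_lnp = keep, lnp_full[keep]
```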

galaxyumi commented 4 years ago

Ah, that's new to me. Thanks for the clarification!

drvdputt commented 4 years ago

> @drvdputt, have you thought about what merging the lnp files across subgrids would entail? I don't want to rethink it if you've already done so.

I thought about this briefly when implementing the merge tool, and found that we cannot merge the lnp files as-is. The problem is that, in their current state, the lnp files store a list of indices to points in the grid. But when the grid is split into subgrids, those indices are no longer unique (the index is just the position in the data as loaded from file).

So first you would need to find a way around that problem. Maybe store a subgrid number and an index in the final result? Or give each grid point another unique identifier of some sort?
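
A minimal sketch of the first option, assuming we simply write a (subgrid number, local index) pair for each saved grid point; the names here are illustrative, not the actual BEAST output format:

```python
import numpy as np

def tag_with_subgrid(subgrid_number, local_indices):
    """Pair each locally saved grid index with its subgrid number so the
    reference stays unique after the subgrid outputs are merged."""
    local_indices = np.asarray(local_indices, dtype=np.int64)
    subgrid_col = np.full(local_indices.shape, subgrid_number, dtype=np.int64)
    return np.column_stack([subgrid_col, local_indices])

# e.g. indices saved for one star from (hypothetical) subgrid 3
tagged = tag_with_subgrid(3, [10, 42, 977])
```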

I also did some thinking about the merge itself (assuming the above problem is solved). It cannot be guaranteed that the highest lnP values all fall in the same subgrid. Maybe we can simply take the 500 highest lnP values over all the subgrids. The caveat is that lnP is not normalized in the same way; I already discussed this in #189. So you should take into account the total weight of each subgrid, `total_log_norm`, which you can get from the (sub)stats file (this is log(sum(exp(lnps)))).
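
A rough sketch of that bookkeeping, assuming each subgrid hands over its saved lnP values; the names are illustrative, and how exactly `total_log_norm` should then be applied is left open here, as it is in the comment above:

```python
import numpy as np
from scipy.special import logsumexp

def pool_top_lnps(lnps_per_subgrid, n_keep=500):
    """Pool lnP values from all subgrids, keep the n_keep highest, and report
    each subgrid's total weight, total_log_norm = log(sum(exp(lnps)))."""
    total_log_norm = np.array([logsumexp(lnps) for lnps in lnps_per_subgrid])
    subgrid_id = np.concatenate(
        [np.full(len(lnps), i) for i, lnps in enumerate(lnps_per_subgrid)]
    )
    pooled = np.concatenate([np.asarray(lnps) for lnps in lnps_per_subgrid])
    top = np.argsort(pooled)[-n_keep:]  # indices of the n_keep largest lnP values
    return pooled[top], subgrid_id[top], total_log_norm
```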

lea-hagen commented 4 years ago

> Maybe store a subgrid number and an index in the final result?

Yes, this is what I was thinking of doing. Each lnp file has the subgrid number in the filename, so this seems like the easiest option.

> Maybe we can simply take the 500 highest lnP values over all the subgrids.

There was discussion in #121 (and in the referenced group meeting) about how to choose the sparsely sampled lnP values, and we figured out that it should be random, rather than weighted towards the best fits. But I'm not sure how to implement that in the context of subgrids, where each subgrid probes a different total volume of parameter space. Perhaps use `total_log_norm` as the weight?
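
One possible way to combine the two ideas (random sampling, weighted per subgrid by `total_log_norm`) might look like the sketch below; this is a guess at a scheme, not anything from the BEAST code:

```python
import numpy as np

def sample_lnps_across_subgrids(lnps_per_subgrid, total_log_norms, n_samples=500, rng=None):
    """Draw n_samples saved lnP values: pick the subgrid for each draw with
    probability proportional to exp(total_log_norm), then a random point within it."""
    rng = np.random.default_rng() if rng is None else rng
    log_w = np.asarray(total_log_norms, dtype=float)
    prob = np.exp(log_w - log_w.max())
    prob /= prob.sum()
    picks = []
    for sg in rng.choice(len(lnps_per_subgrid), size=n_samples, p=prob):
        picks.append((sg, rng.integers(len(lnps_per_subgrid[sg]))))
    return picks  # list of (subgrid number, local index) pairs
```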

> The caveat is that lnP is not normalized in the same way [...] you should take into account the total weight of each subgrid

I haven't been able to figure this part out. Digging through fit.py, there's only one place I see the lnp value getting edited before it's saved, and that's to apply the priors. Could you clarify?

[line 501]

```python
(lnp, chi2) = N_logLikelihood_NM(
    sed,
    model_seds_with_bias,
    ast_ivar,
    mask=cur_mask,
    lnp_threshold=abs(threshold),
)
```

...

[line 512]

```python
lnp += g0_weights  # multiply by the prior weights (sum in log space)
```

...

[line 545]

```python
save_lnp_vals.append(
    [
        e,
        np.array(g0_indxs[rindx], dtype=np.int64),
        np.array(lnp[rindx], dtype=np.float32),
        np.array(chi2[rindx], dtype=np.float32),
        np.array([sed]).T,
    ]
)
```

...

[line 627/656]

```python
save_lnp(lnp_outname, save_lnp_vals, resume)
```

karllark commented 4 years ago

Or we could just store 500 × (number of subgrids) lnp values in the merged files. That shouldn't be too much space, and this way we can put off the decision. And we would have more lnp values, which is never a bad problem. :-)
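
For what it's worth, that option is just a concatenation; a sketch with illustrative names:

```python
import numpy as np

def merge_all_lnps(lnps_per_subgrid):
    """Keep every saved lnP value from every subgrid in the merged output."""
    return np.concatenate([np.asarray(lnps) for lnps in lnps_per_subgrid])
```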

lea-hagen commented 4 years ago

True! But perhaps trim them a bit, using the same threshold as in the fitting (I think -10?) to remove the lowest probability ones, in case any of the subgrids are entirely terrible fits.
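
A sketch of that trimming, assuming the threshold is interpreted the same way as in the fitting (values more than |threshold| below the best lnP are dropped); the -10 value is the one guessed at above:

```python
import numpy as np

def trim_lnps(lnps, threshold=-10.0):
    """Keep only lnP values within |threshold| of the maximum."""
    lnps = np.asarray(lnps)
    return lnps[lnps > lnps.max() + threshold]
```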

drvdputt commented 4 years ago

@lea-hagen, I figured it out again. Indeed, lnp is only modified by `g0_weights` in the line you referenced. What I was worried about was this:

```python
log_norm = lnps.max()
weights = np.exp(lnps - log_norm)
```

But this only matters for the stats / 1D PDF files, not the lnp files. So, provided that `do_not_normalize` was set to True for the prior weights, there should be no problem in interpreting the lnp values as-is.
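
A tiny check (illustrative numbers) of why that max-subtraction is harmless for the saved lnp values: it only rescales the weights used for the stats / 1D PDF step by a constant factor, and the lnps themselves are written out unchanged.

```python
import numpy as np

lnps = np.array([-3.0, -1.0, -0.5])
log_norm = lnps.max()
weights = np.exp(lnps - log_norm)  # what the stats / 1D PDF machinery uses

# the relative weights equal np.exp(lnps) up to the constant factor exp(log_norm)
assert np.allclose(weights * np.exp(log_norm), np.exp(lnps))
```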

lea-hagen commented 4 years ago

I just noticed that this isn't linked to #493, which implemented merging lnPs across subgrids.