lea-hagen opened this issue 4 years ago (status: Open)
I am confused. Why two separate files - 1D PDF and lnp? Aren't we saving lnp values in 1D PDF files for all parameters?
The 1D PDF files only have the marginalized probabilities for each parameter. The lnp files have 500 probabilities from the full set of lnp values - we'd save the lnp values for every point in the model grid, except that would take up too much disk space, and at some point we worked out that 500 was a reasonable number.
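A toy illustration of the distinction (a sketch with made-up arrays, not the actual BEAST file layout): the 1D PDF is the probability marginalized over all other parameters, while the lnp file just keeps a subset of the raw lnP values together with their grid indices.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy model grid: 6 values of parameter A x 5 values of parameter B,
# with an (unnormalized) lnP for every grid point.
n_a, n_b = 6, 5
lnp = rng.normal(loc=-5.0, scale=2.0, size=(n_a, n_b))

# 1D PDF for parameter A: sum probabilities over B (marginalize).
prob = np.exp(lnp - lnp.max())  # stabilize before exponentiating
pdf1d_a = prob.sum(axis=1)
pdf1d_a /= pdf1d_a.sum()        # normalized marginal distribution for A

# Sparse "lnp file" content: keep N of the raw lnP values plus their
# flat grid indices (BEAST keeps 500; use 10 here for the toy grid).
n_keep = 10
flat = lnp.ravel()
saved_indices = rng.choice(flat.size, size=n_keep, replace=False)
saved_lnp = flat[saved_indices]
```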
Ah, that's new to me. Thanks for the clarification!
@drvdputt, have you thought about what merging the lnp files across subgrids would entail? I don't want to rethink it if you've already done so.
I thought about this briefly when implementing the merge tool, and found that we cannot merge the lnp files as-is. The problem is that, in their current state, the lnp files store a list of indices to points in the grid. But when the grid is split into subgrids, those indices are no longer unique (each index is just the position in the data as loaded from file).
So first you would need a way around that problem. Maybe store a subgrid number and an index in the final result? Or give each grid point some other unique identifier?
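One way around the collision (a sketch; the sizes and function names are hypothetical, not BEAST API): keep the pair (subgrid number, local index), which is unique by construction, and flatten it into a single global index once the subgrid sizes are known.

```python
import numpy as np

# Hypothetical sizes: number of model points in each subgrid file.
subgrid_sizes = np.array([1000, 1200, 800])

# Offset of each subgrid's first point in the (virtual) full grid.
offsets = np.concatenate(([0], np.cumsum(subgrid_sizes)[:-1]))

def to_global_index(subgrid_num, local_index):
    """Map (subgrid number, index within subgrid) to a unique global index."""
    return offsets[subgrid_num] + local_index

def to_subgrid_index(global_index):
    """Invert the mapping: recover (subgrid number, local index)."""
    subgrid_num = np.searchsorted(
        np.cumsum(subgrid_sizes), global_index, side="right"
    )
    return subgrid_num, global_index - offsets[subgrid_num]
```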
About the merge itself I also did some thinking (assuming the above problem is solved). It cannot be guaranteed that the highest lnP values all fall in the same subgrid. Maybe we can simply take the 500 highest lnP values over all the subgrids. The caveat is that lnP is not normalized in the same way across subgrids; I already discussed this in #189. So you should take into account the total weight of each subgrid, `total_log_norm`, which you can get from the (sub)stats file (this is `log(sum(exp(lnps)))`).
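A sketch of what such a merge could look like (hypothetical arrays and function names, not the actual tool): take the global top N after concatenating the subgrids' values, and combine the per-subgrid norms with a logsumexp so the merged values can still be normalized consistently. Here the norms are computed from the toy arrays; in practice `total_log_norm` would come from the (sub)stats files, computed over the full subgrid rather than just the saved values.

```python
import numpy as np

def _logsumexp(x):
    """Numerically stable log(sum(exp(x)))."""
    m = np.max(x)
    return m + np.log(np.sum(np.exp(x - m)))

def merge_lnp(lnp_per_subgrid, n_keep=500):
    """Merge sparsely sampled lnP arrays from several subgrids.

    lnp_per_subgrid: list of 1D arrays of raw (unnormalized) lnP
    values, one per subgrid.  Returns the global top n_keep values,
    which subgrid each came from, and the combined total_log_norm.
    """
    all_lnp = np.concatenate(lnp_per_subgrid)
    subgrid_of = np.concatenate(
        [np.full(len(a), i) for i, a in enumerate(lnp_per_subgrid)]
    )
    # Combined norm: log(sum over subgrids of exp(total_log_norm_i)).
    total_log_norm = _logsumexp(
        np.array([_logsumexp(a) for a in lnp_per_subgrid])
    )
    top = np.argsort(all_lnp)[::-1][:n_keep]
    return all_lnp[top], subgrid_of[top], total_log_norm
```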
> Maybe store a subgrid number and an index in the final result?
Yes, this is what I was thinking of doing. Each lnp file has the subgrid number in the filename, so this seems like the easiest option.
> Maybe we can simply take the 500 highest lnP values over all the subgrids.
There was discussion in #121 (and in the referenced group meeting) about how to choose the sparsely sampled lnP values, and we figured out that it should be random, rather than weighted towards the best fits. But I'm not sure how to implement that in the context of subgrids, where each subgrid probes a different total volume of parameter space. Perhaps use `total_log_norm` as the weight?
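One hypothetical way to keep the sampling random while respecting each subgrid's weight (names illustrative, not BEAST API): allocate the 500 slots across subgrids in proportion to each subgrid's total probability mass `exp(total_log_norm)`, then draw uniformly at random within each subgrid.

```python
import numpy as np

def allocate_samples(total_log_norms, n_total=500, rng=None):
    """Split n_total random-lnP slots across subgrids, weighting each
    subgrid by its total probability mass exp(total_log_norm)."""
    rng = np.random.default_rng() if rng is None else rng
    tln = np.asarray(total_log_norms, dtype=float)
    # Work relative to the max to avoid overflow in exp().
    w = np.exp(tln - tln.max())
    w /= w.sum()
    # Multinomial draw: per-subgrid counts sum exactly to n_total.
    return rng.multinomial(n_total, w)
```

A subgrid whose total weight is negligible then contributes few or no samples, while equally weighted subgrids share the 500 slots roughly evenly.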
> The caveat is that lnP is not normalized in the same way [...] you should take into account the total weight of each subgrid
I haven't been able to figure this part out. Digging through `fit.py`, there's only one place I see the `lnp` value getting edited before it's saved, and that's to apply the priors. Could you clarify?
```python
# [line 501]
(lnp, chi2) = N_logLikelihood_NM(
    sed,
    model_seds_with_bias,
    ast_ivar,
    mask=cur_mask,
    lnp_threshold=abs(threshold),
)
# ...
# [line 512]
lnp += g0_weights  # multiply by the prior weights (sum in log space)
# ...
# [line 545]
save_lnp_vals.append(
    [
        e,
        np.array(g0_indxs[rindx], dtype=np.int64),
        np.array(lnp[rindx], dtype=np.float32),
        np.array(chi2[rindx], dtype=np.float32),
        np.array([sed]).T,
    ]
)
# ...
# [line 627/656]
save_lnp(lnp_outname, save_lnp_vals, resume)
```
Or we could just store 500 × (number of subgrids) lnp values in the merged files. Shouldn't be too much space, and this way we can put off the decision. And we would have more lnp values, which is never a bad thing. :-)
True! But perhaps trim them a bit, using the same threshold as in the fitting (I think -10?) to remove the lowest probability ones, in case any of the subgrids are entirely terrible fits.
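The trimming could be as simple as the sketch below (variable names assumed; the -10 threshold is the value suggested above): drop saved values more than `|threshold|` below the global best, mirroring the `lnp_threshold` used during fitting.

```python
import numpy as np

def trim_lnp(lnp, indices, threshold=-10.0):
    """Keep only lnP values within |threshold| of the global maximum.

    lnp, indices: parallel arrays of saved lnP values and grid
    indices, already concatenated over all subgrids.
    """
    keep = lnp >= lnp.max() + threshold  # threshold is negative
    return lnp[keep], indices[keep]
```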
@lea-hagen, I figured it out again. Indeed, `lnp` is only modified by `g0_weights` in the line you referenced.
What I was worried about was this:
```python
log_norm = lnps.max()
weights = np.exp(lnps - log_norm)
```
But this only matters for the stats / 1dpdf files, not the lnp files. So, on the condition that `do_not_normalize` for the prior weights was set to `True`, there should be no problem in interpreting the lnp values as-is.
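A toy check of that normalization point (hypothetical lnP values, not from a real run): subtracting a per-subgrid `log_norm` shifts each subgrid by a different constant, so cross-subgrid comparisons are only meaningful on the raw lnP values that end up in the lnp files.

```python
import numpy as np

# Raw lnP values from two hypothetical subgrids.
lnp_a = np.array([-2.0, -4.0])
lnp_b = np.array([-3.0, -9.0])

# Raw values are directly comparable: the best fit overall is in A.
best_overall_in_a = lnp_a.max() > lnp_b.max()

# Per-subgrid normalization (as done internally for the stats/1dpdf
# step) shifts each subgrid by its own constant ...
norm_a = lnp_a - lnp_a.max()  # [ 0., -2.]
norm_b = lnp_b - lnp_b.max()  # [ 0., -6.]

# ... after which both subgrids' best points look identical (0.0),
# and the cross-subgrid ranking is lost.
```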
I just noticed that this isn't linked to #493, which implemented merging lnPs across subgrids.
We seem to be missing some merging tools. Here's a summary of what we have, which are called as needed in the `beast.tools.run.merge_files` wrapper:

- `merge_pdf1d_stats` (`beast/tools/subgridding_tools`)
- `merge_stats_files` (`beast/tools/merge_beast_stats`)
- `merge_pdf1d_stats` (`beast/tools/subgridding_tools`)

I suspect the latter two aren't fully merged because they were intended to be sorted into spatial regions for the megabeast (`reorder_beast_results_spatial` and `condense_beast_results_spatial`). But we now have use cases (e.g., #373) where having them all together may be useful.