dfujim / bILT

Inverse Laplace Transform objects, optimized for BNMR data
GNU General Public License v3.0

Normalization of p #2

Closed dfujim closed 4 years ago

dfujim commented 4 years ago

We currently normalize by the sum of p. However, we should be normalizing by the integral, which is not the same as the sum: since the bins are logarithmically spaced, the weight each element of p contributes to the integral is different.

rmlmcfadden commented 4 years ago

I follow your concern - I'll need to think on it some more. While p(λ) ought to be a histogram, that isn't 1:1 compatible with how the problem is tackled.

Pragmatically, the chosen λs are not really bins - just discrete points sampled at an arbitrary rate over an arbitrary range. For sufficiently high sampling, shouldn't the discrete probability vector p(λ) just become the continuous probability distribution p(λ)? In that limit, any difference between summation/integration should be moot. I think this suggests that using simple summation to invoke the normalization condition isn't the problem so much as too coarsely approximating p(λ).

Thinking about this a little more, this also implies that, in practice, the best one can do is be self-consistent (e.g., by fixing the range/sampling of λ) when comparing the distributions obtained for different spectra.

In any case, I agree with your comment in the draft that it is best to not even mention the normalization of p(λ)!

dfujim commented 4 years ago

It is probably important that we get this right, so that if someone wanted to fit the distribution with a function (say a KWW distribution), they'd be able to do so without distortion. The probability should be normalized in the following way in the continuous limit:

$\int p(\lambda) \, d\lambda = 1$

Discretized, this should be

$\sum_i p(\lambda_i) \, \Delta\lambda_i = 1$

where Δλ_i is the spacing between the sampled points. The probability density needs to have units of 1/λ. So I see where you're coming from, but I think this means we need to account for the spacing. In the continuous limit the spacing doesn't matter so much because it's so small, or you'd do a transformation of variables so that the spacing was handled properly.
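As a sanity check, the discrete condition above can be written in a few lines of numpy. This is a standalone sketch with a made-up p on a log-spaced grid, not bILT code:

```python
import numpy as np

# Log-spaced grid of lambda values and an arbitrary (unnormalized) p.
lam = np.logspace(-2, 2, 500)
p = np.exp(-0.5 * np.log(lam) ** 2)

# Local spacing Delta lambda_i, estimated with central differences.
dlam = np.gradient(lam)

# Normalize so that sum_i p(lambda_i) * Delta lambda_i = 1.
p_density = p / np.sum(p * dlam)  # has units of 1/lambda

print(np.sum(p_density * dlam))  # 1.0 (to machine precision)
```

Dividing by the plain sum instead would leave the result dependent on the grid density and range.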

rmlmcfadden commented 4 years ago

Yes, you're right - you've convinced me. We do need to account for the spacing. I'll start thinking on how to implement this. Any feeling on the "binning" convention to use (i.e., taking each sampled λ as the high/low bin edge vs. the centre)?

dfujim commented 4 years ago

My feeling is to take them as the centers. Taking them as the edges feels too... asymmetric to me. You could define the bin edges as the bisection points between the sampled λ and get the widths from there.

EDIT: now that I'm trying the problem, I don't think this is the solution. I immediately ran into two issues when trying to figure out the bin widths: (a) the bins are asymmetric, that is, the point we've measured is not at the center of its bin; and (b) how does one determine the size of the final bin, unless it's a user input? (which is not really what we want since, as you pointed out, it's not really binned anyway)
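Issue (a) is easy to see numerically. In this standalone numpy sketch (not bILT code), bisection-point edges put each log-spaced sample below the centre of its bin:

```python
import numpy as np

# A few log-spaced sample points.
lam = np.logspace(0, 2, 5)           # [1, ~3.16, 10, ~31.6, 100]

# Bin edges at the bisection points between neighbouring samples.
edges = 0.5 * (lam[:-1] + lam[1:])   # 4 interior edges

# Centres of the bins bounded by those edges (interior samples only).
centres = 0.5 * (edges[:-1] + edges[1:])

print(lam[1:-1])  # the samples
print(centres)    # bin centres: systematically larger than the samples
```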

By my thinking, our main problem is that p(λ) is skewed by the spacing of λ. Please correct me if I'm wrong: because the points at large λ are spaced further apart than those at small λ, the weight assigned to the large-λ points is too large, in proportion to the spacing. If p(λ) were continuous, this weight would be spread among the nearby values of λ, but because it's discrete, the value is inflated. Therefore, the probability density at large λ must be smaller than the equivalent point in p(λ)/sum(p(λ)). I think the solution is to define the following transformation of variables:

$y = \ln(\lambda)$

where, because λ has log spacing, y has linear spacing. The spacings are related by:

$dy = \frac{1}{\lambda} \, d\lambda$

We can then use this to define a new probability distribution

$p'(\lambda) = p(\lambda)/\lambda$

such that integrating the new distribution over the log-spaced λ is equivalent to integrating the original over the linearly spaced y:

$\int p'(\lambda) \, d\lambda = \int p(\lambda) \, dy$

I then postulate that p'(λ) should not have the skewing present in p(λ). The normalization sum( p'(λ) ) also makes sense, I think: p'(λ)/sum( p'(λ) ) should be the probability density.

This only works for log-spaced λ, so we should probably leave the normalization to the user if they define their own bin spacing.
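The change of variables can be checked numerically. This standalone numpy sketch (with a made-up p, not bILT code) confirms that, for a log-spaced grid, integrating p'(λ) = p(λ)/λ over λ matches summing p(λ_i) with the constant spacing dy:

```python
import numpy as np

# Log-spaced grid: y = ln(lambda) is linearly spaced with constant dy.
lam = np.logspace(-2, 2, 2000)
dy = np.log(lam[1]) - np.log(lam[0])

# Arbitrary test distribution sampled on the grid.
p = np.exp(-0.5 * np.log(lam) ** 2)

# Left side: integral of p'(lambda) = p(lambda)/lambda over lambda,
# evaluated with the trapezoid rule on the nonuniform grid.
f = p / lam
lhs = np.sum(0.5 * (f[1:] + f[:-1]) * np.diff(lam))

# Right side: plain sum of p over the grid, weighted by the constant dy.
rhs = np.sum(p) * dy

print(lhs, rhs)  # the two agree closely
```

For any other (user-defined) spacing of λ, dy is no longer constant and this shortcut breaks, which is why the normalization would have to be left to the user in that case.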