SasView / sasdata

Package for loading and handling SAS data
BSD 3-Clause "New" or "Revised" License

Completely general binning system with fractional binning #54

Closed: lucas-wilkins closed this 1 month ago

lucas-wilkins commented 1 year ago

I've completed the basic machinery needed to do fractional binning: it can take any kind of input binning and convert it to any kind of output binning. I have only implemented order -1 (the existing rebinning system) and order 0 (fractional binning) interpolation so far, but there is room for higher orders in the future; most of the hard work is done.

As there is apparently some controversy over whether this is useful, I thought I would hold off on actually implementing all the new slicers this way until at least @butlerpd can see the results below. As you can see, the order-0 ("fractional binning") approach is much more stable, though, as things are completely general, it's a bit slower. The main cost of computation for orders >= 0 is working out the mapping between input and output bins. I have implemented various caching optimisations which mean that these calculations only need to be performed once, so applying a slicer to the same input/output mapping will be relatively fast.
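
To give a flavour of how the order-0 weights are built, here is a simplified 1D sketch (illustration only, not the actual sasdata implementation, which also handles 2D regions, arbitrary binnings and higher orders):

import numpy as np

def overlap_matrix(in_edges, out_edges):
    """Fractional overlap of each 1D input bin with each output bin (hypothetical helper).

    Entry [i, j] is the fraction of input bin j that falls inside output bin i,
    assuming counts are spread uniformly over each input bin (the order-0 assumption).
    """
    in_lo, in_hi = in_edges[:-1], in_edges[1:]
    out_lo, out_hi = out_edges[:-1], out_edges[1:]
    # pairwise overlap length between output bin i and input bin j
    overlap = (np.minimum(out_hi[:, None], in_hi[None, :])
               - np.maximum(out_lo[:, None], in_lo[None, :]))
    overlap = np.clip(overlap, 0.0, None)
    return overlap / (in_hi - in_lo)

# The weight matrix depends only on the two binnings, so it can be cached and
# reused for every dataset that shares the same input/output geometry.
in_edges = np.linspace(0, 1, 11)   # 10 input bins
out_edges = np.linspace(0, 1, 4)   # 3 output bins
W = overlap_matrix(in_edges, out_edges)

counts = np.random.poisson(100, size=10)
rebinned = W @ counts  # order-0 ("fractional") region sum; order -1 would, roughly speaking, assign whole bins only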

slicer_demo.py contains the code used to generate these plots...

Here's a demo of the region sum applied to data sampled at random points

As you can see, the output of order 0 is pretty smooth compared with order -1.

Data: [image]

Example regions (output bin): [image]

Output for order -1 and order 0, value against bin size: [image]

Here's a demo of the region average for data sampled on a grid

As you can see, the output of order 0 is much smoother than order -1. By symmetry, we know that the ideal result would be a constant value.

Data: [image]

Example regions (output bin): [image]

Output for order -1 and order 0, value against bin size (should read average, but I forgot to change the title): [image]

RichardWaiteSTFC commented 12 months ago

Just a couple of comments about binning - I apologise in advance if this isn't relevant, isn't of use, or is indeed completely wrong!

In terms of treating resolution:

  1. I think the best way to account for resolution is to evaluate the objective function (including the resolution convolution) for each voxel in the data (maybe even averaging the objective function over the voxel) and then do the equivalent binning before calculating the cost function on the binned data - I believe this is what Horace (software for analysing direct inelastic data) does. A rough sketch of this idea follows the list.
  2. One advantage of calculating the cost function on the binned data is to make sure you have enough counts to be able to use chi-squared as a cost function instead of the slower Poisson MLE (though of course it depends how slow it is to evaluate the objective function on each voxel...).
  3. Another advantage is that it might be easier to parameterise background functions etc.
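
For what it's worth, a rough sketch of what I mean in 1. and 2. (hypothetical names, not Horace or sasdata code):

import numpy as np

def binned_chi2(params, voxel_x, voxel_counts, voxel_err, model, bin_index, nbins):
    # model(voxel_x, *params) is assumed to already include the resolution convolution
    yfit_vox = model(voxel_x, *params)
    # apply *identical* binning to model and data: bin_index maps each voxel to an output bin
    # (assumes every output bin contains at least one voxel and has non-zero error)
    yfit_bin = np.bincount(bin_index, weights=yfit_vox, minlength=nbins)
    y_bin = np.bincount(bin_index, weights=voxel_counts, minlength=nbins)
    e_bin = np.sqrt(np.bincount(bin_index, weights=voxel_err**2, minlength=nbins))
    # chi-squared is only reasonable here if the binned counts are large enough (point 2)
    return np.sum(((y_bin - yfit_bin) / e_bin) ** 2)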

I also have a particular concern about including fractions of a voxel:

  1. If you measure N counts in a bin, the error for that bin is sqrt(N). If you instead measured with half the bin width you would get two bins with, on average, N/2 counts each (for a constant function; this may break down for small N...), with an error of sqrt(N/2) in each. However, if you naively applied error propagation to the fractional-overlap method, you might instead come up with an error of sqrt(N)/2, which I think is an underestimate. Do you agree, or maybe it's more subtle than that? (See the short numerical check after this list.)
  2. You can also get odd artefacts if you assume the counts are distributed evenly over the bin - I don't know if that is an assumption you're making?
    • A common effect is to systematically increase the width of peaks (though admittedly I don't calculate the uncertainty on the fit parameters here; I normally use numdifftools for this, but I haven't got it installed on this computer).
    • Note that here I am fitting using a cost function I believe is appropriate for Poisson-distributed counts; if I were using chi-squared the effect may be worse due to the possible underestimation of the errors.
    • You could interpolate instead to reduce such artefacts, but then you may run into issues with noisy data.
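
A quick numerical check of the point in 1., as a standalone toy (not your code):

import numpy as np

rng = np.random.default_rng(0)
N = 100             # mean counts in the original (wide) bin
n_repeats = 100_000

# what you would actually measure with half-width bins: independent Poisson counts
half_bins = rng.poisson(N / 2, size=n_repeats)
print("measured half-bin std:   ", half_bins.std())          # ~ sqrt(N/2) ≈ 7.07

# what naive error propagation on a fractional split of the wide bin suggests
wide_bins = rng.poisson(N, size=n_repeats)
print("fractional half-bin std: ", (0.5 * wide_bins).std())  # ~ sqrt(N)/2 = 5, an underestimate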

[Image: comparison of fits to the original, bunched and unbunched data, generated by the code below]

See some toy code here (warranty not included, may be bugs!)

import matplotlib.pyplot as plt
import numpy as np
from scipy import optimize

# generate data
def fgauss(x, ht, cen, sig, bg):
    return ht * np.exp(-0.5 * ((np.asarray(x) - cen) / sig) ** 2) + bg

npts = 15
np.random.seed(1)
x = np.linspace(0,1,npts)
pinit = [50, (x[-1]-x[0])/2, (x[-1]-x[0])/8, 5]  # ht, cen, sig, bg
y = np.random.poisson(lam=fgauss(x, *pinit))
e = np.sqrt(y)

# bunch data (combine adjacent 3 bins)
nbunch = 3
xbunch = x.reshape(-1, nbunch).mean(axis=1)
ybunch = y.reshape(-1, nbunch).sum(axis=1)
ebunch = np.sqrt(np.sum(e.reshape(-1, nbunch)**2, axis=1))  # or equivalently np.sqrt(ybunch) as data not normalised

# un-bunch the data using fractional area method
# could interpolate instead...
yfrac = np.repeat(ybunch/nbunch, nbunch)
efrac = np.repeat(ebunch/nbunch, nbunch)  # underestimates error (should be sqrt(y)?)

# try fitting with poisson MLE

def poisson_cost(params, x, ycnts, fobj):
    yfit = fobj(x, *params)
    # replace values below 1e-4 in the model with a small number;
    # this stops ycnts[ifit]/yfit[ifit] = inf (but still punishes the fit)
    yfit[yfit < 1e-4] = 1e-4
    # get log-likelihood for non-zero bins
    ifit = ycnts>0
    D = 2*np.sum(ycnts[ifit]*np.log(ycnts[ifit]/yfit[ifit]) - (ycnts[ifit] - yfit[ifit]))
    # include contribution from zero bins (first term is zero)
    D = D + 2*np.sum(yfit[~ifit])
    return D

# fit data
res = optimize.minimize(poisson_cost, pinit, args=(x, y, fgauss), 
                        method='L-BFGS-B')
res_bunched = optimize.minimize(poisson_cost, pinit, args=(xbunch, ybunch/nbunch, fgauss), 
                                method='L-BFGS-B')
res_frac = optimize.minimize(poisson_cost, pinit, args=(x, yfrac, fgauss), 
                             method='L-BFGS-B')
print("Actual parameters p = ", np.round(pinit, 3))
print("Fit to simulated data p= ", np.round(res.x,3))
print("Fit to bunched data p= ", np.round(res_bunched.x,3))
print("Fit to unbunched data p= ", np.round(res_frac.x,3))

fig, ax = plt.subplots()
ax.errorbar(x, y, yerr=e, marker='o', capsize=2, ls='', color='k', label='data')
ax.errorbar(xbunch, ybunch/nbunch, yerr=ebunch/nbunch, marker='o', capsize=2, ls='', color='r', label='bunched')
ax.errorbar(x, yfrac, yerr=efrac, marker='s', capsize=2, ls='', color='b', markersize=2, alpha=0.5, label='unbunched')
ax.plot(x, fgauss(x, *pinit), '-k', label='actual pdf')
ax.plot(x, fgauss(x, *res.x), '--k', label='fit to data')
ax.plot(x, fgauss(x, *res_bunched.x), ':r', label='fit to bunched')
ax.plot(x, fgauss(x, *res_frac.x), ':b', label='fit to unbunched')
ax.set_xlabel('x')
ax.set_ylabel('y')
ax.legend(fontsize=8)
fig.show()

Prints out

#                           ht      cen     sig     bg
Actual parameters p =      [50.     0.5    0.125  5.   ]
Fit to simulated data p=   [51.471  0.519  0.12   3.206]
Fit to bunched data p=     [46.321  0.513  0.133  3.224]
Fit to unbunched data p=   [42.608  0.513  0.144  3.35 ]

To be clear, I don't know if the above code reflects what you're actually doing here; my point is that this can go wrong and it's worth checking!

andyfaff commented 10 months ago

Another aspect that wasn't mentioned by @RichardWaiteSTFC: whenever you divide pixels to do fractional rebinning, the signal in adjacent pixels is no longer independent but becomes correlated. Thus, to be 100% correct, the error propagation needs to take covariance into account.

Here's a demo of how it arises:

import numpy as np
import uncertainties as unc
from uncertainties import ufloat
from uncertainties import unumpy as unp

a = ufloat(15, unp.sqrt(15))
b = ufloat(20, unp.sqrt(20))
c = ufloat(25, unp.sqrt(25))

d = 0.5*a + 0.5*b
e = 0.5*b + 0.5*c

print(repr(d))
print(repr(e))

print(np.array(unc.covariance_matrix([d, e])))

gives:

17.5+/-2.9580398915498085
22.5+/-3.3541019662496847
[[ 8.75  5.  ]
 [ 5.   11.25]]

If d and e were uncorrelated, the off-diagonal terms would be zero.

Another consequence of taking these kinds of correlations into account is that actually all datapoints in a reduced SAS measurement are correlated! As well as this kind of fractional rebinning, the reduction typically uses steps like normalising images/spectra by dividing by a common monitor count. Since this monitor count has an uncertainty, all resultant datapoints become correlated:

a = ufloat(100, unp.sqrt(100))
b = ufloat(201, unp.sqrt(201))
monitor = ufloat(80, unp.sqrt(80))

print(repr(a/monitor))
print(repr(b/monitor))

# again, non-zero off-diagonal terms indicate that a/monitor and b/monitor are correlated
print(np.array(unc.covariance_matrix([a / monitor, b / monitor])))

gives:

1.25+/-0.1875
2.5125+/-0.3321361966498081
[[0.03515625 0.03925781]
 [0.03925781 0.11031445]]

In many cases the correlation may be sufficiently small to be ignorable, but one should at least know about it. Its effects can be surprisingly large. In the monitor example I gave, one can minimise the correlation by greatly increasing the monitor counts. Given that reduction contains many steps, one has to take good care of how the error propagation is carried out. Correlation is something that is often conveniently ignored.
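
As a quick illustration of that last point (my own toy numbers, again using the uncertainties package):

import numpy as np
import uncertainties as unc
from uncertainties import ufloat

def monitor_correlation(monitor_counts):
    # two detector counts normalised by a shared monitor
    a = ufloat(100, np.sqrt(100))
    b = ufloat(201, np.sqrt(201))
    monitor = ufloat(monitor_counts, np.sqrt(monitor_counts))
    return unc.correlation_matrix([a / monitor, b / monitor])[0][1]

print(monitor_correlation(80))      # sizeable correlation from a poorly counted monitor
print(monitor_correlation(80_000))  # correlation shrinks as the monitor is counted for longer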

Heybrock et al. wrote a paper on these kinds of effects.

lucas-wilkins commented 4 months ago

[Quoting @andyfaff's comment above in full.]

I only just saw this, Andy. Sure, I get that. I have spoken to Simon and read the paper. I even have a project plan at ISIS to fix some of these things (it won't happen, but hey!). The fact is, we don't have that information.

lucas-wilkins commented 4 months ago

Correlation is taken into account wherever it is known about though.

lucas-wilkins commented 1 month ago

This should not have been closed; weird git stuff!