OxIonics / ionics_fits

Small python fitting library with an emphasis on Atomic Molecular and Optical Physics
Apache License 2.0

gaussian fit failing #152

Open hartytp opened 5 months ago

hartytp commented 5 months ago

Test dataset

x = np.array([1847.08794244, 1927.08794244, 1827.08794244, 1867.08794244,
              1767.08794244, 1757.08794244, 1877.08794244, 1837.08794244,
              1897.08794244, 1807.08794244, 1747.08794244, 1777.08794244])

y = np.array([0.32, 0.94, 0.16, 0.64, 0.26, 0.44, 0.88, 0.26, 1.0, 0.0, 0.5, 0.1])

sigma = np.array([0.0751683, 0.04370707, 0.0613972, 0.07703657, 0.07132427,
                  0.07930124, 0.05567317, 0.07132427, 0.01807572, 0.01807572,
                  0.07980068, 0.05224949])
hartytp commented 5 months ago

This is solved in https://github.com/OxIonics/ionics_fits/pull/153

The underlying issue is that the dataset had too few points for the FFT heuristic to work well. While there is an improvement we should make to the FFT heuristic (see below), small datasets like this are always going to be a challenge for it, so I've added a second heuristic based on peak finding.
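To illustrate the FFT side of this, here's a minimal sketch of how a Gaussian width can be read off from the width of its Fourier transform. This is an illustrative toy, not the actual ionics_fits heuristic; the function name and the RMS-width estimator are assumptions for the example.

```python
import numpy as np

def fft_sigma_estimate(x, y):
    """Illustrative sketch (not the ionics_fits implementation): estimate the
    sigma of a Gaussian peak from the width of its Fourier transform.

    The FT of a Gaussian with standard deviation sigma_x is a Gaussian in
    frequency with sigma_f = 1 / (2 * pi * sigma_x), so a width estimate in
    the frequency domain maps straight back to sigma_x.
    """
    order = np.argsort(x)
    x, y = x[order], y[order]
    dx = np.mean(np.diff(x))  # assumes roughly uniform sampling
    y = y - np.mean(y)  # remove DC so the spectrum is dominated by the peak
    spectrum = np.abs(np.fft.rfft(y))
    freqs = np.fft.rfftfreq(len(y), d=dx)
    # RMS width of the one-sided magnitude spectrum as a crude sigma_f estimate
    weights = spectrum / np.sum(spectrum)
    sigma_f = np.sqrt(np.sum(weights * freqs**2))
    return 1 / (2 * np.pi * sigma_f)
```

With only ~12 samples, the frequency grid this produces is so coarse that the spectral width (and hence sigma) is poorly resolved, which is why small datasets defeat this approach.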

These two heuristics are pretty complementary: the FFT works extremely well for noisy datasets with outliers since we filter out high-frequency noise; the peak-based heuristic works really well for smaller datasets, but can't tolerate significant outliers. If you've got a dataset that's both small and has outliers then it's probably fair for ionics_fits to leave it up to you to provide some user estimates!

In the implementation of the peak heuristic, we find sigma by looking at all points whose height above the baseline exceeds 1/e of the peak height. We then take the peak-peak x of this subset and calculate sigma from that. This will be thrown off, for example, if there is an outlier far from the peak which is above the 1/e threshold (or if noise simply pushes a point over the line).
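The step above can be sketched as follows (a simplified stand-in, not the actual ionics_fits code; the function name and `threshold` parameter are assumptions for the example):

```python
import numpy as np

def peak_sigma_estimate(x, y, threshold=np.exp(-1)):
    """Illustrative sketch of the peak-based heuristic: take all points whose
    height above the baseline exceeds `threshold` times the peak height,
    measure their peak-to-peak x extent, and convert that width to a Gaussian
    sigma.

    A Gaussian falls to a fraction h of its peak at x0 +/- sigma*sqrt(-2*ln(h)),
    so the full width at the 1/e level is 2*sqrt(2)*sigma.
    """
    baseline = np.min(y)
    height = np.max(y) - baseline
    above = x[(y - baseline) >= threshold * height]
    width = np.ptp(above)  # peak-to-peak x of *all* points over the line
    return width / (2 * np.sqrt(-2 * np.log(threshold)))
```

Because `above` keeps every point over the line, a single outlier far from the peak stretches `width` and inflates the sigma estimate, which is exactly the failure mode described above.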

In practice that's probably not too much of an issue since if the dataset is large enough to have outliers that far from the peak, it's probably large enough for the FFT to work. However, it's probably worth raising the threshold up the peak a bit since 1/e is pretty close to the baseline - let's modify this to look for the FWHM instead (i.e. a threshold of half the peak height).
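For reference, converting a half-maximum width back to sigma uses the standard Gaussian relation (helper name is just for illustration):

```python
import numpy as np

def sigma_from_fwhm(fwhm):
    # A Gaussian falls to half its peak at x0 +/- sigma*sqrt(2*ln 2),
    # so FWHM = 2*sqrt(2*ln 2)*sigma ~= 2.355*sigma.
    return fwhm / (2 * np.sqrt(2 * np.log(2)))
```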

Another approach would be to look for a contiguous run of points above the threshold and take the peak-peak of that, rather than looking at any points above the threshold. However, that can end up underestimating sigma if we have an outlier in the other direction - a point within the peak that dips below the threshold and splits the run. Without making some assumptions about the kinds of datasets we're fitting, it's not clear to me which approach will be more robust in practice. So, I propose to keep it "as is" (which is the simplest thing to implement) and see how it survives contact with real data. If we have failures we can come back and resolve as appropriate.
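For comparison, the contiguous-run variant discussed above could look something like this (hypothetical, not implemented in ionics_fits):

```python
import numpy as np

def contiguous_peak_width(x, y, threshold=0.5):
    """Sketch of the alternative approach: keep only the contiguous run of
    above-threshold points containing the peak, so a lone outlier far from
    the peak cannot inflate the width. Conversely, a low outlier *inside*
    the peak truncates the run and shrinks the width estimate."""
    order = np.argsort(x)
    x, y = x[order], y[order]
    baseline = np.min(y)
    above = (y - baseline) >= threshold * (np.max(y) - baseline)
    peak = np.argmax(y)
    lo = peak
    while lo > 0 and above[lo - 1]:
        lo -= 1
    hi = peak
    while hi < len(y) - 1 and above[hi + 1]:
        hi += 1
    return x[hi] - x[lo]
```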
