Enforcing a zero derivative at the edges of the distribution is definitely a must. The fact that we discretize the distance vector r to a set of distances constrained to a certain rmin and rmax implies that we assume that outside that range P(r) = 0. This is an unchallengeable fact.
Now given this, if a zero derivative is not imposed at the edges, we would allow discontinuities at P(rmin) and P(rmax), which would be unacceptable from both a mathematical and a modelling point of view (as it is practically physically impossible).
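To make this concrete, here is a minimal numpy sketch (not DeerLab's actual implementation) of a second-derivative operator built with and without the edge rows. The edge rows use ghost points P(rmin - dr) = P(rmax + dr) = 0, which is exactly the assumption that P(r) vanishes outside the modeled range:

```python
import numpy as np

def second_derivative_operator(n, include_edges=True):
    """Second-order finite-difference operator for a P(r) vector of length n.

    include_edges=True adds boundary rows that use ghost points
    P(rmin - dr) = P(rmax + dr) = 0, so the assumption that P(r) = 0
    outside the modeled range is built into the smoothness penalty.
    include_edges=False penalizes interior points only, leaving the
    edge values of P(r) unconstrained.
    """
    L = np.diff(np.eye(n), n=2, axis=0)           # interior [1, -2, 1] rows, shape (n-2, n)
    if not include_edges:
        return L
    top = np.zeros(n); top[:2] = [-2, 1]          # stencil at rmin with a ghost zero to the left
    bottom = np.zeros(n); bottom[-2:] = [1, -2]   # stencil at rmax with a ghost zero to the right
    return np.vstack([top, L, bottom])            # shape (n, n)
```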
I am aware that the rise of distribution mass at the edges is a popular tool for diagnosing problems in the data analysis. This is sadly another leftover from DeerAnalysis and is IMO a very dangerous approach to handling things.
First, to diagnose numerical phenomena such as non-identifiability there are multiple approaches directly suited for that. Employing a faulty model with a faulty mathematical definition merely as a substitute for more suitable analysis methods is wrong.
Second, employing the edges of the distributions as diagnostic tools results in a purely subjective analysis of the data, which goes against all the ongoing effort to impose principles to ensure automatic and "bias-free" workflows.
Summarizing: enforcing a zero derivative at the edges of P(r) is both the mathematically and physically correct definition. So unless presented with stronger evidence/arguments, I will maintain the current default behavior in DeerLab. As always, if someone wants/requires the other functionality, it is now readily available to use. So there should be no problems.
We have two models that differ in how they treat the edges: let's call them model S (enforces smoothness at edges) and model N (does not enforce smoothness at the edges).
If the distance range is appropriate, then P tends towards zero at both edges for both S and N, and there is no difference between the two models. They are practically equivalent.
If the distance range is inappropriate, or if there are foreground/background identifiability issues, then the models differ. In this case, it's not that S is better than N. No, neither of them is appropriate. The problem with S is that it gives no indication that it is wrong. In contrast, N provides a nice direct visual indicator of the problem. Why eliminate this useful visual feature?
Of course, N makes less sense physically - why would a distribution be smooth everywhere except at two points? This is, however, only an objection in principle. Practically, it never matters when the model actually fits and there are no identifiability issues: again, S and N give the same result if the model is otherwise appropriate and identifiable.
We should leave this issue open, make sure the regoperator option includeedges can be set for all fit functions, and explore whether N or S lead to any edge cases that are visually confusing. Once identifiability is correctly handled (with a second regularization term) and distance ranging is more automatic, then we can revisit this.
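For reference, a minimal usage sketch of that option, assuming the Python version of DeerLab; the regoperator call and the includeedges keyword are taken from this thread, so the exact argument names and defaults may differ in the actual release:

```python
import numpy as np
import deerlab as dl

r = np.linspace(1.5, 6, 150)                    # distance axis in nm
LS = dl.regoperator(r, 2, includeedges=True)    # model S: smoothness enforced at the edges
LN = dl.regoperator(r, 2, includeedges=False)   # model N: edges left out of the penalty
```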
I completely agree with Luis' point of view, in particular that there can't be density outside of the modeled P(r) range, and that using the tails, which are an artifact, to assess the quality of a background correction is very unfortunate and clearly stems from an old but wrong way of thinking about the data. Both points are things I mentioned in group meetings but were met with resistance. Tikhonov implies a uniform smoothness across r, so having rising tails clearly means that there is density outside the r range. If you want to argue that this "just corresponds" to a step drop/cutoff, then that does not work in conjunction with Tikhonov and how we do Tikhonov. Both arguments for tails are therefore wrong in my opinion. Thinking about this, I believe the option to have floating edges should be removed again.
While I agree that having discontinuities is theoretically incorrect, I think this may be a case where practicality beats purity. The problem is that a novice user sees a smooth P(r), thinks "That looks great, I'm done", and never considers running any additional tests. In contrast, if there are discontinuities at the tails, even the most basic user can tell that there is something physically wrong. It is unrealistic to expect a beginner to use specialized methods when the data "look good" on the first run.
Additionally, Pribz's data clearly suggest that this assumption creates a false minimum as the bottom ensemble has less variance than the top ensemble. Again, you can't expect beginners to look at the bottom and conclude that specialized methods are needed.
I really think the default should be to exclude the edges from smoothing, for all the reasons stated above by me and mtessmer.
This relates to the recent PR #204 and commit 6ad50949813b6a7b7ab9e2a7122a1124d019e551. Thinking about whether the regularization operator should smooth the first and last points as well, I ran a simulation on a particularly bad data set.
Here you see the Bayes results for the regularization operator not enforcing P(1) = P(end) = 0 (the newly added option to regop):

[figure: Bayes P(r) ensemble, edges not enforced]

This is the result if P(1) = P(end) = 0 is enforced (the standard regop behavior):

[figure: Bayes P(r) ensemble, edges enforced]

The bottom was previously the only regularization operator and is currently the default. Now the question is what the default should be. Personally, I feel that the bottom makes more sense from a logical, mathematical standpoint. However, I guess a side effect of restricting the edge points is that the P(r)s look very similar and therefore might not point to the bad quality of the data as strongly as the top example does - in the Bayesian approach at least. How that manifests in deerlab could be different. It should also be noted that the differences between the two regularization operators are not usually this stark, but this is a case of particularly bad data.
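For anyone who wants to reproduce the qualitative effect without the original data, here is a self-contained toy Tikhonov comparison (plain numpy, no nonnegativity constraint, with a synthetic kernel and distribution rather than the data set discussed here); the true distribution deliberately has mass near rmax, i.e. an inappropriate distance range:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 80
r = np.linspace(0, 1, n)
P_true = np.exp(-0.5 * ((r - 0.9) / 0.08) ** 2)     # true P(r) with mass near the upper edge
K = np.exp(-np.outer(np.linspace(0, 3, 120), r))    # smooth, ill-conditioned toy kernel
V = K @ P_true + 0.01 * rng.standard_normal(120)    # noisy synthetic signal

alpha = 0.1
for include_edges in (True, False):
    if include_edges:
        # ghost zeros outside [rmin, rmax]: the penalty pulls the edge values towards zero
        L = np.diff(np.vstack([np.zeros(n), np.eye(n), np.zeros(n)]), n=2, axis=0)
    else:
        L = np.diff(np.eye(n), n=2, axis=0)          # interior smoothness only
    # Tikhonov solution: argmin ||K P - V||^2 + alpha^2 ||L P||^2
    A = np.vstack([K, alpha * L])
    b = np.concatenate([V, np.zeros(L.shape[0])])
    P_fit, *_ = np.linalg.lstsq(A, b, rcond=None)
    print(f"include_edges={include_edges}: P(rmin)={P_fit[0]:+.3f}, P(rmax)={P_fit[-1]:+.3f}")
```

Depending on the noise level and alpha, the run with edge rows adds pressure that pulls P(rmax) towards zero, while the run without them is free to show a tail, which mirrors the qualitative difference between the two figures above.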