I've actually thought a little bit about this too. In the short term, I get the feeling that a lot of the thresholds users are using were found by trial and error, so I'm not sure changing the parameter to a p-value will make much of a difference in terms of selecting the correct threshold (although it would seem a little more statistical). That does beg the question of whether trial and error will really yield better results than deriving the threshold from a chi distribution, however. That being said, it is something important to consider, and documentation could definitely be useful.
What I think would be nice would be some sort of well-thought-out workflow for choosing which bands to use and what threshold to set. This would be something done for each scene, and a little more involved than just clicking around in TSTools and changing the threshold. If we could take the time to document these 'change scores' or residuals for each band for the different land cover classes and change classes, we could easily test what the threshold should be in a more robust fashion. Thoughts?
I am in favor of converting the `threshold` parameter to `p_value` and retrieving the threshold using `scipy.stats.chi.isf` for a given number of `test_indices`.
I definitely understand the concern that this could be misleading/confusing since we are not actually finding change using a formal null hypothesis testing framework, but I think the chi distribution value we are currently using is no less confusing and most of us are selecting by trial and error in a less than meaningful way. By internally selecting the critical value based on a user-defined p-value and number of test indices, we would (a) ensure comparability across model runs with variable number of test indices (something that is currently easy to overlook!) and (b) put the critical value into terms that are likely more familiar to users (i.e. I think it would be easier to explain that a p-value of 0.1 will be more sensitive to change compared to a p-value of 0.01 versus trying to explain where the threshold of "3.368" came from).
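For concreteness, here is a minimal sketch of the conversion being discussed; the helper name and the example `test_indices` are hypothetical, and only `scipy.stats.chi.isf` comes from the proposal itself:

```python
# Minimal sketch: back out the change detection threshold from a
# user-supplied p-value and the number of test indices.
from scipy import stats

def threshold_from_p_value(p_value, test_indices):
    """Critical value of the chi distribution with df = len(test_indices)."""
    return stats.chi.isf(p_value, df=len(test_indices))

# A p-value of 0.1 gives a smaller, more change-sensitive threshold than a
# p-value of 0.01 for the same set of test indices (hypothetical indices here).
print(threshold_from_p_value(0.10, [2, 3, 4]))  # ~2.50
print(threshold_from_p_value(0.01, [2, 3, 4]))  # ~3.37
```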
I also agree that there should be better documentation explaining the interpretation of the p-value/critical value in this context and how it is selected. In the next couple of days, I plan to write up what I learned yesterday from discussing all of the inputs with Chris, as a starting point.
To Eric's point about a workflow, I am also working on some test runs comparing change results for different parameter sets. I'm using a relatively small subset (~500,000 pixels) in the Broadmoor/Boston area, but it should be interesting to see if/how tweaking different parameters, fit types, and design matrices will impact the change results. I'll keep you both posted on what I find, and on any useful lessons learned that we could share as examples in the documentation. Once I have a good process for comparing/a more standard set of model runs, it might also be worth running a few other subsets for further comparison.
Some arguments against, which is currently my opinion:

- `threshold` seems correctly "less statistical"

We could also just allow both in the `CCDCesque` configuration. If the user specifies `p_value`, then we back out `threshold` against the chi distribution. If `threshold` is specified, well then there's no difference versus current behavior. If the user specifies both parameters, we keep `threshold` as being authoritative and yell at the user for being an idiot \s. The only downside is that this might add more confusion than currently exists.
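A rough sketch of what that accommodation could look like, assuming a hypothetical `resolve_threshold` helper rather than the actual CCDCesque configuration code:

```python
# Hypothetical resolution of the two parameters: `threshold` is authoritative
# if both are given; otherwise `p_value` is converted via the chi distribution.
import warnings

from scipy import stats

def resolve_threshold(test_indices, threshold=None, p_value=None):
    if threshold is not None:
        if p_value is not None:
            warnings.warn('Both `threshold` and `p_value` given; using `threshold`')
        return threshold  # no difference versus current behavior
    if p_value is not None:
        return stats.chi.isf(p_value, df=len(test_indices))
    raise ValueError('Specify either `threshold` or `p_value`')
```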
I'm definitely in favor of better documentation (see #35 for documenting how we go about selecting parameters). Valerie's test runs ought to also give us some actual numbers behind what's just been a speculative sense of the sensitivity of each parameter, and this would make for some great documentation material.
I'm also up for better framing the `threshold` parameter, either by running some sort of simulation experiment or by analyzing what kinds of change processes produce what magnitudes of change. I've always very much liked Figure 4 in Jan's 2012 BFAST paper, where Jan shows the probability of detecting a change given `d` observations of the disturbance and `m` magnitude of said disturbance, as a function of the noise in the time series. Doing this sort of thing in multivariate space probably complicates matters, but it would be great if we could at least demonstrate some simple examples. There's also a bit of a replication analysis to be done in looking at which bands are most affected by certain types of change processes (I think Mike Wulder showed this? Or was it Warren Cohen?)
I think it makes more sense to continue using the threshold and try to figure out the possible sensible ranges and their outcomes for a given group of parameter sets. I believe that using the p-value only adds one more layer of complexity, gives the parameter a "statistical look" that it really doesn't have, and doesn't solve the problem of having to pick it by trial and error. Ultimately, I think that only a sensitivity analysis similar to what Val is running will give us some guidelines on how to choose the threshold (and other parameter values). Maybe then we will be able to create some kind of table/plot/widget that shows, for example, how playing with certain ranges of values makes change detection more or less likely to happen.
I guess my main concern with the current `threshold` is the lack of insight on how to effectively change values when you change the number of test indices...
I think the p-value approach is advantageous because it handles the connection to `test_indices` under the hood. This way, you can set a constant p-value, change the number of test indices, and still produce comparable results. With the threshold set the way it is now, it is up to the user to understand that the threshold value must change depending on the number of test indices (so if you want a comparable result for runs with different numbers of test indices, you must change two parameters [`test_indices` and `threshold`] instead of one [`test_indices`, keeping `p-value` constant]). And unless you do use values from the chi (not chi-squared) distribution for a particular p-value, it's hard to determine comparable thresholds. For example, Chris helped me figure out that for p=0.01, if you have 3 test indices (k=3), threshold=3.368, while if you have 4 test indices (k=4), threshold=3.644. I would have never come up with those numbers through trial and error, but those are the numbers I needed to compare results from 3 indices to results from 4 indices. I might go back and look up values for p=0.05, but I still need them to come from the distribution so they are comparable across different numbers of test indices.
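For reference, those critical values can be reproduced directly, assuming the chi (not chi-squared) inverse survival function described above:

```python
# Critical values from the chi (not chi-squared) distribution with df = k,
# for a couple of p-values and numbers of test indices.
from scipy import stats

for p in (0.01, 0.05):
    for k in (3, 4, 5):
        print(f'p={p:.2f}, k={k}: threshold={stats.chi.isf(p, df=k):.3f}')
# e.g. p=0.01, k=3 -> 3.368 and p=0.01, k=4 -> 3.644
```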
If there is another way to account for differing numbers of test indices, that could work too. For example, at one point yesterday, I think we talked about dividing the L2 norm by the number of test bands (or something like that?). Anything that ensures that when I change the test indices, my comparison to the threshold is adjusted accordingly would address my concern.
Maybe another option would be to abstract away from a p-value per se. I mean, when you select a threshold, you are in some sense arbitrarily relating it to the chi (not chi-squared) distribution. Maybe we could have a range of values that correspond to p-values but aren't p-values. For example, maybe `threshold` simply equals p-value*100, such that a threshold of 5 (p=0.05) would be more sensitive to change than a threshold of 1 (p=0.01). Under the hood, the input could be divided by 100 to convert to a p-value for lookup of the critical value, but the user wouldn't be left with the impression that they are specifying a p-value. Something like that.
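A minimal sketch of that idea, with a hypothetical `sensitivity` name for the scaled input:

```python
# Illustrative only: accept a "sensitivity" of p * 100 and convert it to a
# chi critical value internally, so the user never types a p-value directly.
from scipy import stats

def threshold_from_sensitivity(sensitivity, n_test_indices):
    p_value = sensitivity / 100.0              # e.g. sensitivity=5 -> p=0.05
    return stats.chi.isf(p_value, df=n_test_indices)

print(threshold_from_sensitivity(5, 4))  # ~3.08, more sensitive to change
print(threshold_from_sensitivity(1, 4))  # ~3.64, less sensitive to change
```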
I might also just be over-complicating this. I don't know how many users will actually want to produce and compare maps from different test indices. But after doing a couple of runs changing my inputs to see whether the original bands v. BGW produced different results, only to find that the results were not actually a fair comparison unless I adjusted the threshold accordingly, I think it is worth coming up with a better way of handling the relationship between `threshold` and `test_indices` going forward. I'm all for better documentation, but I think there should also be safeguards in case people don't read/understand the parameters the way we would hope.
OK, long post below outlining why I'm going to eliminate option 2 as a possibility for becoming the default method for specifying `threshold`. This leaves option 1 or 3.

I'm in favor of 3 as an accommodation since it's so easy to implement. Any reasons for objection?
I think I understand the argument in favor of 2, but I disagree that we should be making things easier to understand to make it more approachable, to not trip people up, or to make it more of a "set and forget" parameter. There are far too many knobs to turn in CCDC for it to be approachable anyway, and I'm afraid hiding any of them would turn the experience into, "The defaults didn't work so I'm giving up". There really aren't any defaults (except that there has to be something to show the datatype/etc.)!
The larger problem is that the chi distribution isn't suited for this purpose anyway. We're not actually looking at the sum of standard normally distributed variables (`N(0, 1)`) for the chi, or the sum of the squares of `N(0, 1)` variables. The two criteria:

- `residuals / rmse`: scaling by `rmse` isn't the same as scaling by `1 / std` (which guarantees you have `std=1`)
- `RMSE` and residual `std` are basically identical in spirit, but `RMSE` compares `y` against `yhat` while residual `std` compares `resid` against `mean(resid)`; `RMSE` and `std(resid)` are somewhat close sometimes, but they're not equivalent (see below)

Simple code example in R using the `mtcars` dataset:
```r
> fit <- lm(mpg~hp, data=mtcars)
> rmse <- mean(residuals(fit) ^ 2) ^ 0.5
> resid_std <- var(residuals(fit)) ^ 0.5
> print(c(rmse, resid_std))
[1] 3.740297 3.800146
```
So if we're not actually summing `N(0, 1)` data, I'm going to rule against the `p_value` implementation on a technicality as the self-proclaimed BDFL.
It seems a bit of a shame, though, to have to dismiss it on a technicality. Maybe this leads to some discussion about reshaping the "test statistic". We could, for example, just use `std(resid)` instead of `RMSE`, or maybe even use some sort of robust standard deviation estimate instead, but that's all for another ticket and a larger group discussion.
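For what it's worth, a MAD-based robust scale is one common option; the sketch below is only an illustration of the idea, not a proposal for the actual implementation:

```python
# Sketch: scale residuals by a MAD-based robust standard deviation estimate
# instead of RMSE, so a handful of outliers inflates the scale less.
import numpy as np

def robust_scaled_residuals(resid):
    # 1.4826 * MAD approximates the standard deviation under Gaussian noise
    robust_std = 1.4826 * np.median(np.abs(resid - np.median(resid)))
    return resid / robust_std
```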
The "test statistic" used in CCDC is the square root of the sum of squared scaled residuals for all bands tested. This "test statistic" isn't normalized by how many bands you're adding, so the critical value needs to depend on how many test indices you have. If you use more indices to test with, then you'll need to increase the critical value by some amount.
Zhu derives the "test statistic" critical value for `p=0.01` and `k=len(test_indices)` using the inverse survival function of the `scipy.stats.chi` distribution, since we're probably summing squares of normally distributed variables (the scaled residuals).

Questions of independence, normality, statistical soundness, etc. aside, my biggest concern is that we don't really care about finding change according to some null hypothesis testing framework value. CCDC is, at best, vaguely statistical, and we've never analytically or numerically explored what the distribution of the "test statistic" is under the null hypothesis of no change. However, using `scipy.stats.chi.isf` does convey the important message that the critical value depends on how many bands are being tested.

So, the proposed solution either includes:

1. Converting the `threshold` parameter to `p_value` and retrieving the `threshold` using `scipy.stats.chi.isf` for a given number of `test_indices`
2. Allowing both `threshold` and `p_value`, with `threshold` being the default input that overrides `p_value` if both are specified

Thoughts, @bullocke, @valpasq, @parevalo, and @xjtang?