JuliaDynamics / TimeseriesSurrogates.jl

A Julia package for generating timeseries surrogates
https://juliadynamics.github.io/TimeseriesSurrogates.jl/stable/

Use hypothesis testing to check the implementation #66

Open felixcremer opened 4 years ago

felixcremer commented 4 years ago

To continue the testing discussion from #12.

To provide better tests, we could use the hypothesis testing that the surrogates are meant for directly in the test cases. The idea would be to have, for every surrogate method, one time series that adheres to the null hypothesis and another that should reject it. This way we would have tested that the surrogates are doing what they should. To test the null hypothesis that a surrogate method supports, we could compare the autocorrelation between the original data and an ensemble of surrogates, at least for some of the surrogate methods.
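
A minimal sketch of the kind of test this proposes, assuming the package's `surrogate` interface and using the lag-1 autocorrelation as the (subjectively chosen) discriminatory statistic; the AR(1) example series, ensemble size, and rejection rule are illustrative, not part of the actual test suite:

```julia
# Illustrative only: an AR(1) series has strong serial correlation, so it should
# reject the null of the RandomShuffle method (uncorrelated values), whereas a
# white-noise series should not. Statistic, ensemble size and rejection rule
# are all subjective choices.
using TimeseriesSurrogates, StatsBase, Statistics, Random, Test

Random.seed!(1234)
n = 2000
x = zeros(n)
for t in 2:n
    x[t] = 0.7 * x[t-1] + randn()   # AR(1) process with strong lag-1 correlation
end

stat(ts) = autocor(ts, [1])[1]      # discriminatory statistic: lag-1 autocorrelation

# Distribution of the statistic under the null, from an ensemble of surrogates.
svals = [stat(surrogate(x, RandomShuffle())) for _ in 1:200]
lo, hi = quantile(svals, [0.025, 0.975])

# The original statistic lies far outside the surrogate distribution => reject the null.
@test !(lo ≤ stat(x) ≤ hi)
```

Conversely, a series that is itself uncorrelated noise would be expected to fall inside the surrogate quantiles, i.e. not reject the null.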

In fact, testing is an issue about which we, like you, are a bit unsure... We do not know exactly how to do "proper testing", since many of the "hypothesis testing" procedures discussed in the papers are rather subjective and not really something you can write unit tests for.

What do you see as subjective about the hypothesis testing?

Datseris commented 4 years ago

Well, once you try to do exactly what you suggest here, you will find out the following truths:

  1. Hypothesis testing needs discriminatory statistics. Choosing a discriminatory statistic is a subjective thing, not an objective one. There is no mathematical equation that tells you "this is the statistic you must choose".
  2. Hypothesis testing is not a boolean process at all. There is no yes or no outcome. You only see numeric similarity. Whether to "reject" a hypothesis or not depends on how much similarity you are willing to accept, which is totally subjective.
  3. Some of the surrogate methods do not in fact satisfy the criteria mentioned in the papers that introduce them. For example, for the pseudo-periodic surrogates I personally was unable to reproduce the paper that introduced them: I could not find any difference in the recommended discriminatory statistics between systems that the paper claimed should show a fundamental difference. I compared a chaotic Rössler with a periodic Rössler, as in the paper.

Of course, feel free to make PRs that do "proper hypothesis testing" in the test suite if you'd like, but the process is not as trivial as you might imagine, and for me this is not at all a priority. TimeseriesSurrogates is about making surrogate timeseries. What should be tested is whether the surrogates satisfy their defining properties, not what you should do with them after you have them. For example, the Fourier surrogates should retain the spectrum, etc.
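
As an illustration of what such a defining-property test could look like (a sketch only; the tolerance and the exact assertion are assumptions, not the package's actual tests), one could check that a phase-randomized `RandomFourier` surrogate retains the Fourier amplitudes of the original signal:

```julia
# Sketch of a "defining property" test: a phase-randomized Fourier surrogate
# should retain the power spectrum (Fourier amplitudes) of the original signal.
# The tolerance is an arbitrary, illustrative choice.
using TimeseriesSurrogates, FFTW, Random, Test

Random.seed!(42)
x = randn(1024)
s = surrogate(x, RandomFourier())   # randomize phases, keep amplitudes

# Compare amplitudes, skipping the DC term (how the mean is handled is an
# implementation detail of the method).
@test isapprox(abs.(rfft(x))[2:end], abs.(rfft(s))[2:end]; rtol = 1e-6)
```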

To give you an example: the Distributions.jl package tests whether the distributions satisfy their defining properties, not whether, say, the Normal distribution approximates the distribution of human heights.

kahaaga commented 4 years ago

The issue of subjectivity

Hypothesis testing needs discriminatory statistics. Choosing a discriminatory statistic is a subjective thing, not an objective one. There is no mathematical equation that tells you "this is the statistic you must choose".

Hypothesis testing is not a boolean process at all. There is no yes or no outcome. You only see numeric similarity. Whether to "reject" a hypothesis or not depends on how much similarity you are willing to accept, which is totally subjective.

Yes, the problem of choosing a suitable discriminatory statistic is not trivial and will involve subjectivity to some extent. The same applies to deciding on a threshold for rejecting the null hypothesis. These choices will be context-dependent (are the data direct measurements or proxy measurements? what are the signal-to-noise ratios?) and system-dependent (systems are sensitive to different discriminatory statistics to varying degrees).

Of course, feel free to make PRs that do "proper hypothesis testing" in the test suite if you'd like, but the process is not as trivial as you might imagine.

The latter statement is why I haven't added anything but the most basic tests so far.

One obvious problem is that some of the methods also require you to choose values for method parameters, which must be carefully tuned for the particular time series you are working with.

It is not at all obvious to me how to tune these parameters, given a particular time series, to obtain surrogates that behave the way I want them to.

For the random shuffle and Fourier-based methods, that is not so much of a problem, because they are parameter-free. In fact, we already test for the basic assumptions of these surrogates.
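
For reference, a sketch of what such basic-assumption tests amount to (illustrative only; the actual test suite may phrase them differently): a random shuffle is a permutation of the data, and AAFT surrogates are a rank-ordered rescaling onto the original values, so both should preserve the value distribution exactly.

```julia
# Illustrative sketch of basic-assumption tests (not the package's literal tests).
using TimeseriesSurrogates, Test, Random

Random.seed!(7)
x = randn(500)

# A random shuffle is just a permutation, so the multiset of values is unchanged.
@test sort(surrogate(x, RandomShuffle())) == sort(x)

# AAFT surrogates rescale a phase-randomized series back onto the original
# values by rank ordering, so the sorted values should also match exactly.
@test sort(surrogate(x, AAFT())) == sort(x)
```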

Making more complicated tests

If there is a particular example that can verify that the methods work the way they are supposed to - as shown by the authors - then it would be nice to add it to the test suite. However, this is not unit testing per se, because what we would then be doing is replicating the papers, given the subjective choices of the authors, not verifying that the implementations here actually do what they are supposed to (which is covered by the permutation tests already included where relevant). That would almost be hypothesis-testing the hypothesis tests of the original authors 😁

To test the null hypothesis that a surrogate method supports, we could compare the autocorrelation between the original data and an ensemble of surrogates, at least for some of the surrogate methods.

What you propose, @felixcremer, is absolutely possible for the Fourier methods. To test that the linear properties of the original signal are preserved by the RandomFourier, AAFT and IAAFT methods, we could include a few tests. For example, we could check whether the autocorrelation at lag l (for example l = 1) is within a certain "acceptable" threshold of that of the original time series. But "acceptable" is again a matter of choice, which is subjective.
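
A sketch of what that check could look like (the threshold, the AR(1) test series, and the set of methods are illustrative assumptions, not a recommendation):

```julia
# Illustrative only: check that the lag-1 autocorrelation of a surrogate stays
# within a (subjectively chosen) threshold of the original's.
using TimeseriesSurrogates, StatsBase, Random, Test

Random.seed!(11)
n = 2000
x = zeros(n)
for t in 2:n
    x[t] = 0.7 * x[t-1] + randn()   # AR(1): a signal with nontrivial linear structure
end

acceptable = 0.05                    # arbitrary threshold, for illustration only
for method in (RandomFourier(), AAFT(), IAAFT())
    s = surrogate(x, method)
    @test abs(autocor(x, [1])[1] - autocor(s, [1])[1]) < acceptable
end
```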

If you have good examples where we have well-reasoned choices for the discriminatory statistic and its threshold (for the Fourier methods: the difference in autocorrelation at a particular lag), then feel free to open PRs. However, the tests should not be so restrictive that the package fails CI because we made overly strict subjective choices in our replicate-the-original-paper tests.

A funny note: the statement "The ACF of the original time series coincides with the ACF of the iAAFT surrogate and the one of the TS" is used in the original twin surrogate paper to verify that twin surrogates (TS) also preserve linear properties, as AAFT does. In other words, the original authors of that paper judge this visually from a plot, without any hypothesis test. If that is good enough for publication, it is good enough for our package, I guess 🤖

In summary, what we want to test is that the implementations work as described in the original papers, not that the methods themselves are valid approaches to solving a set of problems by applying our own hypothesis tests. Like @Datseris, I have had trouble replicating some of the original papers. That may be because I'm failing to interpret steps in their algorithms correctly, or because the surrogate methods themselves are flawed. Until I can do a relatively systematic study of each of them, I will not include new methods.

The "passing test" would be to replicate the original papers. For the Fourier-based methods, visually checking that the autocorrelation functions align between original time series and surrogates satisfies this criterion for me.

That is, again, a subjective choice. If that can be partly remedied by some good objective-ish numerical procedure, I won't object 😄