Closed by diego-urgell 3 years ago
First of all, I avoid writing tests like this: "the second one passes around 90% of the time." This is a test that you computed the "right" changepoint for randomly generated data? The "right" answer depends on the randomness, so I would avoid writing tests like this. Solutions:
I don't understand your problem. Can you clarify or describe it in a different way? Is normal change in mean with constant variance working? If not, why not, and how do you know? Maybe the problem is that you need to enforce a min segment length? When you need to estimate the variance parameter, you need at least two different data points per segment.
I understand; as the data is random, I agree we can't really test by expecting the changepoint at index 9 exactly, since the value obtained from the algorithm might be different while still being correct in terms of cost given the data.
Would it be okay to test against the results that the `changepoint` package produces for the same data using BinarySegmentation? It is expected that both implementations should yield the same result since they both use the same algorithm, right?
Normal change in mean with constant variance is working fine. The one with the problem was normal change in mean and variance, since it yielded the second-to-last data point as the optimal changepoint.
However, it is solved. You were right! The problem was indeed the minimum segment length. I included a parameter to set it from the R interface, and used a value of 2 for the normal change in mean and variance. The error was happening in the `Segment::optimalPartition()` function: `this->mid` (which stores the optimal partition index of the segment) was constantly changing, instead of remaining constant at the changepoint's index (in my tests this was 9, because there were 20 data points and the change was in the middle), which is the partition that produces the lowest cost on the segment. Once I enforced a minimum segment length of two, this no longer happened, since a segment with a single data point and zero variance is never considered.
Thanks @tdhock!
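The fix described above can be sketched as follows. This is a minimal illustration, not the package's actual `Segment::optimalPartition()` code: the function and variable names are invented, and additive constants of the Gaussian cost that cancel when comparing splits of the same data are dropped.

```cpp
#include <cassert>
#include <cmath>
#include <limits>
#include <vector>

// Gaussian mean-and-variance cost of data[start..end], up to additive
// constants that cancel across splits. A zero-variance segment has an
// undefined likelihood, so it gets an infinite cost.
double segmentCost(const std::vector<double> &data, int start, int end) {
    int n = end - start + 1;
    double mean = 0.0, var = 0.0;
    for (int i = start; i <= end; i++) mean += data[i];
    mean /= n;
    for (int i = start; i <= end; i++) var += (data[i] - mean) * (data[i] - mean);
    var /= n;
    if (var <= 0.0) return std::numeric_limits<double>::infinity();
    return 0.5 * n * std::log(var);
}

// Best split index of data[start..end], subject to a minimum segment
// length so that single-point (zero-variance) segments are never
// candidates. Returns -1 when no split yields a finite cost.
int bestSplit(const std::vector<double> &data, int start, int end, int minSegLen) {
    int best = -1;
    double bestCost = std::numeric_limits<double>::infinity();
    for (int mid = start + minSegLen - 1; mid + minSegLen <= end; mid++) {
        double cost = segmentCost(data, start, mid) + segmentCost(data, mid + 1, end);
        if (cost < bestCost) { bestCost = cost; best = mid; }
    }
    return best;
}
```

With 20 points and a change in the middle, `bestSplit(data, 0, 19, 2)` lands on index 9; with data like `[1.1, 1, 2, 2, 2]`, every candidate split contains a zero-variance segment, so the function reports that no split is possible.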
> Would it be okay to test against the results that the `changepoint` package produces for the same data using BinarySegmentation? It is expected that both implementations should yield the same result since they both use the same algorithm, right?
I would say that this is a very sensible strategy. You may first want to check that the algorithms are giving the same cost function output on a no-change example before checking on changepoint examples. I'd be interested in the timings too; if there is a material difference, it may be worth pointing changepoint to the new package instead of running the existing code.
Great, I'm glad that helped. If you use the changepoint package in tests then you will need to add it to Suggests and use it conditionally: `if (requireNamespace("changepoint")) { changepoint::cpt.meanvar(...) }`
To avoid introducing a dependency on the changepoint package, I would recommend just using non-random data and computing the answer yourself by hand, or in R code using base functions (`dnorm`, etc.).
About the min segment length, you should be careful, because the issue comes from having a segment with zero variance, and that can happen even on a segment with two data points (if they are equal). You should add some tests for this case. For example, with change in mean and variance your code should return:

- no possible splits for [1.1, 1, 2, 2, 2] -- each possible split would result in a segment with zero variance => infinite cost.
- a split after data point 2 for [1.1, 1, 1, -5] -- all other splits result in a segment with zero variance => infinite cost.

@rkillick I changed the existing tests to use the `changepoint` package value as the reference. The tests pass, and both packages generate the same results for the three distributions that were already implemented 😁. Later on I will also make timing tests, but I would like to complete more of the package first.

> You may first want to check that the algorithms are giving the same cost function output on a no-change example before checking on changepoint examples.

I did not quite understand this. Do you mean generating data with no changepoint in order to test whether both packages get the same cost?

@tdhock for the non-random data, would it be convenient to use a dataset such as neuroblastoma (which I used for the pre-GSOC tests)?
In order to solve this, I made the change of using `INFINITY` when the variance is undefined. More exactly, I used the double's numeric limit as an infinity value, since I have had problems when using `INFINITY`. This works well: when there is no possible changepoint in a segment given the variances, `this->mid` remains at zero (its starting value in the `Segment` class), and the process is finished. Later on I will add a condition in the R code which raises an error or warning if there was an infinite variance and no possible segments.
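For reference, the two sentinel choices discussed above behave slightly differently. This small sketch (an invented function, not the package's code) uses `std::numeric_limits<double>::infinity()`, which propagates cleanly through sums and comparisons, whereas `max()` overflows to infinity as soon as two such costs are added.

```cpp
#include <cassert>
#include <cmath>
#include <limits>

// Return an "infinite" cost when the variance is undefined (zero),
// otherwise the Gaussian cost n/2 * (log(var) + 1). The sentinel must
// compare strictly greater than any finite cost for binseg to skip it.
double costOrInfinity(double variance, int n) {
    if (variance <= 0.0)
        return std::numeric_limits<double>::infinity();
    return 0.5 * n * (std::log(variance) + 1.0);
}
```

A design note on the sentinel choice: `infinity()` stays infinite under addition, while `max() + max()` overflows, so sums of per-segment costs behave more predictably with the former.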
Also @tdhock @rkillick, in order to test the total loss and the parameters of every implemented `Distribution` subclass, I started implementing the params return mechanism we talked about. Every distribution class will have the necessary vectors to store the required parameters, and two functions: `calcParams`, which computes and stores the parameters on every iteration of the binseg algorithm, and `retParams`, which will work as an interface with the RcppInterface file. This works well for the `norm_mean` distribution, and it will extend nicely to all the other ones (I will implement it for `mean_var` and `meanvar_norm` this week).
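A hypothetical sketch of that mechanism is below. Only the two function names `calcParams` and `retParams` come from the discussion; the class name, storage layout, and mean-only parameter are invented for illustration.

```cpp
#include <cassert>
#include <numeric>
#include <vector>

// Invented illustration of the parameter-return mechanism: each
// distribution stores its per-segment parameters as they are computed,
// and exposes them to the Rcpp interface layer afterwards.
class NormMean {
    std::vector<double> means; // one entry per segment, filled per iteration
public:
    // compute and store the segment mean on each binseg iteration
    void calcParams(const std::vector<double> &data, int start, int end) {
        double sum = std::accumulate(data.begin() + start,
                                     data.begin() + end + 1, 0.0);
        means.push_back(sum / (end - start + 1));
    }
    // interface for the Rcpp layer: hand back the stored parameters
    const std::vector<double> &retParams() const { return means; }
};
```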
> @rkillick I changed the existing tests to use the `changepoint` package value as the reference. The tests pass, and both packages generate the same results for the three distributions that were already implemented. Later on I will also make timing tests, but I would like to complete more of the package first.
Glad that they are matching on the changepoints. Have you tried some obvious changes as well as smaller, more challenging ones? The smaller changes will have more variability, and so are a better test of whether they are really giving the same answer.
> You may first want to check that the algorithms are giving the same cost function output on a no-change example before checking on changepoint examples.

> I did not quite understand this. Do you mean generating data with no changepoint in order to test whether both packages get the same cost?
I meant that instead of comparing the changepoint locations, you can compare the cost functions for the segments. This is a more accurate test that they are the same: if they give the same cost function output, then you know that the cost function code is matching. If they match on the cost functions but don't match on the changepoints, then you know that the problem is in the algorithm code.
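One way to sanity-check a cost comparison like this (a sketch under my own assumptions, not either package's code) is to compute the Gaussian mean/variance segment cost two mathematically equivalent ways, as the negative log-likelihood at the MLEs and via the closed form n/2 * (log(2*pi*var) + 1); on a no-change segment both should agree to rounding error.

```cpp
#include <cassert>
#include <cmath>
#include <vector>

// Closed-form Gaussian mean/variance cost: n/2 * (log(2*pi*var) + 1).
double costClosedForm(const std::vector<double> &x) {
    const double PI = std::acos(-1.0);
    int n = x.size();
    double mean = 0.0, var = 0.0;
    for (double v : x) mean += v;
    mean /= n;
    for (double v : x) var += (v - mean) * (v - mean);
    var /= n;
    return 0.5 * n * (std::log(2.0 * PI * var) + 1.0);
}

// Same cost computed directly as the negative log-likelihood at the MLEs;
// the per-point quadratic terms sum to n/2, recovering the closed form.
double costDirectNLL(const std::vector<double> &x) {
    const double PI = std::acos(-1.0);
    int n = x.size();
    double mean = 0.0, var = 0.0;
    for (double v : x) mean += v;
    mean /= n;
    for (double v : x) var += (v - mean) * (v - mean);
    var /= n;
    double nll = 0.0;
    for (double v : x)
        nll += 0.5 * std::log(2.0 * PI * var)
             + (v - mean) * (v - mean) / (2.0 * var);
    return nll;
}
```

If two implementations disagree on values like these while agreeing on the formulas, the discrepancy is in the cost code rather than the search algorithm, which is exactly the diagnostic split described above.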
hi @diego-urgell I see a lot of `std::vector<double>` (STL vector) and I wonder if you should consider using a simple `double*` (double array) instead, for efficiency? Anyway, we need a double array for the R interface, right? And if you are internally using STL vectors, then that means you would have to do an (unnecessary) copy to the R double array, right? It would be more efficient to just use double arrays internally and avoid the copy.
Hi @tdhock, you are right, I was using `std::vector<double>` in such a way that required a complete copy of every parameter vector in the Rcpp interface. I followed your advice: I created an `Rcpp::NumericMatrix` and passed a reference to the C++ code, which modifies it in place, resulting in no useless copy operations.
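The no-copy idea can be illustrated without Rcpp (names invented): write results directly into a caller-owned buffer instead of filling an intermediate `std::vector` and copying it out. In the real package that buffer would be the memory backing an `Rcpp::NumericMatrix`, which R already manages.

```cpp
#include <cassert>
#include <cstddef>

// Write cumulative means of data[0..n-1] directly into a caller-owned
// output buffer, avoiding any intermediate container and copy.
void cumulativeMeans(const double *data, std::size_t n, double *out) {
    double sum = 0.0;
    for (std::size_t i = 0; i < n; i++) {
        sum += data[i];
        out[i] = sum / (i + 1); // written in place, no copy afterwards
    }
}
```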
Also, I think this PR is complete. The package is working for all three changes involving the normal distribution. Several tests were performed using the changepoint package as reference. @rkillick and I agreed that, later on, it would be convenient to substitute the changepoint package result with static values in order to avoid the dependency in the future.
In the tests, I included cases where there are no real changepoints in the data, and both packages compute the same theoretical changepoints. The only thing missing to test here is that the computed cost is the same. However, I will add those tests when the R interface is more advanced, since the coef method would be useful. Also, after the other distributions are implemented, I will start testing with real-world datasets.
I will merge this PR in order to open a new one. Is that ok @rkillick @tdhock? In the future I will try to make more focused PRs, since this one included many changes from different features.
Great, well done on getting this far. Make sure to look at the likelihood values and not just the `coef` values. Also bear in mind that the changepoint package calculates the likelihood in R and not in C, so it uses different functions. Go ahead and merge so we can get towards the documentation and user-facing functions.
Description
In this branch I will implement the cost functions for the following distributions:
Currently, mean and variance is implemented. However, there have been complications.
Tests
I added three tests to the `test-BinarySegmentationTests.R` file in order to try three different types of data with the NormMeanVar distribution. The first test passes almost always, while the second one passes around 90% of the time. However, the third test fails very frequently, yielding a changepoint of 18. I have been debugging this for a couple of days, but I haven't managed to solve it.
Anyway, using `lldb` with a Python script, I found something pretty interesting. The data I set was a vector of length 20, using the parameters from test 3. The value of the `i` counter begins at 1 and ends at 18 on the first execution of this function. The script was designed to print the `currSplitCost` and `this->mid` variables at file `Segments.cpp` line 48, which is part of the `optimalPartition` function. The following screenshot shows the values. As you can note, on the first execution of the function (before `curr_mid` returns to 1), the last value of `curr_mid` is 9 (which is correct), and the costs indeed show that this is the optimal partition. However, when this process is finished and I check the returned answer from the R console, I get the following (ignore the first lines, since they were from unused executions of the `optimalPartition` function):
Clearly there is something going on that changes the cpts value to 18 despite `optimalPartition` finding the correct one. However, I haven't found exactly where this happens. I will continue working on this, because it is probably a mistake we would face with the rest of the distributions.
For now, I just wanted to let you know what was going on. @tdhock @rkillick , do you have any idea where the error could be?
Note: I explicitly created the BS and NormMeanVar classes instead of using the macro, because it is impossible to test with lldb otherwise. When everything is working I will switch them both back to the macro.