diego-urgell / BinSeg

GSOC 2021. R package that performs changepoint analysis using the Binary Segmentation algorithm. Supports several statistical distributions. The model is computed in C++ and then interfaced with R via the Rcpp package.
https://summerofcode.withgoogle.com/projects/5915303348797440
MIT License

Cost function for both remaining normal distribution changes #7

Closed diego-urgell closed 3 years ago

diego-urgell commented 3 years ago

Description

In this branch I will implement the cost functions for the two remaining normal distribution changes: change in variance with constant mean, and change in both mean and variance.

Currently, the change in mean and variance is implemented; however, there have been complications.

Tests

I added three tests to the test-BinarySegmentationTests.R file in order to try three different types of data with the NormMeanVar distribution:

  1. Change in mean with constant variance
  2. Change in variance with constant mean
  3. Change in mean and variance

The first test passes almost always, while the second one passes around 90% of the time. However, the third test fails very frequently, yielding a changepoint of 18. I have been debugging this for a couple of days, but I haven't managed to solve it.

Anyway, using lldb with a Python script, I found something pretty interesting. The data I set was a vector of length 20, using the parameters from test 3. The script was designed to print the currSplitCost and the this -> mid variables at Segments.cpp line 48, which is part of the optimalPartition function. The value of the i counter begins at 1 and ends at 18 on the first execution of this function. The following screenshot shows the values:

[Screenshot: lldb output of the currSplitCost and this -> mid values during optimalPartition]

As you can see, on the first execution of the function (before curr_mid returns to 1), the last value of curr_mid is 9 (which is correct), and the costs indeed show that this is the optimal partition. However, when the process is finished and I check the returned answer from the R console, I get the following (ignore the first lines, since they come from unrelated executions of the optimalPartition function):

[Screenshot: R console output showing the returned changepoints, with cpts equal to 18]

Clearly something is going on that changes the cpts value to 18 despite optimalPartition finding the correct one. However, I haven't found exactly where this happens. I will keep working on this, because it is probably a mistake we would face with the rest of the distributions.

For now, I just wanted to let you know what was going on. @tdhock @rkillick , do you have any idea where the error could be?

Note: I explicitly created the BS and NormMeanVar classes instead of using the macro, because otherwise it is impossible to debug with lldb. When everything is working I will switch them both back to the macro.

tdhock commented 3 years ago

First of all, I avoid writing tests like this: "the second one passes around 90% of the time." Is this a test that you computed the "right" changepoint for randomly generated data? The "right" answer depends on the randomness, so I would avoid writing tests like this. Solutions:

tdhock commented 3 years ago

I don't understand your problem. Can you clarify, or describe it in a different way? Is normal change in mean with constant variance working? If not, why not? How do you know? Maybe the problem is that you need to enforce a minimum segment length? When you need to estimate the variance parameter, you need at least two different data points per segment.
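To make the zero-variance issue concrete, here is a minimal sketch of the normal mean-and-variance segment cost; the names are illustrative, not the actual BinSeg code. With a single data point (or constant data) the estimated variance is zero, so the log term degenerates to negative infinity and the segment looks artificially cheap:

```cpp
#include <cmath>
#include <vector>

// Illustrative only: cost of fitting one normal segment to data[a..b]
// with both mean and variance estimated. Up to constants this is the
// segment's negative log-likelihood, n * log(varianceMLE).
double normMeanVarCost(const std::vector<double> &data, int a, int b) {
    int n = b - a + 1;
    double sum = 0, sumSq = 0;
    for (int i = a; i <= b; i++) {
        sum += data[i];
        sumSq += data[i] * data[i];
    }
    double mean = sum / n;
    double var = sumSq / n - mean * mean; // MLE variance of the segment
    // With n == 1 or constant data, var == 0 and std::log(var) is -inf,
    // so an unguarded search will happily pick a one-point segment.
    return n * std::log(var);
}
```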

diego-urgell commented 3 years ago

I understand. Since the data is random, I agree we can't really test by expecting the changepoint at index 9 exactly, since the value obtained from the algorithm might be different while still being correct in terms of cost given the data.

Would it be okay to test against the results that the changepoint package produces for the same data using BinarySegmentation? It is expected that both implementations should yield the same result since they both use the same algorithm, right?

diego-urgell commented 3 years ago

Normal change in mean with constant variance is working fine. The one with the problem was normal change in mean and variance, since it yielded the second-to-last data point as the optimal changepoint.

However, it is solved. You were right! The problem was indeed the minimum segment length. I added a parameter to set it from the R interface, and used a value of 2 for the normal change in mean and variance. The following error was happening in the Segment::optimalPartition() function:

  1. Before reaching the actual changepoint, the value of this -> mid (which stores the optimal partition index of the segment) was constantly changing.
  2. After the actual changepoint was passed, the value of this -> mid remained constant at the changepoint's index (in my tests this was 9, because there were 20 data points and the change was in the middle), since that is the partition that produces the lowest cost on the segment.
  3. When the second-to-last data point (18) was reached, it was selected as the changepoint. The reason is that the segments were divided as Left -> 0-18, Right -> 19-19. Since the right segment included only one data point, it had zero variance and added very little cost to the partition. Therefore, it was selected as the optimal changepoint.

When I enforced a minimum segment length of two this no longer happened, since a segment with a single data point and zero variance is never considered, as in the sketch below.
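For reference, a minimal sketch of what such a guard can look like in the split search; the function and loop bounds are illustrative, not the actual Segment::optimalPartition code:

```cpp
#include <limits>
#include <vector>

// normMeanVarCost is the illustrative segment cost from the sketch above.
double normMeanVarCost(const std::vector<double> &data, int a, int b);

int bestSplit(const std::vector<double> &data, int start, int end, int minSegLen) {
    double bestCost = std::numeric_limits<double>::max();
    int mid = 0; // this -> mid starts at 0 in the Segment class
    // Both the left segment [start, i] and the right segment [i + 1, end]
    // must contain at least minSegLen points, so with minSegLen == 2 the
    // one-point right segment 19-19 is never considered.
    for (int i = start + minSegLen - 1; i + minSegLen <= end; i++) {
        double currSplitCost = normMeanVarCost(data, start, i)
                             + normMeanVarCost(data, i + 1, end);
        if (currSplitCost < bestCost) {
            bestCost = currSplitCost;
            mid = i;
        }
    }
    return mid;
}
```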

Thanks @tdhock!

rkillick commented 3 years ago

Would it be okay to test against the results that the changepoint package produces for the same data using BinarySegmentation? It is expected that both implementations should yield the same result since they both use the same algorithm, right?

I would say that this is a very sensible strategy. You may first want to check that the algorithms are giving the same cost function output on a no-change example before checking on changepoint examples. I'd be interested in the timings too; if there is a material difference, it may be worth pointing changepoint to the new package instead of running the existing code.

tdhock commented 3 years ago

Great, I'm glad that helped. If you use the changepoint package in tests then you will need to add it in Suggests and use it conditionally: if(requireNamespace("changepoint")){ changepoint::cpt.meanvar(...) }. To avoid introducing a dependency on the changepoint package, I would recommend just using non-random data and computing the answer yourself, by hand or in R code using base functions (dnorm etc). About the min segment length, you should be careful, because the issue comes from having a segment with zero variance, and that can also happen on a segment with two data points. You should add some tests for this case. For example, with change of mean and variance your code should return:

  • no possible splits for [1.1, 1, 2, 2, 2] -- each possible split would result in a segment with zero variance => infinite cost.
  • split after data point 2 for [1.1, 1, 1, -5] -- all other splits result in a segment with zero variance => infinite cost.

diego-urgell commented 3 years ago

@rkillick I changed the existing tests to use the changepoint package values as the reference. The tests pass, and both packages generate the same results for the three distributions that were already implemented 😁. Later on I will also make timing tests, but I would like to complete more of the package first.

You may first want to check that the algorithms are giving the same cost function output on a no-change example before checking on changepoint examples.

I did not quite understand this. Do you mean generating data with no changepoint in order to test whether both packages get the same cost?

diego-urgell commented 3 years ago

@tdhock for the non-random data, would it be convenient to use a dataset such as neuroblastoma (which I used for the pre-GSOC tests)?

  • no possible splits for [1.1, 1, 2, 2, 2] -- each possible split would result in a segment with zero variance => infinite cost.
  • split after data point 2 for [1.1, 1, 1, -5] -- all other splits result in a segment with zero variance => infinite cost.

In order to solve this, I changed the cost to be infinite when the variance is undefined. More exactly, I used the double's numeric limit as the infinity value, since I have had problems when using INFINITY itself. This works well: when there is no possible changepoint in a segment given the variances, this -> mid remains at zero (its starting value in the Segment class), and the process finishes. Later on I will add a condition in the R code that raises an error or warning if there was an infinite variance and no possible segments.
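A minimal sketch of that guard, again with illustrative names; the double numeric limit stands in for infinity, so a degenerate segment can never win the split search:

```cpp
#include <cmath>
#include <limits>

// Illustrative sketch: when the variance is undefined (zero), return the
// largest finite double as an "infinite" cost instead of letting
// log(0) = -inf make a degenerate segment look optimal.
double guardedNormMeanVarCost(double var, int n) {
    if (var <= 0) return std::numeric_limits<double>::max();
    return n * std::log(var);
}
```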

diego-urgell commented 3 years ago

Also @tdhock @rkillick, in order to test the total loss and the parameters of every implemented Distribution subclass, I started implementing the params return mechanism we talked about. Every distribution class will have the necessary vectors to store the required parameters, plus two functions: calcParams, which computes and stores the parameters on every iteration of the binseg algorithm, and retParams, which works as an interface with the RcppInterface file.

This works well for the norm_mean distribution, and it will extend nicely to all the other ones (I will implement it for mean_var and meanvar_norm this week).
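A rough sketch of that mechanism for norm_mean; only the calcParams and retParams names come from the description above, and everything else is guessed for illustration:

```cpp
#include <vector>

// Illustrative sketch of the params return mechanism for norm_mean.
class NormMean {
    std::vector<double> means; // one stored mean per segment found so far
public:
    // Called on every iteration of the binseg algorithm: compute and
    // store the parameter (here, the segment mean) for data[start..end].
    void calcParams(const std::vector<double> &data, int start, int end) {
        double sum = 0;
        for (int i = start; i <= end; i++) sum += data[i];
        means.push_back(sum / (end - start + 1));
    }
    // Interface used by the RcppInterface file to return the parameters.
    const std::vector<double> &retParams() const { return means; }
};
```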

rkillick commented 3 years ago

@rkillick I changed the existing tests to use the changepoint package values as the reference. The tests pass, and both packages generate the same results for the three distributions that were already implemented. Later on I will also make timing tests, but I would like to complete more of the package first.

Glad that they are matching on the changepoints. Have you tried some obvious changes as well as smaller, more challenging ones? The smaller changes will have more variability, and so are a better test of whether they are really giving the same answer.

You may first want to check that the algorithms are giving the same cost function output on a no-change example before checking on changepoint examples.

I did not quite understand this. Do you mean generating data with no changepoint in order to test whether both packages get the same cost?

I meant that instead of comparing the changepoint locations, you can compare the cost functions for the segments. This is a more accurate test that the implementations are the same: if they give the same cost function, then you know that the cost function code matches. If they match on the cost functions but then don't match on the changepoints, then you know the problem is in the algorithm code.

tdhock commented 3 years ago

hi @diego-urgell I see a lot of std::vector<double> (STL vector) and I wonder if you should consider using a simple double* (double array) instead, for efficiency? Anyway, we need a double array for the R interface, right? And if you are internally using STL vectors, that means you would have to do an (unnecessary) copy into the R double array, right? It would be more efficient to just use double arrays internally and avoid the copy.

diego-urgell commented 3 years ago

Hi @tdhock, you are right, I was using std::vector<double> in a way that required a complete copy of every parameter vector in the Rcpp interface. I followed your advice: I created an Rcpp::NumericMatrix, passed a reference to the C++ code, and modified it in place, which resulted in no useless copy operations.
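The pattern is roughly the following (a minimal sketch, not the package's actual interface code). An Rcpp::NumericMatrix is a thin proxy over memory that R already owns, so writing into it from C++ updates the R object in place without copying:

```cpp
#include <Rcpp.h>

// Illustrative sketch: fill an R-allocated matrix in place from C++.
// NumericMatrix shares R's own storage, so no copy of the parameters
// is made when the matrix is passed in and written to.
// [[Rcpp::export]]
void fillParams(Rcpp::NumericMatrix params, Rcpp::NumericVector means) {
    for (int i = 0; i < means.size(); i++) {
        params(i, 0) = means[i]; // writes directly into R's memory
    }
}
```

Called from R, the matrix passed to fillParams comes back updated, with no copy of the parameter vectors.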

diego-urgell commented 3 years ago

Also, I think this PR is complete. The package is working for all three changes involving the normal distribution. Several tests were performed using the changepoint package as the reference. @rkillick and I agreed that, later on, it would be convenient to substitute static values for the changepoint package results in order to avoid the dependency in the future.

In the tests, I included cases where there are no real changepoints in the data, and both packages compute the same theoretical changepoints. The only thing missing to test here is that the computed cost is the same. However, I will create those tests when the R interface is more advanced, since the coef method would be useful. Also, after the other distributions are implemented, I will start testing with real-world datasets.

I will merge this PR in order to open a new one. Is that ok @rkillick @tdhock? In the future I will try to make more focused PRs, since this one included many changes from different features.

rkillick commented 3 years ago

Great, well done on getting this far. Make sure to look at the likelihood values and not just the coefs. Also bear in mind that the changepoint package calculates the likelihood in R, not in C, so it uses different functions. Go ahead and merge so we can move towards the documentation and user-facing functions.