diego-urgell / BinSeg

GSOC 2021. R package that performs changepoint analysis using the Binary Segmentation algorithm. Supports several statistical distributions. The model is computed in C++ and then interfaced with R via the Rcpp package.
https://summerofcode.withgoogle.com/projects/5915303348797440
MIT License

Missing distributions #8

Closed. diego-urgell closed this 3 years ago.

diego-urgell commented 3 years ago

This PR will implement the three missing distributions:

diego-urgell commented 3 years ago

Hi @rkillick and @tdhock, in my last commit I implemented the Negative Binomial distribution. I created two tests using the gfpop package, and both packages compute the same change points. However, I am not sure this is an appropriate way to test this distribution. What do you suggest in order to test it?

rkillick commented 3 years ago

gfpop will give an exact solution to the problem rather than an approximate solution (as Binary Segmentation does). So if they are giving the same answer on your data sets then that is great, but it is not necessary that they do. I don't believe there is a negative binomial binary segmentation implementation - which is why this project exists :-) So just test on a few examples, and if you are able to use the same code but force the additional dispersion parameter to be 1, then you can compare the results with the Binary Segmentation Poisson case in the changepoint package. This may give you more confidence in the general model.
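
A minimal sketch of that check in R (the changepoint side only; the commented-out BinSeg call assumes the interface used in the test further down this thread):

    library(changepoint)

    # Simulated Poisson data with a single change in rate at the midpoint
    set.seed(1)
    data <- c(rpois(200, lambda = 4), rpois(200, lambda = 10))

    # Reference answer: Poisson binary segmentation from the changepoint package
    ref <- cpt.meanvar(data, test.stat = "Poisson", method = "BinSeg", Q = 1)
    cpts(ref)  # estimated changepoint location

    # ans <- BinSeg::binseg(data, "BS", "negbin", 1, 2)  # negbin result to compare, with r forced to 1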

codecov-commenter commented 3 years ago

Codecov Report

:exclamation: No coverage uploaded for pull request base (main@6e62154). The diff coverage is n/a.

:exclamation: Current head d5969da differs from pull request most recent head e476997. Consider uploading reports for the commit e476997 to get more accurate results.

@@           Coverage Diff           @@
##             main       #8   +/-   ##
=======================================
  Coverage        ?   87.40%           
=======================================
  Files           ?        8           
  Lines           ?      127           
  Branches        ?        0           
=======================================
  Hits            ?      111           
  Misses          ?       16           
  Partials        ?        0           

Continue to review full report at Codecov.

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data. Powered by Codecov. Last update 6e62154...e476997.

rkillick commented 3 years ago

87% is good starting code coverage, well done!

tdhock commented 3 years ago

About the Negative Binomial: Rebecca is right that in general the gfpop (and Segmentor3IsBack) packages give optimal solutions, which are not the same as binary segmentation. However, for the case of 2 segments (1 changepoint) the segmentation is the same. Again, rather than comparing with another package (which is hopefully correct but may have bugs for some reason), it would be better to just compute the cost and changepoints in R code and compare that to the result you get from C++.
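
As a sketch of that suggestion, the single-changepoint case can be brute-forced in a few lines of R given any per-segment cost, and the resulting split and total cost compared with the C++ output (the Gaussian cost below is only a stand-in; the negative binomial cost can be dropped in instead):

    # Brute-force search for the best single changepoint given a per-segment cost
    best_single_split <- function(x, seg_cost) {
      n <- length(x)
      total <- sapply(1:(n - 1), function(t) seg_cost(x[1:t]) + seg_cost(x[(t + 1):n]))
      list(changepoint = which.min(total), cost = min(total))
    }

    # Stand-in cost (Gaussian change in mean); replace with the negbin cost to test it
    gauss_cost <- function(x) sum((x - mean(x))^2)

    # best_single_split(data, gauss_cost)  # compare with the C++ binary segmentation answer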

diego-urgell commented 3 years ago

Hi @tdhock! Hope you are doing great :)

Background

In today's meeting, @rkillick and I talked about the Negative Binomial cost function. As I mentioned earlier, I tested the results against the gfpop package. Even though both packages produce similar changepoints (not always the same, but that is normal), there are some differences in the overall cost and in the parameter estimates that troubled us.

My current implementation of the cost function is the following:

    double costFunction(int start, int end){
        double lSum = this -> summaryStatistics -> getLinearSum(start, end);  // sum of the segment
        double mean = this -> summaryStatistics -> getMean(start, end);       // segment mean
        double varN = this -> summaryStatistics -> getVarianceN(start, end, false); // N times the ML variance
        int N = end - start + 1;
        if (varN <= 0) return INFINITY; // constant segment: the dispersion is undefined
        double var = varN/N;
        // Moment-based estimates of the dispersion r and the success probability p
        double r_dispersion = ceil(fabs(pow(mean, 2)/(var-mean)));
        double p_success = mean/var;
        // Cost as in Lindén and Mäntyniemi (2011): likelihood terms without the lgamma part
        return (lSum * log(1-p_success) + N * r_dispersion * log(p_success));
    }

As you can see, on every possible segment I compute the overdispersion parameter r, and then the probability of success p by using the following parameterization that depends on the mean and the variance (Lindén and Mäntyniemi, 2011):

[Screenshot: moment-based parameterization of the Negative Binomial, r = mean² / (variance − mean) and p = mean / variance]

Then, I calculate the cost using the equation described by Lindén and Mäntyniemi (2011) (this is a paper you sent me while I was writing my proposal), which is pretty similar to the log-likelihood except it ignores the lgamma terms (we had problems with this in the past too). The change points generated by using this cost function are accurate.
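
For cross-checking in plain R (as @tdhock suggested above), a direct transliteration of this cost looks roughly like the helper below (negbin_cost is just a throwaway name); it can be plugged into a brute-force single-changepoint search and compared with the C++ output:

    # R transliteration of the C++ cost above, for cross-checking results
    negbin_cost <- function(x) {
      N <- length(x)
      m <- mean(x)
      varN <- sum((x - m)^2)                # N times the ML variance, like getVarianceN
      if (varN <= 0) return(Inf)
      v <- varN / N
      r <- ceiling(abs(m^2 / (v - m)))      # moment estimate of the dispersion r
      p <- m / v                            # moment estimate of the success probability p
      sum(x) * log(1 - p) + N * r * log(p)  # cost without the lgamma terms
    }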

Test / Example

For example, consider the following test:

  data <- gfpop::dataGenerator(n=400, changepoints=c(0.50,  1), parameters=c(0.2, 0.65), type="negbin", size=50)
  graph <- gfpop::graph(
    gfpop::Edge(0, 1,"std", 0),
    gfpop::Edge(0, 0, "null"),
    gfpop::Edge(1, 1, "null"),
    gfpop::StartEnd(start = 0, end = 1)
  )
  ans <- BinSeg::binseg(data, "BS", "negbin", 1, 2)
  check_ans <- gfpop::gfpop(data = data, mygraph = graph, type = "negbin")
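
For reference, the gfpop answer can also be inspected programmatically (field names as returned by gfpop::gfpop; the corresponding fields of the BinSeg result depend on its still-evolving interface):

    check_ans$changepoints   # segment end indices found by gfpop
    check_ans$parameters     # per-segment parameter estimates
    check_ans$globalCost     # overall cost reported by gfpop
    # str(ans)               # inspect the BinSeg result for the analogous values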

The following results are produced by each of the two packages:

[Screenshots: BinSeg output (left) and gfpop output (right) for the example above]

So as you can see, the change points are the same! However, the overall cost is greatly different, and the parameter estimates also differ a little.

Questions

Checking the code of the gfpop package, I found that the overdispersion parameter is computed at the beginning of the algorithm. Windows of size 100 are set, and the dispersion parameter is computed for each of them (using the same parameterization of r that I showed above). Then the average overdispersion is computed, and the whole signal is divided by this value. Why is this done? Also, if the data is modified, then the overall cost will be different, right?
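
For concreteness, my reading of that pre-processing step is roughly the sketch below (the window size and estimator are as described above; this is a paraphrase, not gfpop's actual code):

    # Rough sketch of the described pre-processing (not gfpop's actual code)
    estimate_global_dispersion <- function(x, window = 100) {
      starts <- seq(1, length(x) - window + 1, by = window)
      disp <- sapply(starts, function(s) {
        w <- x[s:(s + window - 1)]
        m <- mean(w)
        v <- var(w)
        m^2 / (v - m)                      # same moment parameterization as above
      })
      mean(disp, na.rm = TRUE)             # "average overdispersion"
    }
    # x_scaled <- x / estimate_global_dispersion(x)  # whole signal divided by the estimate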

Something that seems correct in the BinSeg package is that the cost is effectively reduced when the signal is split. So we wonder whether gfpop uses a different implementation or mathematical expression for the cost. Could you shed some light on this?

Also, the probability of success parameter p is a little different. Do you know how it is obtained? I don't think the same expression I showed above is being used.

Overall, what are your suggestions regarding the differences between both packages? Should they be a matter of concern?

References:

Cleynen, A., Koskas, M., Lebarbier, E. et al. Segmentor3IsBack: an R package for the fast and exact segmentation of Seq-data. Algorithms Mol Biol 9, 6 (2014). https://doi.org/10.1186/1748-7188-9-6

Lindén, A., & Mäntyniemi, S. (2011). Using the negative binomial distribution to model overdispersion in ecological count data. Ecology, 92(7), 1414-1421. doi:10.1890/10-1831.1

diego-urgell commented 3 years ago

Hi @rkillick!

I still have some questions regarding the Poisson distribution cost function. As we discussed yesterday, I was supposed to differentiate the log-likelihood with respect to the rate parameter, set it equal to zero, and solve for the rate. Then I substitute this expression into the log-likelihood in order to get the cost function.

Here is my procedure and the resulting expression:

[Screenshot: derivation of the Poisson cost function by maximizing the log-likelihood over the rate parameter]
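
In plain math, the steps are as follows, writing $S = \sum_{i=\mathrm{start}}^{\mathrm{end}} x_i$ for the segment sum and $N$ for the segment length, dropping the $\log x_i!$ constants, and taking the cost as $-2$ times the maximized log-likelihood (a convention assumed here so that it is on the same scale as changepoint):

$$\ell(\lambda) = S\log\lambda - N\lambda, \qquad \frac{d\ell}{d\lambda} = \frac{S}{\lambda} - N = 0 \;\Longrightarrow\; \hat\lambda = \frac{S}{N}$$

$$\mathrm{cost} = -2\,\ell(\hat\lambda) = -2S\log\frac{S}{N} + 2S = 2S\bigl(\log N - \log S\bigr) + 2S$$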

So, as you can see, this is the exact cost function that I implemented in BinSeg. What troubles me is that, as we mentioned, it is different from the implementation in the changepoint package, which is

*cost = 2*x*(log(l)-log(x))+log(l);

Could you please tell me how that function is obtained? Is the expression on my screenshot correct?

Side note: I just found out that codecov was actually measuring coverage of the C++ code as well!! In fact, some of the lines reported as not tested are lines that should never be executed in the correct package flow (for example, calling getQuadraticSum on a linear Cumusm object, or returning false from a factory in case the self-registration fails). So I guess the test code coverage is fine :))

tdhock commented 3 years ago

About gfpop: as I recall, there is a constant variance parameter that is estimated using some heuristic before the segmentation; in Segmentor3IsBack the user can specify that parameter. You should ask @vrunge about the details of the cost function formulation. About the code coverage: "lines that should never be executed in the correct package flow" should arguably be removed from the package, because that may be confusing for readers of the code (why is this code present?). If you still want to include that code, you can use # nocov start and # nocov end comments to ignore those lines in the code coverage computation. Here is an example: https://github.com/Rdatatable/data.table/blob/master/R/fmelt.R#L11-L19
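
A minimal sketch of those markers in R, mirroring the linked data.table example (the function name is made up):

    # Lines between these markers are ignored by covr when computing coverage
    # nocov start
    unreachable_default <- function() {
      stop("internal error: this branch should never be reached")
    }
    # nocov end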

diego-urgell commented 3 years ago

Thanks @tdhock! Yes, the dispersion is constant and calculated at the beginning. I will ask @vrunge about this.

Also, regarding the lines that should not be executed: some virtual methods must be included in the parent class, despite being used only by a child class, in order to enable polymorphism. Thanks for the nocov tip!

diego-urgell commented 3 years ago

I am merging these changes for now. I am developing the R interface and the class structure, and I need some changes from this PR for everything to work together.