Closed: diego-urgell closed this pull request 3 years ago
Hi @rkillick and @tdhock, in my last commit I implemented the Negative Binomial distribution. I created two tests using the gfpop package, and both packages compute the same change points. However, I am not sure this is an appropriate way to test this distribution. What do you suggest?
gfpop will give an exact solution to the problem rather than an approximate solution (as Binary Segmentation does). So if they are giving the same answer on your data sets then that is great but it is not necessary that they do. I don't believe there is a negative binomial binary segmentation implementation - which is why this project exists :-) So just test on a few examples and if you are able to use the same code but force the additional dispersion parameter to be 1 then you can compare the results with the Binary Segmentation Poisson case in the changepoint package. This may give you more confidence in the general model.
:exclamation: No coverage uploaded for pull request base (`main@6e62154`). The diff coverage is `n/a`.
:exclamation: Current head d5969da differs from pull request most recent head e476997. Consider uploading reports for the commit e476997 to get more accurate results.

```
@@           Coverage Diff           @@
##             main      #8   +/-   ##
=======================================
  Coverage        ?   87.40%
=======================================
  Files           ?        8
  Lines           ?      127
  Branches        ?        0
=======================================
  Hits            ?      111
  Misses          ?       16
  Partials        ?        0
```

Continue to review the full report at Codecov.
Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 6e62154...e476997.
87% is good starting code coverage, well done!
about the Negative Binomial, Rebecca is right that in general the gfpop (and Segmentor3IsBack) packages give optimal solutions, which are not necessarily the same as binary segmentation's. However, for the case of 2 segments (1 changepoint) the segmentation is the same. Again, rather than comparing with another package (which is hopefully correct but may have bugs for some reason), it would be better to just compute the cost and changepoints directly and compare that to the result you get from C++.
Hi @tdhock! Hope you are doing great :)
In today's meeting, @rkillick and I talked about the Negative Binomial cost function. As I mentioned earlier, I tested the results against the gfpop package. Even though both packages produce similar changepoints (not always the same, but that is normal), there are some differences regarding the overall cost and the parameter estimation that troubled us.
My current implementation of the cost function is the following:
```cpp
double costFunction(int start, int end){
    double lSum = this->summaryStatistics->getLinearSum(start, end);
    double mean = this->summaryStatistics->getMean(start, end);
    double varN = this->summaryStatistics->getVarianceN(start, end, false);
    int N = end - start + 1;
    if (varN <= 0) return INFINITY;  // constant segment: no valid variance
    double var = varN / N;
    // Method-of-moments estimates: r = mean^2/(var - mean), p = mean/var.
    // The Negative Binomial requires overdispersion (var > mean); otherwise
    // p >= 1 and log(1 - p) is undefined.
    if (var <= mean) return INFINITY;
    double r_dispersion = ceil(pow(mean, 2) / (var - mean));
    double p_success = mean / var;
    // Negative log-likelihood with the lgamma terms dropped
    return (lSum * log(1 - p_success) + N * r_dispersion * log(p_success));
}
```
As you can see, on every possible segment I compute the overdispersion parameter r, and then the probability of success p, using the method-of-moments parameterization that depends on the mean and the variance (Lindén and Mäntyniemi, 2011): r = mean^2 / (var - mean) and p = mean / var.
Then, I calculate the cost using the equation described by Lindén and Mäntyniemi (2011) (a paper you sent me while I was writing my proposal), which is essentially the negative log-likelihood with the lgamma terms dropped (we had problems with those in the past too). The change points generated using this cost function are accurate.
For example, consider the following test:
```r
data <- gfpop::dataGenerator(n=400, changepoints=c(0.50, 1),
                             parameters=c(0.2, 0.65), type="negbin", size=50)
graph <- gfpop::graph(
  gfpop::Edge(0, 1, "std", 0),
  gfpop::Edge(0, 0, "null"),
  gfpop::Edge(1, 1, "null"),
  gfpop::StartEnd(start = 0, end = 1)
)
ans <- BinSeg::binseg(data, "BS", "negbin", 1, 2)
check_ans <- gfpop::gfpop(data = data, mygraph = graph, type = "negbin")
```
The following results are produced by each of the two packages:

| BinSeg | gfpop |
|---|---|
So as you can see, the change points are the same! However, the cost is greatly different and the parameters also differ a little.
Checking the code of the gfpop package, I found that the overdispersion parameter is computed once at the beginning of the algorithm. Windows of size 100 are set, and the dispersion parameter is computed for each of them (using the same parameterization of r that I showed above). Then the average overdispersion is computed, and the whole signal is divided by this value. Why is this done? Also, if the data is modified then the overall cost will be different, right?
Something that seems correct in the BinSeg package is that the cost is effectively reduced when the signal is split. So we wonder whether gfpop uses a different implementation or mathematical expression for the cost. Could you shed some light on this?
Also, the probability of success parameter p is a little different. Do you know how it is obtained? I don't think the same expression I showed above is being used.
Overall, what are your suggestions regarding the differences between both packages? Should they be a matter of concern?
References:
Cleynen, A., Koskas, M., Lebarbier, E. et al. Segmentor3IsBack: an R package for the fast and exact segmentation of Seq-data. Algorithms Mol Biol 9, 6 (2014). https://doi.org/10.1186/1748-7188-9-6
Lindén, A., & Mäntyniemi, S. (2011). Using the negative binomial distribution to model overdispersion in ecological count data. Ecology, 92(7), 1414-1421. doi:10.1890/10-1831.1
Hi @rkillick!
I still have some questions regarding the Poisson distribution cost function. As we discussed yesterday, I was supposed to differentiate the log-likelihood with respect to the rate parameter, equate it to zero, and solve for the rate, then substitute this expression into the log-likelihood in order to get the cost function.
Here is my procedure and the resulting expression:
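In case the screenshot does not render here, this is a sketch of that derivation in LaTeX (reconstructed from the description above, so the constants I drop are my own bookkeeping):

```latex
% Poisson log-likelihood for a segment y_1,\dots,y_n with x = \sum_{i=1}^{n} y_i:
\ell(\lambda) = x \log\lambda - n\lambda - \sum_{i=1}^{n} \log(y_i!)

% Differentiate with respect to \lambda and equate to zero:
\frac{d\ell}{d\lambda} = \frac{x}{\lambda} - n = 0
\quad\Longrightarrow\quad
\hat{\lambda} = \frac{x}{n}

% Substitute \hat{\lambda} back (the factorial term depends only on the data):
\ell(\hat{\lambda}) = x \log\frac{x}{n} - x + \text{const}

% Twice the negative log-likelihood as a segment cost:
-2\,\ell(\hat{\lambda}) = 2x\left(\log n - \log x\right) + 2x + \text{const}
```

Since the segment sums x add up to the total sum of the data, the 2x term is constant over any segmentation and can be dropped, which leaves exactly the 2*x*(log(l)-log(x)) part of the changepoint expression (with l the segment length). Where the extra log(l) comes from is exactly my question; my guess is that it is a segment-length penalty term folded into the cost, but I am not sure.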
So, as you can see, this is the exact cost function that I implemented in BinSeg. What troubles me is that, as we mentioned, it is different from the implementation in the changepoint package, which is
```cpp
*cost = 2*x*(log(l)-log(x))+log(l);
```
Could you please tell me how that function is obtained? Is the expression on my screenshot correct?
Side note: I just found out that codecov was actually testing coverage of the C++ code as well! In fact, some of the lines reported as not tested are lines that should never be executed in the correct package flow (for example, calling getQuadraticSum on a linear Cumsum object, or returning false in a factory in case the self-registration fails). So I guess the test code coverage is fine :))
about gfpop as I recall there is a constant variance parameter that is estimated using some heuristic before the segmentation. in Segmentor3IsBack the user can specify that parameter. you should ask @vrunge about the details of the cost function formulation.
about the code coverage, "lines that should never be executed in the correct package flow" should arguably be removed from the package, because they may be confusing for readers of the code (why is this code present?). if you still want to include that code, you can use `# nocov start` and `# nocov end` comments to exclude those lines from the code-coverage computation. here is an example: https://github.com/Rdatatable/data.table/blob/master/R/fmelt.R#L11-L19
Thanks @tdhock! Yes, the dispersion is constant and calculated at the beginning. I will ask @vrunge about this.
Also, regarding the lines that should not be executed: some virtual methods must be declared in the parent class, despite being used only by a child class, in order to enable polymorphism. Thanks for the `nocov` tip!
I am merging these changes for now. I am developing the R interface and the class structure, and I need some changes from this PR for everything to work together.
This PR will implement the three missing distributions: