diego-urgell / BinSeg

GSOC 2021. R package that performs changepoint analysis using the Binary Segmentation algorithm. Supports several statistical distributions. The model is computed in C++ and then interfaced with R via the Rcpp package.
https://summerofcode.withgoogle.com/projects/5915303348797440
MIT License
4 stars 0 forks source link
changepoint-detection cpp gsoc-2021 r rcpp statistics

R-CMD-check Codacy Badge Codecov test coverage

BinSeg

An R package for efficient changepoint analysis using the Binary Segmentation Algorithm, with support for several statistical distributions and types of change.

Table of Contents

Background

When dealing with time series, sometimes the trend in a signal seems to suddenly change at a certain point. Assuming that the whole data can be modelled by a particular distribution, then the changepoint splits the dataset into two different segments, each of them with distinct distribution parameters.

The objective of changepoint detection algorithms is to compute the best possible segmentation of a signal, and provide an estimation of the distribution parameters for each partition. A distribution specific loss function, usually the negative log-likelihood, is used to determine the optimal changepoints (Truong et al, 2020).

There are many algorithms to perform changepoint analysis. In practice, Binary Segmentation is the fastest one (Killick et al, 2012), and provides a heuristic model (i.e. is not exact). It is widely used for large datasets.

This package provides a fast and efficient implementation of Binary Segmenation, with support for several distributions. The changepoint analysis is performed in C++, and Rcpp is used to interface with R. An object-oriented design makes the package seamlessly extensible in terms of algorithms and distributions.

The development of this project was funded by Google Summer of Code 2021. My mentors were Dr. Rebecca Killick (@rkillick) and Dr. Toby Hocking (@tdhock). Both of them provided essential guidance and support.

Package Usage

This is a quick walk-trough of how the BinSeg package can be used to perform changepoint analysis. For a detailed documentation, please see the official manual.

Installation

First, it is necessary to install the package. Currently, the package must be installed from GitHub. It requires at least R 4.0.

if(!require(remotes)) install.packages("remotes")
remotes::install_github("diego-urgell/BinSeg")

Compute a changepoint model

For a quick example, create a random vector using normal distribution with change in mean and variance. In this case, the changepoint locations are not visually obvious.

set.seed(100)
data <- c(rnorm(25, 100, 50),
          rnorm(50, 65, 70),
          rnorm(30, 80, 40),
          rnorm(60, 45, 60),
          rnorm(25, 20, 80))
base::plot(data, type="l")

base-data

Then, perform a changepoint analysis by using the BinSegModel function. This function will return a BinSeg object, which has several methods to visualize and interact with the changepoint model. The cpts and logLik methods provide the index and overall cost (-2*logLik) for each changepoint from 1 up to the selected numCpts ordered by optmality; however, the first value at both vectors considers the whole signal.

changepoint_model <- BinSegModel(data=data, algorithm="BS", distribution="meanvar_norm", numCpts=4, minSegLen=2)
BinSeg::cpts(changepoint_model)
BinSeg::logLik(changepoint_model)
[1]  190  18 104 175  75
[1] 4210.436 4151.478 4109.260 4078.576 4058.294

To get more information about the available algorithms and distributions, call BinSegInfo(). At the moment, six distributions are supported.


Visualize the model

To visualize the model, there are two available functions. The plot function graphs the changepoints as vertical lines on top of the original data, for every model from 1 up to numCpts. It uses a facet so that each model is on a separate graph (labeled by number of segments). It is possible to indicate which models to graph by providing a vector with the desired number of segments.

BinSeg::plot(changepoint_model)

Rplot01

On the other hand, plotDiagnostic graphs the overall cost of the model as more changepoints are consideed, starting with the whole signal and up to numCpts.

BinSeg::plotDiagnostic(changepoint_model)

Rplot02

Parameter information and model validation

To get information about the parameters for a particular model (or set of models) use the coef method. It returns a data.table with parameter estimation for every segment in the selected models. In the example, only the model with five segments is considered (if no argument is provided, every model is considered).

 BinSeg::coef(changepoint_model, 5L)
   segments start end      mean  variance
1:        5     1  18 102.11350  533.1086
2:        5    19  75  71.87952 5696.4282
3:        5    76 104  77.62365 1894.5617
4:        5   105 175  40.18724 1827.2455
5:        5   176 190  17.65714 6596.6811

To validate the model, use the resid function, which computes the residuals of the model considering the estimated parameters and returns a vector of the same length as the input dataset.

Quality Assurance

Several checks were performed to ensure that the implementation of the algorithm was correct, and that the package is portable and consistent.

Check Description Badge
Continuous Intgeration (CI) The package was checked in three platforms (Windows, Ubuntu, MacOS) with R version 4.1. Each one of the checks built the package and performed an R CMD check --as-cran. This ensures it meets all the CRAN requirements for submission. GitHub Actions Build Status Badge R-CMD-check
Code Quality A static code analyzer was used to make sure the code follows best practices and is readable for other developers. Codacy Code Quality Badge Codacy Badge
Tests Coverage The testthat package was used for automated tests. More than 85 tests were written, covering every functionality of the package. The tests were executed at each CI build. Codecov Tests Coverage Badge Codecov test coverage

GSOC 2021

For ten weeks, during 2021 summer, I worked every weekday on this package as my Google Summer of Code project. It was a great experience, since I had never worked on an R package and then I created one from scratch. I learnt a lot of things during the development. Before this project, I had very little knowledge of statistics, but I understood more things along the way. I also got to know new patterns for object-oriented design in C++. Furthermore, I gained practical experience with software development best practices, such as planning, documenting, testing, and using CI. I also have a very clear idea of how Open Source develoment works, and I am excited to contribute to other projects. Most importantly, I experienced the development process of bigger project (compared to my past projecs), and how many challenges and obstacles appear and must be overcome to create a succesful final product.

Acknowledgements

I feel very grateful to my mentors Dr. Rebecca Killick and Dr. Toby Hocking. They provided very valuable feedback, suggestions, and support. I learnt a lot on our meetings. Here is an account of each of them.

I am also grateful to the Google and the GSOC administrators, who make possible this amazing experience.

Achieved goals

Challenges

During the development, several challenges were faced:

Future Development

The package is fully functional right now, and the realease for GSOC can be installed and used by anyone. The expected goals for this summer were pretty much fulfilled (with the exception of the negbin function validation), and some extra functionality was implemented. However, there are many things that can (and will) be done in the future: