diego-urgell / BinSeg

GSOC 2021. R package that performs changepoint analysis using the Binary Segmentation algorithm. Supports several statistical distributions. The model is computed in C++ and then interfaced with R via the Rcpp package.
https://summerofcode.withgoogle.com/projects/5915303348797440
MIT License
4 stars 0 forks source link

why are mid/invalidates*/cost in param_mat? #20

Open tdhock opened 3 years ago

tdhock commented 3 years ago

hi @diego-urgell in Algorithms.cpp I see

             this -> param_mat[i] = i + 1;
             this -> param_mat[sep + i] = optCpt -> mid + 1;
             this -> param_mat[sep * 2 + i] = optCpt -> invalidatesIndex + 1;
             this -> param_mat[sep * 3 + i] = optCpt -> invalidatesAfter;
             this -> param_mat[sep * 4 + i] = param_mat[sep * 4 + i - 1] - optCpt -> bestDecrease;

why are all of these things being stored in the same param_mat data structure? It results in lots of magic numbers (2, 3, 4) which are potentially confusing, especially some of the values are different types (for example mid is int but bestDecrease is double). To clarify this code, can these things be stored in separate members? then we could use names rather than numbers to reference them. (for example the name invalidatesIndex would be much easier to understand than the index 2)

tdhock commented 3 years ago

see https://github.com/tdhock/binsegRcpp/blob/master/src/interface.cpp for an alternative that uses names instead of numeric indices, and Rcpp::List for return value of the interface function. (Rcpp::DataFrame could be used as well)

tdhock commented 3 years ago

btw for reading about why to avoid the "magic numbers" anti-pattern see https://en.wikipedia.org/wiki/Magic_number_(programming)#Unnamed_numerical_constants

diego-urgell commented 3 years ago

I agree that it may be confusing. I definitely have to add some comments to the code so that it is clear why I did his.

At the beginning I though of using Rcpp::List. However, this requires to define vectors for every slot that will be added to the list, and then pass references of the vector to the C++ code. This is unfeasible here because I don't know how many slots I will return to C++ (since different distributions have a different number of parameters changing). Therefore, I decided to use Rcpp::NumericMatrix because I can store the data for several values (bestDecrease, invalidatesIndex, invalidatesAfter, mean, ...) while only defining a single value in the code (param_mat).

This was basically an approach to make the returned number of parameters dynamic. I am happy to look at other suggestions on how to solve this problem. But I think using a List could no be the most convenient option for this use case. What do you think?

tdhock commented 3 years ago

you are right that "different distributions have a different number of parameters changing." so I think the param_mat should contain only those segment-specific real-valued model parameters (mean and/or variance) and everything else should be a separate arguments/variables/members with informative names and potentially different types (for example invalidates* should be int).

diego-urgell commented 3 years ago

Right, the other "general" parameters could be separate vectors and then I can add them to the return list, along with the matrix that contains the variable parameter estimations for each distribution.

This is a good option, but it is not a "quick fix" since I have to change not only the C++ code, but also the R interface, mainly BinSegModel and some BinSeg class methods. Would you mind if I do this on a few weeks? Right now I am very busy at university since next week I have a couple of crucial deliveries and presentations. I will get to this as soon as I finish with that, is it okay?

tdhock commented 3 years ago

sure, no rush. I'm just trying to help you make your code easier to understand/maintain using software engineering best practices.

diego-urgell commented 3 years ago

Thanks @tdhock 😄