Package 'gghmm2' feedback

Purpose of package

The purpose of the package is quite clear. As promised, it handles the modeling and analysis of hidden Markov models (HMMs). It does not necessarily fill a gap, nor does it complement already existing packages.

Completeness

The package contains the following functions for modeling and analyzing HMMs:

HMM: A class factory that produces an S3 object of class HMM.
em: A function that fits a HMM given observations.
forecast: A function that returns the probability of observing a specific value at a given time (out of sample) conditioned on a HMM and observations.
local_decoder: Calculates the most probable hidden state at each time point, conditioned on the emissions and a HMM.
viterbi: Finds the most likely sequence of hidden states conditioned on emissions and a HMM.
state_prob: Calculates the probability of being in a specific state at a given time within the sample, conditioned on emissions and a HMM.

Together, these functions allow one to fit a hidden Markov model with custom marginal distributions to observed emissions. While the functions may not produce the usual fitting statistics, they allow for flexible fitting and posterior analysis. As far as I can tell, the description of the package does not mention the tidyverse, even though the class HMM is a subclass of tibble, a class from that collection of packages.

Code quality and sophistication

The package installed without problems. It struggles when the initial parameters of the fitting process are set close to one another or when fitting with an unusual marginal distribution (issues raised on GitHub). There are no tests in the package, as far as I can tell.

As mentioned, the HMM function returns an S3 object of class HMM. The only method this class has is print, so S3 is used, but not to its full potential. Overall, the package makes use of R's vectorized operations and avoids loops for the most part, which is a plus. Still, there are plenty of places where the code could be sped up. In the function viterbi, one could avoid calling do.call on every iteration by computing a matrix of emission probabilities once, before entering the for loop. Furthermore, the functions forward and backward are not numerically stable: $\beta_t$ underflows to $0$ for small $t$ and $\alpha_t$ underflows to $0$ for large $t$. This is a problem since the function em relies on the forward and backward probabilities to fit the model. Consider computing everything on the log scale in the EM algorithm instead.
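To make the log-scale suggestion concrete, here is a minimal sketch of a forward pass that works entirely with log probabilities and computes the emission matrix once before the loop. The helper names (log_sum_exp, log_forward), the Poisson emissions, and all parameter values are my own assumptions for illustration. None of this is taken from the package.

```r
# Log-sum-exp: computes log(sum(exp(x))) without overflow or underflow.
log_sum_exp <- function(x) {
  m <- max(x)
  if (!is.finite(m)) return(m)
  m + log(sum(exp(x - m)))
}

# Log-scale forward algorithm for an m-state HMM (hypothetical helper).
#   log_emis:  T x m matrix of log p(x_t | state j), computed once up front
#   log_gamma: m x m matrix of log transition probabilities
#   log_delta: length-m vector of log initial probabilities
log_forward <- function(log_emis, log_gamma, log_delta) {
  n <- nrow(log_emis)
  m <- ncol(log_emis)
  log_alpha <- matrix(-Inf, n, m)
  log_alpha[1, ] <- log_delta + log_emis[1, ]
  for (t in seq_len(n)[-1]) {
    for (j in seq_len(m)) {
      log_alpha[t, j] <- log_sum_exp(log_alpha[t - 1, ] + log_gamma[, j]) +
        log_emis[t, j]
    }
  }
  log_alpha
}

# Toy example: two-state Poisson HMM with assumed parameters.
set.seed(1)
x <- rpois(2000, lambda = 20)
lambda <- c(10, 30)
gamma <- matrix(c(0.9, 0.1, 0.2, 0.8), 2, 2, byrow = TRUE)
delta <- c(0.5, 0.5)

# All emission log-densities in one vectorized call, before any loop.
log_emis <- outer(x, lambda, function(obs, rate) dpois(obs, rate, log = TRUE))

log_alpha <- log_forward(log_emis, log(gamma), log(delta))
log_lik <- log_sum_exp(log_alpha[nrow(log_alpha), ])
```

With 2000 observations, the raw likelihood exp(log_lik) is far below the smallest representable double, while log_lik itself stays finite. The same precomputed log_emis matrix could be reused in a log-scale Viterbi or backward pass.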

Documentation and data

For the uninitiated user, say one with little to no knowledge of HMMs, the descriptions of the functions may seem obscure. Consider adding some LaTeX specifying what exactly a HMM is. A user with prior knowledge of HMMs would have no problem understanding the descriptions. There is a vignette in the package, which demonstrates how to fit a HMM to the included data eartheqakes.rda. The vignette justifies the included functions with regard to modeling and analysis of the included data.
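A definition along the following lines might be enough for the documentation. The notation ($m$, $T$, $\delta$, $\Gamma$, $p_j$) is my own assumption and is not taken from the package; only $\alpha_t$ and $\beta_t$ follow the symbols used in this review. An $m$-state HMM consists of an unobserved Markov chain $C_1, \dots, C_T$ and observations $X_1, \dots, X_T$ that are conditionally independent given the states:

$$
\Pr(C_1 = i) = \delta_i, \qquad
\Pr(C_{t+1} = j \mid C_t = i) = \gamma_{ij}, \qquad
X_t \mid C_t = j \sim p_j(\cdot), \qquad i, j \in \{1, \dots, m\}.
$$

The forward and backward probabilities are then

$$
\alpha_t(j) = \Pr(X_1 = x_1, \dots, X_t = x_t, C_t = j), \qquad
\beta_t(i) = \Pr(X_{t+1} = x_{t+1}, \dots, X_T = x_T \mid C_t = i),
$$

and the likelihood of the observed emissions is $L_T = \sum_{j=1}^{m} \alpha_T(j)$.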

Conclusion and suggestions

Overall, the package does a good job of fitting HMMs using the EM algorithm. The posterior analysis is quick and easy to use and provides great insights. There are places that could use improvement. Here are some suggestions:

Either do not use S3 objects or add more methods to the class HMM.
Speed up the functions by avoiding repeated function calls inside the for loops. Consider instead computing the emission probabilities once at the start of each function.
Calculate the log-forward and log-backward probabilities instead of the actual probabilities to avoid underflow. Doing this yields better results in the EM algorithm. See https://en.wikipedia.org/wiki/LogSumExp.
Consider adding an option to provide maximum likelihood estimators instead of numerically optimizing the log-likelihood function in the EM algorithm. Adding this option will, in many cases, yield a significant speed increase as many distributions have closed-form MLEs.
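
As an illustration of the last suggestion, here is a sketch of what a closed-form M-step could look like for Poisson state-dependent distributions, next to a numerical version of the same update. The function names and the toy weights are hypothetical. I am only assuming that the E-step yields a matrix of posterior state probabilities; how em organizes this internally may differ.

```r
# Closed-form weighted MLE for Poisson state-dependent distributions.
#   x: observations
#   u: T x m matrix of posterior state probabilities from the E-step
poisson_m_step <- function(x, u) {
  colSums(u * x) / colSums(u)
}

# Numerical alternative: maximize the weighted log-likelihood per state.
poisson_m_step_numeric <- function(x, u) {
  apply(u, 2, function(w) {
    optimize(function(rate) sum(w * dpois(x, rate, log = TRUE)),
             interval = c(1e-6, max(x) + 1), maximum = TRUE)$maximum
  })
}

set.seed(42)
x <- c(rpois(100, 5), rpois(100, 20))

# Assumed posterior weights, for illustration only.
# In a real EM iteration these come from the forward and backward passes.
p <- rep(c(0.9, 0.1), each = 100)
u <- cbind(state1 = p, state2 = 1 - p)

fitted_closed <- poisson_m_step(x, u)
fitted_numeric <- poisson_m_step_numeric(x, u)
```

Both versions should agree up to the tolerance of optimize, but the closed-form update needs a single pass over the data per state instead of an iterative search.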
