bmcclintock / momentuHMM

Maximum likelihood analysis Of animal MovemENT behavior Using multivariate Hidden Markov Models

Wishes - parallel + NAs #158

Closed MarieAugerMethe closed 3 months ago

MarieAugerMethe commented 2 years ago

Hi Brett and Theo! I love your package, and we are just doing a workshop on how to use it with marine data, see: https://github.com/MarieAugerMethe/DFO_HMM_2022

I have two little wishes that would be great to implement if you have time one day.

Thanks a million for the wonderful package!

bmcclintock commented 2 years ago

Hi Marie, I hope the workshop went well. And thanks for the suggestions.

Thanks again for the suggestions!

MarieAugerMethe commented 2 years ago

Brett, you are amazing!

Great for wish 2!!

For wish 1, I think it would indeed need a few more arguments, e.g. start time and frequency. Then yes, you'd need something under the hood that creates the sequence of desired observation times. We currently do this outside the function, as you are suggesting, but I'm thinking that for novice HMM users and R coders it would be good to have the NAs inserted under the hood, since creating the sequence of desired times, while easy for us, is not necessarily straightforward for everyone ;). If I get a moment, I can try to clone your repo and make a suggestion ;)

bmcclintock commented 2 years ago

OK I've added the argument ncores to fitHMM. The way it works is that it first fits the model (based on the initial parameter values provided in Par0, beta0, and delta0) and then attempts the retryFits in parallel using ncores workers. After the retryFits are completed in parallel, the model with the lowest negative log-likelihood is then returned by fitHMM. Note that when retryFits attempts are performed in parallel, each attempt uses random perturbations of the initial model fit (i.e. they are not iteratively updated as when ncores=1).

This feature is currently only available in the develop branch (momentuHMM >= 2.0.0):

library(remotes)
install_github("bmcclintock/momentuHMM@develop")

And here's an example of its usage:

library(momentuHMM)
ncores <- 4 # number of cores for parallel processing
retryFits <- 12 # number of attempts to improve likelihood using random perturbations
nbStates <- 2
stepDist <- "gamma"
angleDist <- "vm"
mu0 <- c(20,70)
sigma0 <- c(10,30)
kappa0 <- c(1,1)
stepPar <- c(mu0,sigma0)
anglePar <- kappa0
formula <- ~cov1+cos(cov2)

m <- fitHMM(data=example$m$data,nbStates=nbStates,dist=list(step=stepDist,angle=angleDist),
            Par0=list(step=stepPar,angle=anglePar),formula=formula,retryFits=retryFits,ncores=ncores)

Note also that this will probably only be faster for more complex models that take a while to fit. For more complex models, the fastest way to explore the likelihood surface with lots of different initial values would probably be to start with the current best parameter estimates for Par0, beta0, and delta0 and set hessian=FALSE in nlmPar (or control if optMethod is not "nlm"). Please check it out when you get a chance -- any feedback is of course welcome!
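For example, something along these lines (just a sketch reusing the objects from the example above; getPar0 simply extracts the current best estimates from the fitted model m):

# sketch: restart from the current best estimates and skip the Hessian to speed up exploration
best <- getPar0(m)
m2 <- fitHMM(data=example$m$data,nbStates=nbStates,dist=list(step=stepDist,angle=angleDist),
             Par0=best$Par,beta0=best$beta,delta0=best$delta,formula=formula,
             retryFits=retryFits,ncores=ncores,nlmPar=list(hessian=FALSE))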

Regarding wish 1, please go for it! However, I might suggest creating a separate data pre-processing function (to be used before calling prepData) as it would probably be pretty complicated to directly add this feature into the architecture of prepData.
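Just to sketch the idea (purely illustrative and not part of momentuHMM; insertNA, the column names, and the default interval are made up for the example):

# illustrative only: pad each animal's track with NA locations at a regular time step before calling prepData
insertNA <- function(rawData, freq = "1 hour") {
  do.call(rbind, lapply(split(rawData, rawData$ID), function(d) {
    fullTimes <- data.frame(ID = d$ID[1],
                            time = seq(min(d$time), max(d$time), by = freq))
    merge(fullTimes, d, by = c("ID", "time"), all.x = TRUE) # unmatched times get NA coordinates
  }))
}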

RRTogunov commented 10 months ago

This may be by design, but when using retryFits and ncores, m$conditions$initialValues is taken from the first fitting of the model in retryFitsParallel (specifically, the fit on line 33 is used to generate Par0 on line 42, which is fed into fitHMM on lines 53/54), and this then gets stored within fitHMM under conditions$initialValues.
This was a problem for me because some of my more complex models wouldn't move away from Par0 (I think genuinely due to Par0, complexity, optMethod, or controls). The same issue seemed to happen when troubleshooting very simple models, but that turned out to be because retryFits was run in parallel.

bmcclintock commented 10 months ago

I'm not sure I totally understand the issue. You are correct that initialValues for the initial model fit is returned, but as long as retrySD is not set to zero for all parameters, at least some of the parameters are randomly perturbed during model refitting in lines 50-63 (technically, the model is being "fit" again at the MLEs of the initial fit before perturbing them and then refitting at the perturbed parameter values). If any of the retryFits improves the log likelihood, then the MLEs of the best fit are returned.

Unless I'm missing something (in which case please provide a minimal reproducible example), my best guess is that the random perturbations are not improving on the initial fit, in which case it might make sense to adjust retrySD or other options (e.g. optMethod). Or perhaps the model simply appears to have converged?
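For example (values purely illustrative; retrySD is on the working scale):

# illustrative only: widen the random perturbations and/or try a different optimizer
m <- fitHMM(data=example$m$data,nbStates=2,dist=list(step="gamma"),Par0=list(step=c(20,70,10,30)),
            retryFits=5,retrySD=list(step=2),optMethod="Nelder-Mead")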

RRTogunov commented 10 months ago

Thanks for the quick response. The issue is that different initialValues are stored depending on whether retryFits was run sequentially or in parallel. Specifically, when fitted sequentially, the Par0 supplied are stored, but in parallel, as you said, the values of the MLEs before perturbing are stored. Again, this may be desired behaviour, but it took me a while to debug why the initial values weren't changing.

Here is a small reproducible example:

library(momentuHMM)  # Latest version of develop branch
set.seed(1)
data <- example$m$data
Par0 <- list(step=c(c(20,20),c(20,20)))
m1 <- fitHMM(data=data,nbStates=2,dist=list(step="gamma"),Par0=Par0)
m2 <- fitHMM(data=data,nbStates=2,dist=list(step="gamma"),Par0=Par0,retryFits=2)
m3 <- fitHMM(data=data,nbStates=2,dist=list(step="gamma"),Par0=Par0,retryFits=2,ncores=2)

identical(m1$conditions$initialValues$Par, Par0)  # TRUE
identical(m2$conditions$initialValues$Par, Par0)  # TRUE
identical(m3$conditions$initialValues$Par, Par0)  # FALSE

# exploring change in estimated Par from initial Par
plot(m2$conditions$initialValues$Par$step, getPar0(m2)$Par$step) 
plot(m3$conditions$initialValues$Par$step, getPar0(m3)$Par$step) 

bmcclintock commented 10 months ago

Thanks for the example. I still don't understand why this is an issue/problem, but initialValues will usually be different when doing retryFits in parallel vs sequential (unless they are both initialized from a local maximum or the global maximum). This is because when done in parallel, fitHMM first fits the initial model and then uses the MLEs as the initial values for the parallel fits (which are then perturbed).

RRTogunov commented 10 months ago

My assumption would have been that initialValues would be the initial values specified by the user before fitting the model. This seems to be the case when retryFits = 0 or ncores = 1. I would have expected retryFits in parallel and sequential to use the same method for perturbing initial Pars (either both using the user's Par0 or both using the Pars from the initial fit). The problem was I didn't know the behaviour is different when refitting in parallel (as you mentioned, perturbing the Pars from the initial fit rather than perturbing the user-specified initial Pars), which gave me some confusing results when comparing initial Pars with estimated ones.

bmcclintock commented 10 months ago

They both fit an initial model and then perturb the MLEs from the initial fit. When done in parallel, only the MLEs from the initial fit are perturbed. When done sequentially, the MLEs from the fit with the smallest negative log-likelihood are perturbed (or, if the initial fit fails, it will perturb the user-specified initial values). So they both use the initial values provided by the user for the initial fit, and the initial values provided by the user are only perturbed if: 1) the initial fit fails when done sequentially; or 2) the initial values happen to be at a local maximum or the global maximum. When done in parallel, fitHMM is called on each worker (and hence the initialValues returned are from the initial fit instead of the initial values provided by the user). It was set up this way to ensure the user-specified initial values yield a finite log-likelihood before sending the job to the workers.
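If you need the user-supplied starting values for later comparison with the estimates, one simple workaround (sketch only) is to keep them in a separate object:

# sketch: keep the user-supplied starting values alongside the fitted model
userPar0 <- list(step=c(20,20,20,20))
m <- fitHMM(data=example$m$data,nbStates=2,dist=list(step="gamma"),Par0=userPar0,retryFits=2,ncores=2)
plot(userPar0$step, getPar0(m)$Par$step) # user starting values vs estimates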

RRTogunov commented 10 months ago

Thank you for the clarification. That implementation of perturbations using the MLEs from an initial fit makes sense. I suppose I still don't understand why, when fit sequentially, it is the user-specified initial values that are saved to .$conditions$initialValues (this is the expected behaviour for me), but when fit in parallel, the MLE values are saved to .$conditions$initialValues, especially if in both cases they perturb the MLEs from the initial fit. In my example code above, I expected .$conditions$initialValues$Par to be Par0 in all instances.