eveilyeverafter / HMMancestry

R package using the Forward-Backward algorithm to infer genotypes, recombination hotspots, and gene conversion tracts from low-coverage next-generation sequence data.
1 stars 0 forks source link

Maximum likelihood estimation for p.assign and p.trans parameters needed #16

Closed eveilyeverafter closed 9 years ago

eveilyeverafter commented 9 years ago

We talked earlier about creating a (better) method for estimating jointly the two parameters of the forward-backward function. Systematically going through parameter spaces can be really inefficient. What methods do you have in mind?

eveilyeverafter commented 9 years ago

Hill climbing methods would work well with the data that we have. However, other user data might not behave as nicely so having an option for systematic exploration of p.trans and probably of sequence error would be nice to implement. The problem is that this method is rather slow. To this end I'm going to work on implementing the FB function call into C++ and call the function from within R.

eveilyeverafter commented 9 years ago

This is a nice resource for using C++ code in R: http://adv-r.had.co.nz/Rcpp.html

eveilyeverafter commented 9 years ago

And to let Rcpp play nice with roxygen2 see this blog post: http://ironholds.org/blog/adding-rcpp-to-an-existing-r-package-documented-with-roxygen2/

eveilyeverafter commented 9 years ago

Code has been implemented. The C++ variant is between 50 and 100 times faster than the R variant when comparing 100 runs using microbenchmark. The results are the same save the R variant provides 1 more significant digit than the C++ variant.

During the updating we switched the algorithms (and simulations) so that p.trans is no longer used. Instead of p.trans (i.e., the per base pair transition probability) we have the "scale" parameter. scale is used to rescale the physical distance between snps to mapping distance. The mapping distance is used during the generation of the hmm transition matrix during both the forward and the backward algorithms; it is used in the haldane function.

For both the R variant and the C++ variant of the FB algorithm, the maximum lnL can be estimated for each parameter combination by summing up the lnL for each tetrad, spores, and chromosomes and picking the parameter combination with the maximum lnL.