WGLab / RepeatHMM

a hidden Markov model to infer simple repeats from genome sequences

Reproducibility Question #12

Closed andyb3 closed 6 years ago

andyb3 commented 6 years ago

Hi. Thanks for making this great tool. We have noticed that when we run the same BAM file through repeatHMM multiple times, the repeat counts in the output vary slightly from run to run. Which step(s) of the process introduce this variability, and do you know of any settings we can tweak to get reproducible results?

Happy to share the commands we are using if that helps.

Thanks, Andy

liuqianhn commented 6 years ago

Hi @andyb3, thank you for your interest in our tool, and sorry for the late reply.

The slight difference is most likely caused by GaussianMixture, which is used to detect peaks in a list of values. When the values do not form a sharp Gaussian peak, the detected peaks can differ slightly from run to run (for distributions with a very clear peak, GaussianMixture generally detects the same peak each time). This happens because GaussianMixture needs a random initialization. One way to avoid the difference is to set a seed for the random initialization. Feel free to let me know how this slight difference affects your analysis. Thank you.
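To illustrate the point about seeding, here is a minimal sketch. It assumes the GaussianMixture mentioned above is scikit-learn's `sklearn.mixture.GaussianMixture`, which the thread does not state explicitly. The data are synthetic, and the helper `fit_peaks` is a hypothetical name, not part of RepeatHMM.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# Synthetic stand-in for repeat counts: two broad, overlapping clusters,
# i.e. a distribution without a very sharp peak. Not real RepeatHMM data.
rng = np.random.default_rng(42)
counts = np.concatenate([rng.normal(20, 4, 100), rng.normal(28, 4, 100)])
X = counts.reshape(-1, 1)  # scikit-learn expects a 2-D array


def fit_peaks(data, seed):
    """Fit a two-component mixture and return the sorted component means.

    Hypothetical helper for illustration only.
    """
    gmm = GaussianMixture(n_components=2, random_state=seed)
    gmm.fit(data)
    return np.sort(gmm.means_.ravel())


# With the same seed, the random initialization is the same on every call,
# so the fitted peak positions agree between runs.
peaks_a = fit_peaks(X, seed=1)
peaks_b = fit_peaks(X, seed=1)
print(peaks_a, peaks_b)
```

Passing `random_state` fixes the initialization for that one model only, without touching any other random number generation in the program.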

andyb3 commented 6 years ago

Thanks for your response! That is very helpful. Is there a way to supply a seed to the GaussianMixture through the repeatHMM settings or would this require tweaking of the code?

liuqianhn commented 6 years ago

Hi @andyb3, I am afraid there is currently no parameter for setting this seed. If you want to remove the difference, a simple way is to add a seed setting (for example, np.random.seed(1)) just after line 157 of bin/scripts/myGaussianMixtureModel.py. Feel free to let me know if it does not work, or if you have any other feedback. Thank you.
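The effect of the suggested `np.random.seed(1)` call can be sketched as follows. This again assumes scikit-learn's `GaussianMixture`, which falls back to NumPy's global random state when no `random_state` is given. The data and the helper `fit_with_global_seed` are hypothetical, and the snippet does not reproduce the actual code around line 157 of the RepeatHMM script.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# Synthetic one-dimensional data with a broad peak (illustration only).
data_rng = np.random.default_rng(0)
X = data_rng.normal(25, 5, 150).reshape(-1, 1)


def fit_with_global_seed(data, seed):
    """Seed NumPy's global RNG, then fit a mixture that relies on it.

    Hypothetical helper for illustration only.
    """
    np.random.seed(seed)  # reset the global RNG right before fitting
    gmm = GaussianMixture(n_components=2)  # random_state=None uses the global RNG
    gmm.fit(data)
    return np.sort(gmm.means_.ravel())


# Both calls start from the same global RNG state, so the fits agree.
run_1 = fit_with_global_seed(X, 1)
run_2 = fit_with_global_seed(X, 1)
print(run_1, run_2)
```

Note that `np.random.seed` affects every later use of NumPy's global generator in the same process, so the seed call should sit immediately before the fit if other code also draws random numbers.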

andyb3 commented 6 years ago

Ah OK. I will try that! Thanks again for your help. Question answered so I will close the issue.