eblerjana / pangenie

Pangenome-based genome inference
MIT License
114 stars 10 forks source link

Adapting Pangenie for Non-Human Species Genome Inference #59

Open LeoHongboWANG opened 11 months ago

LeoHongboWANG commented 11 months ago

Dear Pangenie team,

I am pleased to express my gratitude for your work on Pangenie's development.

For my specific project, I have some questions regarding genome inference. My species has two populations with different recombination rates and effective population sizes. Furthermore, I have noticed that different software programs produce different effective population sizes. Upon examining the Pangenie source code, I found that it uses values of 1.26 and 25,000. I am curious as to the origin of these data. Regarding my project involving species with varying population sizes and recombination rates, would you recommend using an average value for each group or different values for different groups? I would appreciate any guidance you can provide to help me make the most of Pangenie for my research. Best, Hongbo

eblerjana commented 10 months ago

Hi Hongbo,

we have not done any experiments with different recombination rates as we only worked with human data. The effective population size parameter was set experimentally by trying a range of different values and analyzing which one gives the best results on several human datasets. The problem is that in our specific model, the transition- and emission probabilities need to be calibrated to ensure one is not dominating the other. For this reason, an effective population size of 0.00001 turned out to give the best results, that is why this parameter is currently used as default value in PanGenie. We haven't run PanGenie on other species (it was mainly designed for human, but can in principle be used for all diploid species). If you have ground truth data, one way to proceed would be to check different parameters and how they effect the genotyping results.

Best, Jana

LYC-vio commented 10 months ago

Dear Pangenie team, Thank you for your detailed explanation. Just a quick follow up, I've noticed the following line in src/transitionprobabilitycomputer.cpp

long double distance = (to_variant - from_variant) * 0.000004L * ((long double) recomb_rate) * effective_N

and according to line 38 in src/hmm.hpp

... long double effective_N = 25000.0L,  ...

The default value for Ne is 25000. Would you please explain more about the "effective population size of 0.00001"?

Thank you very much

Best, Yichen

eblerjana commented 10 months ago

The effective population size is set to 0.00001, see: https://github.com/eblerjana/pangenie/blob/7c171f38d507ac967b45c868ce7c166e56cbf3b9/src/pangenie-genotype.cpp#L32C16-L33C37. 25000 is the default value implemented in the HMM constructor, but PanGenie always calls this constructor providing a value of 0.00001. The reason for this is explained above in my previous answer.