fgvieira / ngsF

Estimation of per-individual inbreeding coefficients under a probabilistic framework
Other
20 stars 8 forks source link

Number of runs to avoid convergence to local maxima #24

Closed irbinveliz closed 1 year ago

irbinveliz commented 1 year ago

Hello Dr. Vieira,

I have read the tutorial and it says that ngsF should be run several times to avoid convergence to local maxima. My question is how many runs should be performed to avoid this. Should the seed argument be different for each run? (my guess is that no)

Additionally, after the certain number of runs are done, should we obtain the mean value for inbreeding for each sample?

For example, if I run ngsF (with the same parameters and seed) 10 times, I will have 10 output files with the different values for inbreeding coefficients. Then, I would have to put together the inbreeding values for each sample and get the mean value per sample, right?

Thank you in advance!

fgvieira commented 1 year ago

Yes, you should always run ngsF several times; how many depends on your data and how much time you can dedicate to it (always changing the seed value, otherwise you get the same results!).

You should do a mixture of approx and full EM, just full EM, using random and estimated initial values (check ngsF.sh); but it all depends how much time it takes and how much time you have.

Then just choose the one with the highest Lkl.

irbinveliz commented 1 year ago

Thank you so much for your quick reply!

My data consist of ultraconserved element data aligned for 80 individuals. I am mainly using ANGSD to calculate different parameters such as SFS and heterozygosity and for that I want to have into account the inbreeding coefficients for each sample.

What is Lkl? Is it the per-individual log-likelihood value from the parameters file? Related to this, how can I read the parameters file? I know it is a binary file but I have not managed to read it properly. I have used the command 'hexdump' but it does not display the file properly.

Another related thing. Is it needed to add the argument --calc_LRT? When I produced the GL for my samples I used the argument -domajorminor.

The commands that I am running are the following:

angsd -gl 1 -doglf 3 -domajorminor 1 -C 50 -minmapq 30 -minQ 20 -baq 1 -uniqueOnly 1 -remove_bads 1 -only_proper_pairs 1 -bam bamlist.txt -out genolike.inbreeding -nThreads 20

ngsF --n_ind 80 --n_sites 1671492 --glf genolike.glf.gz --output filt.inbr.coef

fgvieira commented 1 year ago

Lkl is the final/total log-likelihood.

--calc-LRT is only if you want to test if your inbreeding estimated are different from zero.

irbinveliz commented 1 year ago

Great! Thank you for your time and help!