Correction and Scaling of Diversity Estimators in Theta Sliding Window Output (pestPG file)

Callithrix-omics commented 4 years ago

Hi, I have two questions:

Could you please clarify how theta diversity estimators such as nucleotide diversity and Watterson's theta are scaled in the pestPG output file after a sliding window analysis of out.thetas.idx?

On the page http://www.popgen.dk/angsd/index.php/Thetas,Tajima,Neutrality_tests#Example_Output it says

Output in the ./thetaStat print thetas.idx are the log scaled per site estimates of the thetas
Output in the pestPG file are the sum of the per site estimates for a region

Which base of the logarithm was used, in case one wants to convert back to linear values?

Also, looking through some recent publications, I see that some authors correct theta estimators like nucleotide diversity by dividing "the sum of per-site π by the number of variant and invariant sites in a given window" (e.g., https://onlinelibrary.wiley.com/doi/full/10.1111/mec.15401). In the pestPG output file, could you get this correction for nucleotide diversity, for example, by dividing tP by nSites?

Regards, and thank you.

clairemerot commented 4 years ago

Maybe this could help: the theta Watterson or theta Pi per window should be divided by the total number of sites (last column in the full theta output). I followed what is done in https://www.biorxiv.org/content/10.1101/2020.06.27.175091v1.full.pdf. The values make more sense in my dataset. Cheers, Claire
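For anyone who wants to script this, here is a minimal sketch of the division described above, using only the Python standard library. It assumes a tab-separated pestPG file whose header names the columns `Chr`, `WinCenter`, `tW`, `tP`, and `nSites`; please check those names against your own header line. The excerpt is made up and leaves out the other statistic columns that a real file contains.

```python
import csv
import io

# Hypothetical two-window pestPG excerpt; the values are made up.
# Column names are an assumption: check them against your own header line.
EXAMPLE_PESTPG = (
    "#(indexStart,indexStop)(firstPos_withData,lastPos_withData)(WinStart,WinStop)"
    "\tChr\tWinCenter\ttW\ttP\tnSites\n"
    "(0,1000)(1,50000)(0,50000)\tchr1\t25000\t2.5\t2.0\t1000\n"
    "(1000,1000)(50001,50001)(50000,100000)\tchr1\t75000\t0.0\t0.0\t0\n"
)


def per_site_thetas(pestpg_text):
    """Divide the per-window sums tW and tP by nSites for each window."""
    reader = csv.reader(io.StringIO(pestpg_text), delimiter="\t")
    header = next(reader)
    col = {name: i for i, name in enumerate(header)}
    results = []
    for fields in reader:
        if not fields:
            continue  # skip blank lines
        n_sites = int(fields[col["nSites"]])
        if n_sites == 0:
            continue  # no data in this window, so no per-site estimate
        results.append({
            "chr": fields[col["Chr"]],
            "win_center": int(fields[col["WinCenter"]]),
            "tW_per_site": float(fields[col["tW"]]) / n_sites,
            "tP_per_site": float(fields[col["tP"]]) / n_sites,
        })
    return results


for row in per_site_thetas(EXAMPLE_PESTPG):
    print(row)
```

Windows with `nSites` equal to zero are skipped rather than reported as zero diversity, since they carry no information.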

Callithrix-omics commented 4 years ago

I ended up doing that too. Thank you for the suggestion.

ANGSD commented 4 years ago

Yes, that is definitely a good approach. I would be careful if you have highly variable coverage along the genome, as with capture or RADseq data.


BelenJM commented 3 years ago

Hi, I calculated some theta and pi estimates with ANGSD from some capture data with medium coverage. @ANGSD Can you elaborate on why varying coverage along the genome could bias these estimates?

ANGSD commented 2 years ago

If you are considering a 5 Mb region but only have data for one site, then the estimate for that region will be dominated by the estimate of that single site.
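One way to guard against this, sketched below, is to report a per-site value only when enough of the window actually has data. The function name `per_site_pi` and the threshold `min_fraction=0.1` are hypothetical choices for illustration, not something ANGSD provides or recommends; pick a cutoff that suits your own data.

```python
def per_site_pi(t_p_sum, n_sites, window_size, min_fraction=0.1):
    """Return tP / nSites, or None when too little of the window has data.

    t_p_sum      -- tP column of the pestPG file (sum of per-site estimates)
    n_sites      -- nSites column (number of sites with data in the window)
    window_size  -- window length in bp used for the sliding window
    min_fraction -- illustrative cutoff on n_sites / window_size (assumption)
    """
    if n_sites < min_fraction * window_size:
        return None  # too few sites: the estimate would rest on very little data
    return t_p_sum / n_sites


# A 5 Mb window with data for a single site is rejected ...
print(per_site_pi(0.01, 1, 5_000_000))
# ... while a well-covered window yields a per-site estimate.
print(per_site_pi(5000.0, 4_000_000, 5_000_000))
```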

ekhowell commented 2 years ago

Hi, I just wanted to follow up on the question @Callithrix-omics asked: are the per-site diversity estimates contained in the .thetas.idx/.thetas.gz output scaled using log base 10 or base e? Thanks!

nspope commented 2 years ago

Seems to be natural log; here is the relevant line in the source

ANGSD commented 1 year ago

Natural log.
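Given that, converting the per-site values back to the linear scale is just an exponentiation. The sketch below assumes whitespace-separated text lines as printed by `./thetaStat print`, with chromosome and position in the first two columns and the log-scaled thetas in the remaining ones; that layout is an assumption, so check it against your own output. The example line is made up.

```python
import math


def linear_thetas(lines):
    """Yield (chrom, pos, [theta values on the linear scale]) per site."""
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and header/comment lines
        fields = line.split()
        chrom, pos = fields[0], int(fields[1])
        # Values are natural-log scaled, so exp() recovers the estimates.
        yield chrom, pos, [math.exp(float(x)) for x in fields[2:]]


# Made-up example: ln(0.001) and ln(0.002) as the first two theta columns.
example = [
    "#Chromo\tPos\tWatterson\tPairwise",
    "chr1\t100\t-6.907755278982137\t-6.214608098422191",
]
for chrom, pos, values in linear_thetas(example):
    print(chrom, pos, values)
```

According to the documentation quoted at the top of this thread, the pestPG values are the sums of these per-site estimates over a region, so summing the exponentiated values over the sites of a window should be comparable to the corresponding pestPG column.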


ANGSD / angsd

Correction and Scaling of Diversity Estimators in Theta Sliding Window Output (pestPG file) #329