ANGSD / angsd

Program for analysing NGS data.
231 stars 51 forks source link

Calculating FST: when to fold SFS #259

Closed TonyKess closed 4 years ago

TonyKess commented 5 years ago

Hello,

I'm using ANGSD to calculate FST in sliding windows, working from the species reference genome rather than using an ancestral genome. My question is: at what stage in using the realSFS command should I be inputting the -fold 1 flag? Should I be using this flag at:

  1. The step in which I calculate a 2djoint SFS using realSFS?
  2. the realSFS index step?
  3. the realSFS stat/stat2 step?
  4. Some combination of the above steps?

Thanks for your help! Tony

clairemerot commented 5 years ago

Hello, I have the same question. :) I see that in August there has been an update in the github version to improve how realSFS deals with folded but I'm unsure how this works now.

Wiki says "We have therefore added a proper folding procedure for the optimization based on the UNFOLDED .saf.idx files generated by -doSaf. These are the ones that should be used for calculating fst. Therefore please remember to add -fold 1 if you want angsd (the realSFS subfunction) to perform fst and pbs estimation using the folded spectra."

My understanding is that we should make -doSaf (without the fold option) and then RealSFS A.saf.idx B.saf.idx -fold 1 > A-B.ml Is that right?

On a related note: 1- Does anyone know if filtering the sites involved in the saf (with the sites options - for instance to keep only polymorphic sites) is appropriate or should we keep all sites?

2- Should we used the fold options for the thetas? if so at which step?

Thanks a lot for the help! Claire

nspope commented 5 years ago

The reason this is confusing is because of the multiple steps/inputs in Fst estimation. Briefly, it used to be: (A) generate folded SAF with ANGSD, (B) folded SAF used to generate folded SFS, (C) folded SAF and folded SFS used to calculate Fst in windows.

However, this didn't result in the correct likelihood for 2d spectra, as the folding was done at the single-population level (& not for pairs of populations). So: in August, realSFS was modified so as to use the unfolded SAF when estimating the folded SFS and Fst. The new workflow is:

  1. BAMs used to generate unfolded SAF (do NOT use -fold option)
  2. unfolded SAF used to generate folded SFS (use -fold in realSFS)
    realSFS pop1.unfolded.saf.idx pop2.unfolded.saf.idx -fold 1 >folded.sfs
  3. unfolded SAF and folded SFS used to generate per-site numerator/denominator of Fst (use -fold in realSFS fst index)
    realSFS fst index pop1.unfolded.saf.idx pop2.unfolded.saf.idx -sfs folded.sfs -fold 1 -fstout persite
  4. sum numerator and denominator in windows
    realSFS fst stat2 persite.fst.idx -win XXXX -step XXXX >window.fst
nspope commented 5 years ago

Claire,

Generally speaking you shouldn't use any filter that will distort the distribution of allele frequencies or remove monomorphic sites (minMAF and SNP_pval, I'm looking at you). This will hugely distort the SFS. It is however a good idea to remove sites that collapsed paralogs or contamination.

I will have to check wrt persite theta estimation... you do need to specify folding, but I am not clear how/if this changed with the last commit.

clairemerot commented 5 years ago

Thanks a lot for the precisions on the update for the folding issue! We are updating angsd and will try again the fst.

About the filters on allelic frequencies, the choice is difficult because we are at very low coverage (1x) so we were unsure whether low-frequency variants were reliable. Hence, the use of minMAF.

It also saves a lot of time and memory.

I guess that without filters I need to use the option -nSites 10000000 in the realSFS command. Yet, if I remember well, when using this option the output ".sfs" is not directly usable by realSFS fst index because it has more than one line. Is it still correct to take the sum of all lines (per column) as explained here https://github.com/ANGSD/angsd/issues/69

Thanks a lot for the support !

nspope commented 5 years ago

Hi Claire,

the method used by ANGSD is designed for very low-depth data. As long as you use a sufficient number of sites and reads are correctly mapped, you should recover an accurate SFS regardless of sequencing error. I would really not recommend filtering based on allele frequency for SFS related stuff. This will throw away a bunch of useful information and totally distort theta-like statistics.

Yes, you can sum across lines with output after -nsites. Using 10Mb per batch is probably fine unless your SFS is very large.

clairemerot commented 5 years ago

Thanks for the advice. We will re-run without maf filter and used nSites & sum. I'm sorry I have one (hopefully last ) question. Does doSaf take into accoun the -doMajorMinor? Or is each position always polarised the same way? If so how? (reference= major?) I'm concerned that when doing the saf on two different population the frequency is not necessarily calculated for the same allele? By the way, how is it for triallelic positions (which may have different alternative alleles in different populations...)? no worries for that?

Thanks a lot!! Claire

mnr006 commented 5 years ago

I'm also curious about potentially removing sites that deviate from HWE. I know there are filters to do this in angsd, but I'm not familiar enough to know how this might distort the SFS. For other analyses, we are removing these sites and running them separately to look for possible signs of selection. However, in trying to calculate Fst values should we be removing them or leaving them in?

Thank you again for your time!! Megan

TonyKess commented 5 years ago

Hi again,

Thanks for clarifying the steps required for calculating FST. I've managed to get this calculation to work... but not properly. I am now getting FST estimates (-0.004) that don't match values observed in pool data (0.11), or on a SNP chip (0.15). Surprisingly, the genotype likelihoods themselves appear consistent with this data - PCangsd selection scan stats are highly correlated with the pool FST, whereas correspondence between pool and ANGSD calculated FST is very poor. I'm guessing something is going wrong at the SFS stage. Here are the commands I am currently entering for calculating the SFS:

./angsd \
      -ref ref.fasta \
      -anc ref.fasta \
      -bam bams.txt \
      -out POP \
      -dosaf 1 \
      -GL 1 \
      -nThreads 20 \
      -minMapQ 30 \
      -minQ 20 \
      -minInd 10 \
      -doMajorMinor 1 \
      -doMaf 2 \
      -SNP_pval 2e-6 \
      -doCounts 1 \
      -dumpCounts 2 \
      -setMinDepth 10 \
      -uniqueOnly 1 \
      -remove_bads 1 \
      -only_proper_pairs 1 \
      -maxDepth 400 

And then calculating FST with:

angsd/misc/realSFS POP1.saf.idx POP2.saf.idx -fold 1 > POP1.POP2.sfs
angsd/misc/realSFS fst index  POP1.saf.idx POP2.saf.idx -sfs POP1.POP2.sfs -fold 1 -fstout POP1.POP2
angsd/misc/realSFS fst stats POP1.POP2.fst.idx -fold 1
angsd/misc/realSFS fst stats2 POP1.POP2.fst.idx  -win 20000 -step 20000 -fold 1 > POP1.POP2_fst20k

Any idea what's going wrong here?

z0on commented 5 years ago

Hi Megan -
I say keep out-of HWE sites, especially when analyzing multi-dimensional SFS (those would be the sites diverging between populations ==> Wahlund effect). Only remove sites that are potentially from lumped paralogs in the reference assembly - such sites show predominance of heterozygotes. Nate Pope wrote a handy script to do that (attached), which works with -doGeno 11 output. The following one-liner removes sites for which the posterior probability that heterozygote calls are the majority exceeds 0.75.

zcat mydata.geno.gz | python HetMajorityProb.py | awk "\$6 < 0.75 {print \$1\"\t\"\$2}" > allSites

then,

angsd sites index allSites

and use -sites allSites in subsequent calls to ANGSD.

cheers Misha

On Oct 24, 2019, at 6:55 PM, mnr006 notifications@github.com wrote:

I'm also curious about potentially removing sites that deviate from HWE. I know there are filters to do this in angsd, but I'm not familiar enough to know how this might distort the SFS. For other analyses, we are removing these sites and running them separately to look for possible signs of selection. However, in trying to calculate Fst values should we be removing them or leaving them in?

Thank you again for your time!! Megan

— You are receiving this because you are subscribed to this thread. Reply to this email directly, view it on GitHub https://github.com/ANGSD/angsd/issues/259?email_source=notifications&email_token=ABZUHGEJ5ABQJO7CMA64RVDQQIYXFA5CNFSM4I67OZUKYY3PNVWWK3TUL52HS4DFVREXG43VMVBW63LNMVXHJKTDN5WW2ZLOORPWSZGOECGYOQA#issuecomment-546146112, or unsubscribe https://github.com/notifications/unsubscribe-auth/ABZUHGDVQBRS4UPUIMMOEZTQQIYXFANCNFSM4I67OZUA.

clairemerot commented 5 years ago

@TonyKess Have you solve the problem of getting negative FST? I am having the same strange results with the new procedure while it was fine before?

TonyKess commented 5 years ago

@clairemerot I haven't had much luck yet, I continue to get negative Fst values that don't match results from other methods, regardless of whether I include or exclude the -fold flag at the realSFS stage. I'm currently testing whether either removing the doMajorMinor flag or setting it to 5 to force inferring alleles from the "ancestral" genome makes a difference.

clairemerot commented 5 years ago

Thanks, I am trying different options on my side too, and I have opened a new issue about that (I am also getting nan in FST. It used to be sensible value fo fst before, like with the angsd version from 6 months ago and fol in saf but not realSFS... So wondering what happens? Let's be in touch if we find a solution! Thanks

ANGSD commented 4 years ago

Ok, so it looks like the original question was answered by @nspope and the additional questions has been submitted as new issues, so I am closing this one. Feel free to reopen if needed.