ANGSD / angsd

Program for analysing NGS data.

SFS blip at p=0.5 #156

Open z0on opened 6 years ago

z0on commented 6 years ago

My 2d-SFS all have a blip at p=0.5 (see attached), which I assume is caused by the lumped-paralogs problem (i.e., SNPs that always appear as heterozygotes). Use the HWE filter, you say - but I cannot filter them out using HWE when analyzing all data together, because I don't want to filter out SNPs fixed in each pop, and I cannot apply HWE within each pop because then my sites would not align for the 2d-SFS. Is there a solution?

Misha

sc12imlsm2_979447_c12_dadi.data_pop0_pop1_34_14.pdf

ANGSD commented 6 years ago

Dear Misha, I agree that this bump at 0.5 is due to paralogs. I think the program ngsParalog is able to find these regions.

For human data I was able to remove the bump by discarding regions of low mappability.

Best


z0on commented 6 years ago

Hey! - can't we just have a simple filter against sites with a higher-than-cutoff (say 0.5) proportion of heterozygotes? (Or will that require genotype calling, god forbid?)

z0on commented 6 years ago

I think I have a seat-of-the-pants solution that at least partially helps against lumped paralogs:

Run ANGSD with -dosnpstat 1 and filter out sites where heterozygote counts comprise more than 50% of all counts:

zcat sfsSites.snpStat.gz | awk '($3+$4+$5+$6)>0' | awk '($12+$13+$14+$15)/($3+$4+$5+$6)<0.5' | cut -f 1,2 > sites2do
angsd sites index sites2do

Then obtain the SAF for each population at these sites using angsd -doSaf 1 -sites sites2do

nspope commented 5 years ago

Thanks for your solution, Misha. I have been using a similar approach using genotype probabilities (output by -doGeno) instead of raw counts:

For a given sample/chromosome/position, calculate the log probability that the sample is a heterozygote as the difference between the log-sum-exp over the heterozygous genotype likelihoods and the log-sum-exp over all genotype likelihoods -- this is just Bayes' rule with a uniform prior over genotypes.

(EDIT Using raw genotype likelihoods like I originally recommended does not work well. This is because doing so is tantamount to assuming each site could have up to four alleles -- ignoring the whole -doMajorMinor bit -- and artificially inflates the expected number of heterozygotes. Instead use genotype probabilities.)
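
For concreteness, here is a minimal Python sketch (not ANGSD code) of that per-sample computation under the three-genotype major/minor model, using Bayes' rule with a uniform prior; the log10 scale of the inputs and the toy values are assumptions:

# Posterior probability that one sample is heterozygous, from its three
# genotype likelihoods (hom-major, het, hom-minor) under the major/minor model.
import numpy as np
from scipy.special import logsumexp

def pr_het(log10_lik):
    loglik = np.asarray(log10_lik) * np.log(10.0)  # log10 -> natural log
    # uniform prior cancels: P(het | data) = L_het / (L_AA + L_Aa + L_aa)
    return np.exp(loglik[1] - logsumexp(loglik))

print(pr_het([-2.0, -0.1, -3.5]))  # toy values; near 1 => likely a heterozygote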

Across samples at a fixed chromosome/position, treat these probabilities as the success probabilities of independent Bernoulli trials (e.g. a "success" for a given sample means the sample is heterozygote). Then the total number of heterozygotes follows a Poisson-binomial distribution. The expected number of heterozygotes is just the sum of the "success" probabilities.
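
As a toy illustration of that step (using the poibin module linked in the script below; the probabilities here are made up):

# Expected heterozygote count and upper-tail probability of a heterozygote
# majority, from per-sample heterozygote probabilities.
from poibin import PoiBin  # https://github.com/tsakim/poibin

p_het = [0.9, 0.8, 0.7, 0.95, 0.2, 0.1]  # made-up P(sample i is heterozygous)
expected = sum(p_het)                    # E[# heterozygotes] = sum of probabilities
pb = PoiBin(p_het)
tail = pb.pval(len(p_het) // 2 + 1)      # Pr(more than half are heterozygous)
print(expected, tail)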

Then, one could (1) discard sites where the expected proportion of heterozygotes is >0.5, or alternatively (2) discard sites where there is a large probability that the proportion of heterozygotes is >0.5; e.g. where the upper tail prob of the corresponding Poisson-binomial is greater than some fixed threshold.

So for example (assuming genotype probabilities are output along with genotype calls, e.g. "-doPost 2 -doGeno 11"):

# requires python module from https://github.com/tsakim/poibin
# input: *geno.gz from -doGeno 11, read from stdin
echo '
#!/usr/bin/python
from sys import stdin, stdout
from numpy import array, array_split
from poibin import PoiBin
for line in stdin:
  line = line.strip().split("\t")
  # fields: chrom, pos, major, minor, then per sample: genotype call + 3 genotype probs
  chrom, pos, gl = line[0], line[1], array_split(array(line[4:], "float"), (len(line)-4)//4)
  pr_heteroz = array([x[2] for x in gl if not x[0] < 0.])  # skip missing samples (call = -1)
  num_heteroz = sum([int(x[0]) for x in gl if x[0]==1])    # hard heterozygote calls
  h_expected = sum(pr_heteroz)/len(pr_heteroz)
  try:
    pois_binom = PoiBin(pr_heteroz)
    utail_prob = pois_binom.pval(len(pr_heteroz)//2+1)     # Pr(heterozygote majority)
  except Exception:
    utail_prob = 'NaN'
  stdout.write("\t".join([chrom, pos, str(len(pr_heteroz)), str(num_heteroz), str(h_expected), str(utail_prob)]) + "\n")
' > heterozygotes.py 
# outputs: chrom, pos, number non-missing samples, number hard heterozygote calls,
#          expected prop heterozygotes, Pr(prop heterozygotes > 0.5)

zcat pop1.geno.gz | python heterozygotes.py | awk '$5 <= 0.5 {print $1"\t"$2}' > keep.pop1.sites
zcat pop2.geno.gz | python heterozygotes.py | awk '$5 <= 0.5 {print $1"\t"$2}' > keep.pop2.sites
sort keep.pop1.sites keep.pop2.sites | uniq -d > keepers.sites # intersection
angsd sites index keepers.sites
# blah blah using -sites

No real clue if this is ultimately any better/worse than using genotype counts, but seems to work ok when depth is low & variable across samples.

EDIT I did some comparisons between this filtering strategy and filtering using called genotypes. This is shown in attached scatterplots, where the x-axis in all panels is the proportion of heterozygous genotypes using angsd to call the genotypes. The y-axes are alternative metrics -- (A) Misha's suggestion in his comment above (<0.5), (B) expected number of heterozygotes (<0.5), (C) probability of heterozygote majority (<0.75), (D) HWE filter (pval >0.05). Blue points fail the filter with called genotypes but pass the alternative filter, red points fail the alternative filter but pass the filter with called genotypes, and dark/light gray points pass/fail both filters. Unsurprisingly, using either expected number of heterozygotes or the upper tail probability does no worse than using called genotypes, and perhaps better by removing some sites that are borderline. het_filter2

(Though, looking at ngsParalog, I prefer the underlying model there -- it's just very slow with many samples.)

z0on commented 5 years ago

Pretty cool! Thanks a lot.


z0on commented 5 years ago

Sorry, a python dummy here: how do you install poibin? (pip install does not work as the git repo does not include setup.py …)

Misha


TeresaPegan commented 3 years ago

Hi all, I was wondering whether people have had good success using the heterozygote frequency filter described in #294? How should the threshold for filtering be determined? I have this same problem with a small peak at p=0.5, and I was excited to see that there is a built-in option now that can help deal with it! But I tried it using 0.5 as the threshold, and it seems to have had a very large effect on my SFS for frequencies starting at about n/2. I have 39 individuals, and in this folded SFS you will see that the values really get wacky starting at around 19.

[figure: folded SFS with -maxHetFreq 0.5]

Before I applied the heterozygosity filter, the SFS looked nice and smooth without this steep drop and jaggedness (it's just that it also had that up-tilt at the end from the paralogs/mismapping).

I think I will play around with some different thresholds for the filter, although using trial and error feels kind of inexact.

Thanks! -Teresa

nspope commented 3 years ago

Hi Teresa, this is a simple filter that removes sites where the proportion of heterozygotes is high, which is a really common way to detect paralogs. The only difference from this common practice is that instead of estimating this proportion by counting homozygous vs heterozygous genotypes, here it's estimated from genotype likelihoods. It's unlikely that the algorithm is behaving badly (the underlying algo is from the HWE code that's been around forever), so I suspect the problem is your data. You can print out the estimated heterozygote proportion by setting -maxHetFreq 1.0, which won't filter any sites but will add this info to the HWE output file. Take a look at these, plotted against genomic coordinates, to get a sense of whether there is something weird going on like large duplications or horrible mismapping. If you're using a very diverged reference genome, try being more stringent about filtering out reads with low mapping quality/no proper pairs/etc.
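
For example, a minimal Python sketch of that diagnostic plot; the file name and the column names ("Chromo", "Position", "hetFreq") are assumptions -- check the header of your .hwe.gz output and adjust:

# Plot the estimated heterozygote frequency along one chromosome from the
# HWE output produced with -doHWE 1 -maxHetFreq 1.0.
import pandas as pd
import matplotlib.pyplot as plt

hwe = pd.read_csv("out.hwe.gz", sep="\t")   # hypothetical file name
one = hwe[hwe["Chromo"] == "chr1"]          # pick a chromosome to inspect
plt.scatter(one["Position"], one["hetFreq"], s=1, alpha=0.3)
plt.xlabel("position (bp)")
plt.ylabel("estimated heterozygote frequency")
plt.savefig("hetfreq_chr1.png", dpi=150)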

If you take a look and still suspect it's the algorithm, I'd suggest calling genotypes using e.g. GATK and comparing the heterozygote frequencies from these called genotypes with ANGSD's output (as described above). If these end up being really different, let me know.

z0on commented 3 years ago

I use Nate's maxHetFreq filter all the time and have never seen any weirdness... it does the job pretty well (removes the blip at p=0.5). That said, I always down-project my SFS to 80% or less of the total N, to smooth over jaggedness at high p, which is (apparently) the result of realSFS overfitting. I actually wonder if @Angsd is going to look at it any time soon...
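
For readers unfamiliar with down-projection: it averages the SFS over hypothetical smaller samples drawn hypergeometrically. A minimal sketch written from the standard formula (an unfolded-SFS version for illustration, not Misha's actual pipeline):

# Project an SFS for n chromosomes down to m < n chromosomes.
# sfs[i] = number of sites where i of n chromosomes carry the derived allele.
import numpy as np
from scipy.stats import hypergeom

def project_sfs(sfs, m):
    n = len(sfs) - 1                  # entries for 0..n copies
    proj = np.zeros(m + 1)
    for i, count in enumerate(sfs):
        # P(j of m drawn chromosomes carry the allele | i of n carry it)
        proj += count * hypergeom.pmf(np.arange(m + 1), n, i, m)
    return proj

toy = np.array([1000., 300, 150, 90, 60, 45, 40, 38, 50])  # toy SFS, n = 8
print(project_sfs(toy, 6))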


TeresaPegan commented 3 years ago

Thank you both for your insights! You are right, I don't think it's really an algorithm problem. (And in that case perhaps this discussion doesn't really belong on this thread, but I am not aware of an ANGSD users google group or other alternative -- please let me know if such a group exists!)

It seems like in my data, minor allele frequency is being mostly driven by heterozygosity? The reason I think so is that I've tried a few thresholds, and there is a very clear effect of the threshold on where the folded SFS starts getting jagged. For example, above I show the result when I used a hetfreq filter of 0.5, where the folded SFS gets jagged at about element 19 (close to n*0.5; n=39 samples), and if I try a hetfreq filter of 0.85, I get a folded SFS that is smooth until it drops and becomes jagged at about element 33, which is roughly n*0.85. So, no matter where I put the threshold, it sharply decreases the number of sites with minor allele frequencies of (threshold * n) and above.

[figure: folded SFS with -maxHetFreq 0.85]

I can't think of a good reason for my data to have so many heterozygotes, but it is certainly concerning, and thanks for the suggestions of things to look into. I am using a very closely-related reference genome. I tried what Nate suggested and plotted mean heterozygote frequency across genome windows for several of my largest chromosomes, but have not yet found any regions that stand out.

Thanks! -Teresa

p.s. I don't expect it to make any difference, but I am using @nspope's branch to use the heterozygosity filter, not the master version of ANGSD, because the more recent master versions of ANGSD will not run on my data without hitting segfaults -- that's a whole other can of worms. After using the nspope branch to make a site list based on the heterozygosity filter, I use the site list with realSFS on a SAF that I created with a stringent minInd filter (I only want sites covered by at least 1x for all individuals) in ANGSD version 0.930. I'm hoping to use the SFS for demography modeling.

nspope commented 3 years ago

I don't quite understand what you mean, but a big effect of this filter on the folded SFS outside of the last bin (the 0.5 bin) is most definitely suspicious. Have you checked for contamination in your library? Like, extractions that became accidentally pooled? Is sequencing depth unusually large at your highly heterozygous sites or very uneven across the genome, like you would expect if there's lots of (collapsed) duplications? Any chance there's polyploidy in your study organism?

If these are library/sequencing/mapping problems then honestly it might be easier to sort out if you first call variants using GATK or freebayes or something, rather than running stuff through ANGSD -- it is often easier to visualize/troubleshoot genotype calls than genotype likelihoods, and GATK (at least) calculates a number of useful quality metrics that ANGSD does not. Once you sort out these issues, you can always come back and reestimate the SFS in ANGSD.

nspope commented 3 years ago

Also, just to follow up on your PS: did using the nspope_bandedDP branch fix the segfaults in #394?

TeresaPegan commented 3 years ago

No, the nspope_bandedDP branch did not fix my segfaults -- I did actually try it after seeing you mention on a different thread that it handles memory differently. I got a segfault.

However, your hetFilter branch lets me run things successfully that I cannot run in ANGSD 0.935. I have never tried the doSaf commands described in #394 with the hetFilter branch, but I successfully used the hetFilter branch to make site lists with the heterozygosity filter (with doMaf and doGlf and all that). When I tried to make a site list with the same code in ANGSD 0.935, I got segfaults.

Thanks! -Teresa

TeresaPegan commented 3 years ago

Hi all, just as a quick update in case it is of interest to those with these issues: I compared the performance of the ANGSD maxHetFreq option with the quality-control filter provided by ngsParalog, and it seems like ngsParalog is a better option in my case. First, it appears to have removed many more sites with lower allele frequencies, which is probably desirable assuming these are indeed mismapped sites. Second, it produces a relatively smooth SFS, lacking the segmented pattern produced by the -maxHetFreq filter, where there is a step-like drop corresponding to where I set the -maxHetFreq threshold. This is shown in the figure below:

[figure: folded SFS compared under -maxHetFreq and ngsParalog filtering]

I don't mean to suggest there is anything wrong with the -maxHetFreq filter -- it's a great feature and may work well for other datasets -- but it seems like in my case ngsParalog is producing a better output. That said, I am certainly no expert on SFSes, and if anyone has comments on this, please let me know!

Here are some notes about my datasets, in case it's helpful:

-- I am using 6 different bird species (so, no polyploidy). All 6 species' folded SFSes show a peak on the right side with a "raw" SFS, and all 6 species show a step-like drop in their SFS when I used -maxHetFreq. (One example is shown above.)

-- My data are sequenced to about 3x. Roughly speaking, each of my 6 species was sequenced on a different lane, although I have done some lanes with a mix of species.

-- I aligned my data with BWA-MEM. Between my 6 species I am using three different reference genomes, each a close relative of the target species, so these issues are not inherent to a particular reference genome.

-- Aside from the filters I have discussed, my "raw" SFS used ANGSD's minInd filter such that I included only sites covered by at least 1x for all individuals per species. It also uses -minQ 20 and -minMapQ 20.

-- Contamination is a worrisome possibility to explain high levels of heterozygosity in my data, but I think two patterns argue against wholesale contamination/mixing caused by index failure or something like that. First, I can detect geographic structure in samples all sequenced on the same lane. Second, I can de novo assemble clean mitochondrial genomes from individuals that were sequenced in mixed-species lanes, with no evidence of multiple species' haplotypes getting mixed up or appearing to show heteroplasmy. Still, it's hard to detect low levels of contamination. In theory it should not be happening, of course.

nspope commented 3 years ago

ngsParalog is definitely a more principled approach to paralog detection/removal, glad it worked out for you!

z0on commented 3 years ago

Teresa - this is a great demonstration! Can you please post some details on how you used ngsParalog to filter sites and make this SFS?


TeresaPegan commented 3 years ago

Sure! I really just followed the instructions on the ngsParalog GitHub page. I used the calcLR program; I have not yet tried the dupHMM one. With the output of calcLR, I basically copied the R code provided, which will "generate list of sites that don't show evidence of mismapping at 0.05 significance level." When I made my SFS, I used the R code's output as a -sites option, along with an unfiltered SAF file I had previously generated.
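
(For reference, here is a rough Python equivalent of that filtering step. The column layout -- chromosome, position, with the likelihood ratio in the fifth column -- and the plain chi-squared(1) null are assumptions; follow the R snippet in the ngsParalog README for the authoritative version.)

# Turn ngsParalog calcLR output into a -sites list of positions
# showing no evidence of mismapping at the 0.05 level.
import pandas as pd
from scipy.stats import chi2

lr = pd.read_csv("mydata.lr", sep="\t", header=None)  # hypothetical file name
pval = chi2.sf(lr[4], df=1)                           # upper-tail p-value per site
keep = lr[pval > 0.05]                                # no evidence of mismapping
keep[[0, 1]].to_csv("keep.sites", sep="\t", header=False, index=False)
# then: angsd sites index keep.sites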

From a logistical perspective, note that it takes a pretty long time to run. My genomes are ~1 Gb, and I broke each genome into 7 roughly equal-sized regions to run in parallel. All the parallel jobs finished within about 24 hours; of course, with more parallelizing it could go faster. So I basically did all of the steps described above 7 times and concatenated the output into one -sites file.

nspope commented 3 years ago

You should really only use variant sites for ngsParalog, it's a waste of time to input pileup for monomorphic sites (they carry no information and the I/O cost is tremendous). What I do is use GATK or ANGSD to get a list of likely variants, use samtools to get pileup for these, and pipe into ngsParalog. I've had mixed results with the included HMM for paralogous region detection, so I usually just filter on the basis of the LR scores.

Nate

z0on commented 3 years ago

Thanks! I would like to point out that filtering based on FDR-corrected p-values (as the ngsParalog GitHub README suggests) is actually anti-conservative, i.e., you would likely under-filter the data and leave some real paralogs in there. I would filter based on raw p-values with a high cutoff, like 0.1, to make sure I definitely got rid of all or almost all of the paralogs (this also removes ~10% of good sites, but that is usually not a big deal).

Misha


z0on commented 3 years ago

But how then do you take the result (filter-passing variant sites) back to ANGSD to get the full SFS?


nspope commented 3 years ago

Use -sites with the entire list of sites (monomorphic + variant), excluding those that don't pass the filter. It's a tedious process, which is why filtering on the basis of heterozygotes is still useful (if less principled).

edit: just to clarify, what I'm talking about is two passes through the data -- use angsd to get a list of likely segregating sites (being anticonservative), run pileup for these sites through ngsParalog to get a list of likely paralogs, then run through angsd again using -sites to get the SFS or whatever else. A sketch of the final list-building step is below.
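
(A minimal Python sketch of building that final list; the file names and the chrom<TAB>pos format are assumptions.)

# keepers.sites = all sites passing basic filters, minus the variant sites
# that ngsParalog flagged as likely paralogs.
flagged = set()
with open("paralog.sites") as f:          # variants failing the LR filter
    for line in f:
        chrom, pos = line.split()[:2]
        flagged.add((chrom, pos))

with open("all.sites") as src, open("keepers.sites", "w") as out:
    for line in src:                      # monomorphic + variant sites
        chrom, pos = line.split()[:2]
        if (chrom, pos) not in flagged:
            out.write(chrom + "\t" + pos + "\n")
# then: angsd sites index keepers.sites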

TeresaPegan commented 3 years ago

Thanks to both of you for these comments, it's very helpful!

TeresaPegan commented 3 years ago

@nspope, if you don't mind, could you let me know how you get a site list that includes all sites (including all monomorphic) minus the ngsParalog filtered ones?

I thought I would be able to do this by generating:

1) a list of SNPs
2) a list of ngsParalog-filtered SNPs
3) a full list of sites covered in the dataset, including monomorphic sites

And then I planned to get the monomorphic sites from 3) and merge that list with the ngsParalog-filtered SNPs in 2).

It's step 3 that is causing me trouble. I thought that I would be able to use ANGSD to generate a beagle file of all covered sites, including monomorphic, along with their various statistics, by running some ANGSD code without a snp_pval or minmaf filter. I used this code:

$ANGSDIR/angsd -b $BAMS -out $OUT -doMaf 1 -doGlf 2 -doMajorMinor 1 -anc $ANC \
-minMapQ 20 -minQ 20 -doPost 2 -GL 2 -dosnpstat 1 -nThreads 4 -doCounts 1 -doDepth 1 \
-doHWE 1 -maxHetFreq 1.0 -SNP_pval 10.0

I used -maxHetFreq 1 and -SNP_pval 10.0 (way more than any p value should be) thinking this would return all of the sites in the genome that are covered and pass the mapQ and minQ filters, even those that are monomorphic. However, my output files clearly only contain SNPs because the number of sites in the beagle file matches the number of sites in the mafs file, for which all pk-em values are quite small (indicating they are all indeed SNPs).

Apparently something in my code snippet above is causing ANGSD to discard all of the monomorphic sites. Is it because I am using -doMaf? It's not clear to me from the ANGSD documentation which of these arguments would cause this behavior.

Do you use some ANGSD code to get the list of monomorphic sites to make your sites list? Or do you use something like mpileups?

Thanks! -Teresa

z0on commented 3 years ago

Interesting! Did you try just omitting -SNP_pval and -maxHetFreq? (That's how I usually get all sites.)

Misha


TeresaPegan commented 3 years ago

Ok, apparently it was because I set -SNP_pval at 10.0, which was a stupid thing to do (I just wanted to be sure I got all the sites even if some ended up with p-values over 1 for any reason -- maybe impossible anyway though). When I set -SNP_pval at 1.0, I got the same number of sites I get when I omit the -SNP_pval argument altogether, as you suggested, and I can see that most of these sites have pK-EM = 1.000000. By contrast, setting -SNP_pval at 10.0 causes the SNPs to be filtered. I have no idea what the filter is, but the returned sites all have pK-EM < 1.0000. I never would have guessed! Maybe it's something I am not understanding about floats. Thanks, -Teresa

nspope commented 3 years ago

Yeah, edge case or not, that's not desirable. My guess is that it's because of this call to the chi-square quantile function

https://github.com/ANGSD/angsd/blob/c3c91e4de7c43885a4f4c7e78a38191fd7ed493b/abcFreq.cpp#L164

that will have a negative argument when SNP_pval > 1 ... and who knows what the implementation returns for negative numbers (it should return NaN). I'll fix it next time I put together a PR.
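
(As an illustration of the edge case, using scipy's chi-square quantile as a stand-in for whatever implementation ANGSD calls internally; the exact 1 - SNP_pval argument is my reading of the description above:)

# A likelihood-ratio SNP test presumably thresholds at chi2.ppf(1 - SNP_pval, df=1);
# the argument goes negative once SNP_pval > 1, and the quantile is then undefined.
from scipy.stats import chi2

print(chi2.ppf(1 - 1.0, df=1))    # 0.0 -> threshold of zero, nothing gets filtered
print(chi2.ppf(1 - 10.0, df=1))   # nan -> undefined threshold, erratic filtering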