ANGSD / angsd

Program for analysing NGS data.
231 stars 51 forks source link

Best way to calculate the probability that one sample carries one allele #250

Open biozzq opened 5 years ago

biozzq commented 5 years ago

Dear all,

Now, to trace the allelic trajectories in ancient genomes, we usually want to statistic the precise genotypes of some loci which played a more important role in adaption evolution. However, as the low sequencing depth of ancient samples and the unknow causal variants in these loci, it will be difficult for me to give a precise genotype. Through the published literatures, I found that the probability that one sample carries one allele has been used during their analysis. I want to imitate them; however, some problem occurred to me. Hope you can give me some suggestions.

1, genotype likelihood (GL) or genotype probabilities (GP), which one could be the best estimation? From the VCF generated using ANGSD, I found the GL can be more consist with the alignment of the reads, such as following: DP:AD:GP:GL 2:0,2:0.000000,0.675811,0.324189:-7.799987,-0.602068,0.000000 the GP field suggests 0/1 as the most likely genotype. The GL filed shows 1/1 is the most probable (Only two reads support the "1" allele, so GL can be more accurate). Such cases seem to occur with really low read depths and genotype qualities. The command used to generate the VCF like following angsd -b $bam -remove_bads 1 -uniqueOnly 1 -ref reference.fa -out $out -doMaf 1 -skipTriallelic 1 -doMajorMinor 1 -GL 1 -doCounts 1 -P 4 -doGlf 2 -minQ 20 -minMapQ 25 -minMaf 0.05 -doVcf 1 -doPost 1

2, As the GLs are computed for each different genotype (0/0, 0/1, 1/1), if I want to know the probability of one allele (0 or 1) at this site, which is the best way? Can I use the GL of 0/0 plus the half of 0/1?

kind regards, Zhuqing

ANGSD commented 5 years ago

Do you have multiple individuals?

biozzq commented 5 years ago

Dear @ANGSD

Yes, I have multiple individuals for ancient samples and I ran ANGSD using bam list including all samples.

Best regards, Zhuqing

biozzq commented 5 years ago

Dear @ANGSD,

I hope you can take some time to help me clear this problem. Thank you very much.

Kind regards, Zhuqing

z0on commented 5 years ago

would it be posterior probability of not being a heterozygote?.. as in, -doGeno 8 and simply sum up probabilities of two homozygous? (sounds too easy, apparently i'm missing something)

On Oct 11, 2019, at 4:27 AM, biozzq notifications@github.com wrote:

Dear @ANGSD,

I hope you can take some time to help me clear this problem. Thank you very much.

Kind regards, Zhuqing

— You are receiving this because you are subscribed to this thread. Reply to this email directly, view it on GitHub, or unsubscribe.

nspope commented 5 years ago
  1. You are essentially asking why genotype probabilities differ from genotype likelihoods. The reason is that genotype probs can utilize external information via a prior. If a uniform (uninformative) prior is used, then GP_i = exp(GL_i)/sum_j exp(GL_j) by Bayes' rule, and the two are equivalent. However, this is rarely a good idea for low depth data, because by definition the data at a single sample/site contain little information about the true genotype.

It is better to incorporate population-level information. The two ways that ANGSD can do this are both Empirical Bayes (= using priors estimated from the entire data). First, via estimated frequency of that particular allele, which effectively shares information across samples at a given site. Second, via the estimated site frequency spectrum, effectively sharing information across multiple samples and sites. You can get more details in the Nielsen https://journals.plos.org/plosone/article?id=10.1371/journal.pone.0037558 et al 2012 PLoS paper https://journals.plos.org/plosone/article?id=10.1371/journal.pone.0037558

The other accurate methods of calling genotypes that I know of (e.g. Maruki & Lynch 2017 https://www.ncbi.nlm.nih.gov/pubmed/28108551) also incorporate population-level information. I repeat, probably not a good idea to use an uninformative prior on individual genotypes with low-depth data, even if you are going to be working downstream with genotype probs and not hard calls.

  1. As z0on says, once you have the genotype probabilities in hand, it's just summation (to make probabilistic statements about counts in a single sample) or convolution (to make probabilistic statements about counts across multiple samples). I am not really clear about what you are after, but based on context it sounds like the probability that a given sample has at least one copy of a derived allele ... that is Pr(homozygote for derived allele) + Pr(heterozygote).

Nate

nspope commented 5 years ago

Or, perhaps you mean the probability that a single copy drawn at random from the sample carries the derived allele? ... that is Pr(homozygous for derived allele) + 0.5 Pr(heterozygous)