Closed claudiuskerth closed 5 years ago
Hi Claudius, the rmtriallelic calculates the pvalue of the site being triallelic by calculating the frequencies using the genotype likelihoods. The code for calculating the lrt for which the pvalue is calculated is found here: https://github.com/ANGSD/angsd/blob/fd1af06104fb488c2099dde057fa43f8bcdb6639/abcFreq.cpp#L959
The skipTriallelic is only used when you force a major and minor from external fastafiles. I dont think this would change anything but some of the saf based analysis.
Hi Thorfinn,
many thanks for those hints!
The skipTriallelic is only used when you force a major and minor from external fastafiles.
Does that mean -skipTriallelic
is ineffective unless -domajorminor [3,4,5]
is specified? Does this mean I need to know which of [A, C, T, G] is the major and minor allele at each site in advance in order to detect triallelic sites? In Matteo's tutorial, I see command lines that use -doMajorMinor 1 -doMaf 1 -skipTriallelic 1
. Would that be correct or is -doMajorMinor 1
getting the major and minor alleles info?
If I am not mistaken, the function first uses an EM algorithm to estimate the population frequencies of all 4 possible nucleotides at a site. It then returns the D statistic of the LRT between "allowing more than two nucleotides to have frequency > 0" and "allowing only two nucleotides to have frequency > 0". The p-value must be calculated elsewhere. Is it printed out somewhere or used to filter sites (if yes, which p-value cutoff is used)?
I was hoping I could use -skipTriallelic
to detect incorrect assembly of my reference, i. e. collapsing of duplicated and divergent sequences. I would like to be able to exclude regions where non-biallelic sites are found (assuming that apparently non-biallelic sites are due to misassembly and mismapping and otherwise very rare).
many thanks for your help
claudius
Yes, the skip triallelic, does this:
int a=refToInt[pars->anc[s]];
int b=refToInt[pars->major[s]];
int c=refToInt[pars->minor[s]];
if(a!=b&&a!=c)
pars->keepSites[s]=0;
So if none of the alleles given by major and minor (estimated with -doMajorminor), is not the ancestral the site will be skipped.
When you run with -rmTriallelic then you supply a pvalue, and that value is used for discarding the site. Internally in the code it calculates the LRT which is transformed to a pvalue assuming chi-square distribution. Hope this helps, otherwise write back.
Best
On Thu, Oct 19, 2017 at 12:22 PM, Claudius Kerth notifications@github.com wrote:
Hi Thorfinn,
many thanks for those hints!
The skipTriallelic is only used when you force a major and minor from external fastafiles.
Does that mean -skipTriallelic is ineffective unless -domajorminor [3,4,5] is specified? Does this mean I need to know which of [A, C, T, G] is the major and minor allele at each site in advance in order to detect triallelic sites? In Matteo's tutorial https://github.com/mfumagalli/ngsTools/blob/master/TUTORIAL.md, I see command lines that use -doMajorMinor 1 -doMaf 1 -skipTriallelic 1. Would that be correct or is -doMajorMinor 1 getting the major and minor alleles info?
If I am not mistaken, the function first uses an EM algorithm to estimate the population frequencies of all 4 possible nucleotides at a site. It then returns the D statistic of the LRT between "allowing more than two nucleotides to have frequency > 0" and "allowing only two nucleotides to have frequency > 0". The p-value must be calculated elsewhere. Is it printed out somewhere or used to filter sites (if yes, which p-value cutoff is used)?
I was hoping I could use -skipTriallelic to detect incorrect assembly of my reference, i. e. collapsing of duplicated and divergent sequences. I would like to be able to exclude regions where non-biallelic sites are found (assuming that apparently non-biallelic sites are due to misassembly and mismapping and otherwise very rare).
many thanks for your help
claudius
— You are receiving this because you commented. Reply to this email directly, view it on GitHub https://github.com/ANGSD/angsd/issues/113#issuecomment-337878500, or mute the thread https://github.com/notifications/unsubscribe-auth/AGDo7g-nisj0unLP50vbkUcTtDuNaObpks5stzD6gaJpZM4P2EOI .
Ah, thanks Thorfinn, that's much clearer now.
To rephrase would you said, the supplied ancestral sequence (I use the reference as ancestral) needs to contain the major or the minor allele. That's no problem and makes sense. So rmTriallelic
can be used for filtering when used with -domaf
.
many thanks for your help!
claudius
-domajorminor
has a skiptriallelic. How is that different from the rmtriallelic of domaf
? Wouldn't it be good to give -dosaf
a rmtriallelic option as well?
claudius
Hello Claudius, Yes, I agree. That makes sense. Ill make an issue regarding this, so I dont forget.
Thanks
Hi, I did
angsd -bam /home/data/madzays/finch_data/combruns_BF_genomes/recal_read_bams/bam.list -dovcf 1 -gl 1 -doGeno 6 -dopost 1 -doMaf 1 -doMajorMinor 1 -minMaf 0.1 -SNP_pval 1e-6 -out /home/data/madzays/finch_data/combruns_BF_genomes/recal_read_bams/angsd/RSFV1_doGeno6_recal_reads
but didnt include -skipTriallelic 1 Does this mean that my vcf has triallelic sites? Would there be a way of filtering out from the vcf? Thanks
Dear @madzayasodara, yes from the command you would have included triallelic sites in your analysis. @claudiuskerth the building of the sample allele frequencies in the abcSaf.cpp loops over all possible nonancestral alleles so it should take into account sites that are triallelic.
Im closing this issue, feel free to reopen if needed.
Best
Hi,
what does the switch
-skipTriallelic
do exactly, i. e. how does ANGSD detect triallelic SNP's? Does it calculate L(D | <A1, A2, A3>) or does it just discard all sites where three different bases are found no matter their quality?cheers
claudius