ANGSD / angsd

Program for analysing NGS data.
230 stars 50 forks source link

-skipTriallelic #113

Closed claudiuskerth closed 5 years ago

claudiuskerth commented 7 years ago

Hi,

what does the switch -skipTriallelic do exactly, i. e. how does ANGSD detect triallelic SNP's? Does it calculate L(D | <A1, A2, A3>) or does it just discard all sites where three different bases are found no matter their quality?

cheers

claudius

ANGSD commented 7 years ago

Hi Claudius, the rmtriallelic calculates the pvalue of the site being triallelic by calculating the frequencies using the genotype likelihoods. The code for calculating the lrt for which the pvalue is calculated is found here: https://github.com/ANGSD/angsd/blob/fd1af06104fb488c2099dde057fa43f8bcdb6639/abcFreq.cpp#L959

The skipTriallelic is only used when you force a major and minor from external fastafiles. I dont think this would change anything but some of the saf based analysis.

claudiuskerth commented 7 years ago

Hi Thorfinn,

many thanks for those hints!

The skipTriallelic is only used when you force a major and minor from external fastafiles.

Does that mean -skipTriallelic is ineffective unless -domajorminor [3,4,5] is specified? Does this mean I need to know which of [A, C, T, G] is the major and minor allele at each site in advance in order to detect triallelic sites? In Matteo's tutorial, I see command lines that use -doMajorMinor 1 -doMaf 1 -skipTriallelic 1. Would that be correct or is -doMajorMinor 1 getting the major and minor alleles info?

If I am not mistaken, the function first uses an EM algorithm to estimate the population frequencies of all 4 possible nucleotides at a site. It then returns the D statistic of the LRT between "allowing more than two nucleotides to have frequency > 0" and "allowing only two nucleotides to have frequency > 0". The p-value must be calculated elsewhere. Is it printed out somewhere or used to filter sites (if yes, which p-value cutoff is used)?

I was hoping I could use -skipTriallelic to detect incorrect assembly of my reference, i. e. collapsing of duplicated and divergent sequences. I would like to be able to exclude regions where non-biallelic sites are found (assuming that apparently non-biallelic sites are due to misassembly and mismapping and otherwise very rare).

many thanks for your help

claudius

ANGSD commented 7 years ago

Yes, the skip triallelic, does this:

int a=refToInt[pars->anc[s]];

  int b=refToInt[pars->major[s]];

  int c=refToInt[pars->minor[s]];

  if(a!=b&&a!=c)

    pars->keepSites[s]=0;

So if none of the alleles given by major and minor (estimated with -doMajorminor), is not the ancestral the site will be skipped.

When you run with -rmTriallelic then you supply a pvalue, and that value is used for discarding the site. Internally in the code it calculates the LRT which is transformed to a pvalue assuming chi-square distribution. Hope this helps, otherwise write back.

Best

On Thu, Oct 19, 2017 at 12:22 PM, Claudius Kerth notifications@github.com wrote:

Hi Thorfinn,

many thanks for those hints!

The skipTriallelic is only used when you force a major and minor from external fastafiles.

Does that mean -skipTriallelic is ineffective unless -domajorminor [3,4,5] is specified? Does this mean I need to know which of [A, C, T, G] is the major and minor allele at each site in advance in order to detect triallelic sites? In Matteo's tutorial https://github.com/mfumagalli/ngsTools/blob/master/TUTORIAL.md, I see command lines that use -doMajorMinor 1 -doMaf 1 -skipTriallelic 1. Would that be correct or is -doMajorMinor 1 getting the major and minor alleles info?

If I am not mistaken, the function first uses an EM algorithm to estimate the population frequencies of all 4 possible nucleotides at a site. It then returns the D statistic of the LRT between "allowing more than two nucleotides to have frequency > 0" and "allowing only two nucleotides to have frequency > 0". The p-value must be calculated elsewhere. Is it printed out somewhere or used to filter sites (if yes, which p-value cutoff is used)?

I was hoping I could use -skipTriallelic to detect incorrect assembly of my reference, i. e. collapsing of duplicated and divergent sequences. I would like to be able to exclude regions where non-biallelic sites are found (assuming that apparently non-biallelic sites are due to misassembly and mismapping and otherwise very rare).

many thanks for your help

claudius

— You are receiving this because you commented. Reply to this email directly, view it on GitHub https://github.com/ANGSD/angsd/issues/113#issuecomment-337878500, or mute the thread https://github.com/notifications/unsubscribe-auth/AGDo7g-nisj0unLP50vbkUcTtDuNaObpks5stzD6gaJpZM4P2EOI .

claudiuskerth commented 7 years ago

Ah, thanks Thorfinn, that's much clearer now.

To rephrase would you said, the supplied ancestral sequence (I use the reference as ancestral) needs to contain the major or the minor allele. That's no problem and makes sense. So rmTriallelic can be used for filtering when used with -domaf.

many thanks for your help!

claudius

claudiuskerth commented 7 years ago

-domajorminor has a skiptriallelic. How is that different from the rmtriallelic of domaf? Wouldn't it be good to give -dosaf a rmtriallelic option as well?

claudius

ANGSD commented 7 years ago

Hello Claudius, Yes, I agree. That makes sense. Ill make an issue regarding this, so I dont forget.

Thanks

madzafv commented 6 years ago

Hi, I did

angsd -bam /home/data/madzays/finch_data/combruns_BF_genomes/recal_read_bams/bam.list -dovcf 1 -gl 1 -doGeno 6 -dopost 1 -doMaf 1 -doMajorMinor 1 -minMaf 0.1 -SNP_pval 1e-6 -out /home/data/madzays/finch_data/combruns_BF_genomes/recal_read_bams/angsd/RSFV1_doGeno6_recal_reads

but didnt include -skipTriallelic 1 Does this mean that my vcf has triallelic sites? Would there be a way of filtering out from the vcf? Thanks

ANGSD commented 5 years ago

Dear @madzayasodara, yes from the command you would have included triallelic sites in your analysis. @claudiuskerth the building of the sample allele frequencies in the abcSaf.cpp loops over all possible nonancestral alleles so it should take into account sites that are triallelic.

Im closing this issue, feel free to reopen if needed.

Best