ANGSD / angsd

Program for analysing NGS data.

Separate pops or all-in-one while doMaf #56

Closed wlz0726 closed 7 years ago

wlz0726 commented 7 years ago

Hi, I have 5 domestic populations and 1 wild population. When I run doMaf to get SNP positions, should I do this separately (in 6 populations), use all individuals in one single BAM list (all in one population), or just use 2 populations (wild and domestic)?

Does it make a big difference in angsd?

I assume population structure within the sample will bias the GL estimation and the LRT for SNPs, so I should always do it separately (for each population)?

Am I right?

Thanks.

biozzq commented 7 years ago

Hi @ANGSD

I have the same questions. We really need your help, Thanks.

Best

ANGSD commented 7 years ago

It depends on what analysis you are interested in afterwards. The GLs are calculated independently per sample. If you are interested in per-population analysis, you should of course do the analysis per population. If you are interested in multi-population analysis, you should use all populations.
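As a rough illustration of the two options, the sketch below only assembles (and does not run) ANGSD command lines, once per population and once for all samples pooled. The BAM list file names, output prefixes, and the `-SNP_pval` cutoff are hypothetical placeholders, and the flag combination is one common choice, not a recommendation from this thread.

```python
def angsd_maf_command(bam_list, out_prefix):
    """Build (not run) an ANGSD call that estimates allele frequencies from GLs."""
    return [
        "angsd",
        "-bam", bam_list,        # text file with one BAM path per line
        "-GL", "1",              # SAMtools genotype-likelihood model
        "-doMajorMinor", "1",    # infer major/minor allele from the GLs
        "-doMaf", "1",           # allele frequencies with fixed major and minor
        "-SNP_pval", "1e-6",     # LRT p-value cutoff (hypothetical value)
        "-out", out_prefix,
    ]

# Hypothetical BAM lists: one per population, plus one pooling every sample.
bam_lists = {
    "wild": "wild.bamlist",
    "domestic1": "domestic1.bamlist",
    "all": "all.bamlist",
}
commands = {
    name: angsd_maf_command(path, "out_" + name)
    for name, path in bam_lists.items()
}

for name, cmd in commands.items():
    print(name, "->", " ".join(cmd))
```

Because the GLs are computed per sample, only the steps that combine samples (allele frequency estimation, the SNP test) differ between the per-population and pooled runs.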

wlz0726 commented 7 years ago

Hi @ANGSD , Thanks for the reply.

I'd like to add a few things.

I have some populations with unbalanced sample sizes (with low to medium sequencing depth), for example: 10 samples in PopA, 20 in PopB, 50 in PopC, and 50 in PopD.

I'd like to do some population-based (right?) SNP filtering such as HWE, MAF, sample missing percentage (nInd in mafs.gz), SB3, baseQ_Pval, etc. It will bias the results for the population with a small sample size (PopA) if I do it with all samples (PopA + PopB + PopC + PopD).
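To make the concern about PopA concrete, here is a small worked example using the sample sizes given above. The allele frequency (25% in PopA, absent elsewhere) and the MAF threshold of 0.05 are hypothetical values chosen for illustration, and individuals are assumed to be diploid.

```python
# Sample sizes from the example above (diploid individuals).
sample_sizes = {"PopA": 10, "PopB": 20, "PopC": 50, "PopD": 50}

def pooled_frequency(pop_freqs, sizes):
    """Allele frequency when all populations are analysed as one sample."""
    copies = sum(2 * sizes[p] * pop_freqs.get(p, 0.0) for p in sizes)
    chromosomes = sum(2 * n for n in sizes.values())
    return copies / chromosomes

# Hypothetical allele: 25% in PopA, absent from the other populations.
private_to_popA = {"PopA": 0.25}
pooled = pooled_frequency(private_to_popA, sample_sizes)
min_maf = 0.05  # hypothetical filter threshold

print(f"PopA frequency:   {private_to_popA['PopA']:.4f}  passes: {private_to_popA['PopA'] >= min_maf}")
print(f"Pooled frequency: {pooled:.4f}  passes: {pooled >= min_maf}")
```

Here the allele has 5 copies among PopA's 20 chromosomes, but only 5 among 260 once all samples are pooled, so a pooled MAF filter would discard a site that is clearly variable within PopA.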

In my understanding, here is what I should do:

  • do the SNP calling in each population separately
  • do the filtering in each population (HWE, MAF, nInd, SB3)
  • merge the SNP sites from the different populations (overlaps), possibly filtering tri-allelic sites, to generate the "final SNP sites"
  • generate GL files based on these "final SNP sites" with the -sites parameter
  • do phasing with beagle and get the genotype data

Is this a proper way?

Thanks

ANGSD commented 7 years ago

This sounds correct if you want to do SNP and genotype calling. However, many good analyses in angsd are based on the raw GLs, and you might not need to do genotype calling.
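The merge step of the workflow proposed in the question (combine the per-population SNP sites and drop tri-allelic ones) could be sketched as follows. This is a minimal illustration, not part of ANGSD: it assumes the sites have already been parsed into dictionaries, takes the union of sites across populations (the question does not say whether union or intersection is intended), and uses made-up coordinates and alleles.

```python
def merge_sites(per_pop_sites):
    """Union of per-population SNP sites, dropping sites with more than two alleles.

    per_pop_sites: {population: {(chrom, pos): (major, minor)}}
    Returns a sorted list of (chrom, pos, major, minor) tuples.
    """
    alleles = {}  # (chrom, pos) -> set of alleles seen in any population
    first = {}    # (chrom, pos) -> allele pair from the first population seen
    for sites in per_pop_sites.values():
        for key, pair in sites.items():
            alleles.setdefault(key, set()).update(pair)
            first.setdefault(key, pair)
    return sorted(
        (chrom, pos, *first[(chrom, pos)])
        for (chrom, pos), seen in alleles.items()
        if len(seen) <= 2  # keep biallelic sites only
    )

# Hypothetical filtered SNP sites for two populations.
per_pop = {
    "PopA": {("chr1", 100): ("A", "G"), ("chr1", 250): ("C", "T")},
    "PopB": {("chr1", 100): ("G", "A"), ("chr1", 250): ("C", "A"),
             ("chr2", 40): ("T", "C")},
}

final_sites = merge_sites(per_pop)
# Tab-separated lines: chromosome, position, major, minor.
lines = ["\t".join(str(field) for field in site) for site in final_sites]
print("\n".join(lines))
```

In this toy input, chr1:250 is dropped because the two populations together show three alleles (C, T, A), while chr1:100 is kept even though major and minor are swapped between populations. A file written from `lines` would then typically be indexed with `angsd sites index` before being passed to `-sites`.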

wlz0726 commented 7 years ago

Yeah, I know that most follow-up analyses in angsd/ngsTools are based on GLs (including the monomorphic sites as background, which helps a lot when computing posterior probabilities of allele frequencies or summary statistics).

I want to make sure that I'm doing the right thing when I need genotypes. Now I'm more confident about that. Thank you.

All the best

ANGSD commented 7 years ago

Super, I'll close this issue. Feel free to reopen if needed.