ANGSD / angsd

Program for analysing NGS data.

Questions regarding SFS_ancestral state #350

Closed HanXiaoEvo closed 4 years ago

HanXiaoEvo commented 4 years ago

Hi everyone,

I am trying to make an SFS using ANGSD but got a bit confused by the ancestral state.

My case is ddRADseq paired-end reads of a fish. Samples are generated as fq.gz for read 1 and read 2 for each sample. I was told that I should use my reference genome as the ancestral state, but as the genome was made from a Canadian sample, I am wondering if this is the right approach, and whether it is better than using a reference population from the same country as the samples to make the SFS. If the reference population is better, how should I use both reads in fq?

Your opinions will be highly appreciated! Thank you very much!

Best regards, Han Xiao

z0on commented 4 years ago

Hi Han - a paradoxical thing is, to make a nice unfolded SFS (i.e., to know which allele is ancestral and which one is derived), the best choice of reference is not the genome of your study population and not even the genome of your study species, but a genome of a related sister species. This is called "polarizing". The key assumption here is that the majority of variants in your study species arose after the split from the sister species, and hence the SNP state that is different from the sister reference must be derived. This is of course not always true (because of incomplete lineage sorting), but demographic inference methods can typically account for some small degree of misidentification of the ancestral state.
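
A minimal sketch of the polarizing idea, assuming toy data and hard base calls (ANGSD itself works from genotype likelihoods, so this only illustrates the logic and is not what the program does internally). The function name and the input layout are hypothetical:

```python
def unfolded_sfs(sites, n_chrom):
    """Build an unfolded SFS by treating the outgroup base as ancestral.

    sites: list of (outgroup_base, sampled_bases) tuples (toy input format).
    Returns a list of length n_chrom + 1 where entry k is the number of
    sites carrying k derived allele copies.
    """
    sfs = [0] * (n_chrom + 1)
    for outgroup_base, sampled_bases in sites:
        if len(sampled_bases) != n_chrom:
            continue  # toy version: skip sites with missing data
        # Any base that differs from the outgroup is counted as derived.
        derived = sum(1 for base in sampled_bases if base != outgroup_base)
        sfs[derived] += 1
    return sfs


# Made-up sites: outgroup (sister species) base, then 6 sampled chromosomes.
toy_sites = [
    ("A", list("AAAAAG")),  # 1 derived copy
    ("C", list("CCTTCC")),  # 2 derived copies
    ("G", list("GGGGGG")),  # invariant
    ("T", list("CCCCCT")),  # 5 derived copies
]
print(unfolded_sfs(toy_sites, n_chrom=6))  # [1, 1, 1, 0, 0, 1, 0]
```

The last toy site shows why the outgroup matters: polarized against the minor allele it would look like a singleton, but against the outgroup it is a high-frequency derived variant.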

In your case the outlying population is definitely a better reference for polarizing, but it might not be divergent enough! So, to be on the safe side, you might wish to use folded AFS, in which case it is fine to use either Canadian or the individuals from your own population as reference.
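
For reference, folding can be sketched as below. This is a minimal illustration assuming the unfolded spectrum is a plain list of counts indexed by derived allele count; `fold_sfs` is a hypothetical helper, not an ANGSD function:

```python
def fold_sfs(unfolded):
    """Fold an unfolded SFS so that it is indexed by minor allele count.

    unfolded[k] is the number of sites with k derived copies out of n
    chromosomes. Entries k and n - k are summed because, without a known
    ancestral state, they cannot be told apart.
    """
    n = len(unfolded) - 1
    folded = []
    for k in range(n // 2 + 1):
        mirror = n - k
        # The middle class (k == n - k) has no partner and is kept as-is.
        folded.append(unfolded[k] + (unfolded[mirror] if mirror != k else 0))
    return folded


print(fold_sfs([1, 1, 1, 0, 0, 1, 0]))  # [1, 2, 1, 0]
```

Folding discards the ancestral/derived distinction, which is why the choice of reference matters much less for a folded spectrum.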

cheers Misha

HanXiaoEvo commented 4 years ago

Thank you Misha and it helps a lot!

So to make a better unfolded SFS, as I am working on a salmonid fish, Arctic charr, using an Atlantic salmon genome should be a better idea, right? I was a bit worried that the charr populations I am working on are so closely related (sympatric morphs with Fst 0.02-0.05), so I was thinking that using another charr reference population might help increase the resolution. But apparently this is not the point of "polarizing". When I checked the human/chimp example I was a bit confused about how much divergence is optimal, and I think you have answered it perfectly. Thank you again!

Cheers, Han

z0on commented 4 years ago

Hi Han - Yes, I would use Atlantic salmon as reference in this case. It is very unlikely that you would lose a non-trivial proportion of sites by doing that (check your mapping efficiency compared to same-species mapping though, just in case), so your resolution will be fine. Fst 0.02-0.05 is pretty high for RAD data actually, pop differentiation will be easy to see.
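
One way to check mapping efficiency is to compare the mapped fraction of the same reads against each reference. The sketch below assumes the text layout printed by `samtools flagstat` (lines such as `760 + 0 mapped (76.00% : N/A)`); the numbers are made up and `mapped_fraction` is a hypothetical helper:

```python
import re


def mapped_fraction(flagstat_text):
    """Return mapped / total QC-passed reads from flagstat-style text."""
    total = mapped = None
    for line in flagstat_text.splitlines():
        match = re.match(r"(\d+) \+ (\d+) (.*)", line.strip())
        if not match:
            continue
        passed, label = int(match.group(1)), match.group(3)
        if label.startswith("in total"):
            total = passed
        elif label.startswith("mapped ("):
            mapped = passed
    if not total or mapped is None:
        raise ValueError("could not find total and mapped read counts")
    return mapped / total


# Made-up example output for reads mapped to the sister-species genome.
example = """1000 + 0 in total (QC-passed reads + QC-failed reads)
0 + 0 secondary
0 + 0 supplementary
0 + 0 duplicates
760 + 0 mapped (76.00% : N/A)
"""
print(f"{mapped_fraction(example):.2%}")  # 76.00%
```

Running the same check on the same-species alignment gives the baseline to compare against.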

With unfolded SFS it would also be great to do all sorts of SFS-based demography (stairwayPlot, dadi / moments), which is super sensitive and informative about population processes.

Cheers Misha

HanXiaoEvo commented 4 years ago

Thank you Misha!!!!

I am quite new into it so I am wondering if I can just ask two more questions, not so relevant to ANGSD but choosing samples for demographic analyses in general.

  1. As the Fst is clear but weak (if I can say that), with a simple PCA or structure-like plot I do observe some clustering pattern, but there are some hybrids. It seems that people tend to exclude the hybrids because they are not due to demographic processes and so may confound the results. Is that the reason?

  2. Because I work with sympatric morphs (4 in total), for SNP calling for popG I didn't exclude markers out of HWE, as there is still possible gene flow. However, for the demographic calculation, if I use the simple -doSaf 1 assuming HWE, what will be the impact? I will for sure check other options that take inbreeding into account, but I am super curious about such considerations.

Thank you very much again!

Cheers, Han

TonyKess commented 4 years ago

Does the ancestral/outgroup genome need to be filtered only for sites that are in synteny with the focal species? I.e., if I'm using the charr and salmon genomes, would I need to do a genome-genome alignment and only pick conserved regions first, or is any of that handled internally by ANGSD?

z0on commented 4 years ago

Hi Tony - Tbh I always just went ahead with mapping, assuming a unique (i.e., high mapping quality) match is an orthologous match. Orthologous regions don’t have to be syntenic after all. Misha

z0on commented 4 years ago

Hi Han - I would exclude hybrids when analyzing sympatric morphs (i.e., when their genetic divergence is clearly not due just to spatial separation). If morphs are separated in space, leave the hybrids, they will provide valuable SFS info for demographic inference. Looks like you better remove yours!

Definitely do not apply the HWE filter, I agree. Use allele bias and genotyping rate (minInd) filters, plus strand bias if you have WGS, GBS or 2bRAD (but not any other RAD) data.

Misha

HanXiaoEvo commented 4 years ago

Thank you Misha!

So for the sympatric morphs, the hybrids should be removed because they may not reflect historical demography but rather contemporary ongoing gene flow, right? What if there is one morph, say the fish eater, that possibly originates from ontogenetic shifts in a pelagic morph? I mean, genetically almost all fish eaters are not distinct (Fst with the pelagic morph is 0.02, including hybrids from another morph).

I guess the case is quite tricky, so the strategy should be to work on the distinct morphs first :) And to respond to Tony's question, I don't need to do any pre-mapping of the salmon genome to the charr, right?

Cheers, Han

TonyKess commented 4 years ago

Thanks Misha - to clarify, I should be using reads aligned to the ancestral genome and then specify that ancestral genome using the -anc flag?

z0on commented 4 years ago

That's what I would do, yes! Just check if the mapping efficiency is OK

HanXiaoEvo commented 4 years ago

Oh Misha I got a bit confused now about the alignment. Should I use the bam file aligned with a charr genome and then use a salmon genome as the ancestral state? Or should I use the bam file of charr aligned to a salmon genome then use the salmon genome as the ancestral state? Thank you!

Cheers, Han

z0on commented 4 years ago

Hi Han - I usually do the latter, i.e. align to a sister species genome and use both as -ref and -anc. Misha
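
A sketch of what such a run could look like, written as a small Python helper that assembles the command line. The file names, the quality thresholds of 20, and the `-minInd` value are illustrative assumptions, not recommendations from this thread; the flags themselves (`-b`, `-ref`, `-anc`, `-GL`, `-doSaf`, `-minMapQ`, `-minQ`, `-minInd`, `-out`) are standard ANGSD options:

```python
import shlex


def angsd_saf_command(bamlist, sister_genome, out_prefix, min_ind):
    """Assemble an ANGSD call that uses one genome as both -ref and -anc.

    bamlist: text file listing BAMs aligned to the sister-species genome.
    sister_genome: FASTA of the sister species (hypothetical file name).
    """
    return [
        "angsd",
        "-b", bamlist,
        "-ref", sister_genome,   # reads were aligned to this genome
        "-anc", sister_genome,   # and the same genome polarizes alleles
        "-GL", "1",              # genotype likelihood model
        "-doSaf", "1",           # per-site sample allele frequency likelihoods
        "-minMapQ", "20",        # assumed threshold for confident mapping
        "-minQ", "20",           # assumed base quality threshold
        "-minInd", str(min_ind),
        "-out", out_prefix,
    ]


cmd = angsd_saf_command("morph1.bamlist", "salmon.fasta", "morph1", 16)
print(shlex.join(cmd))
```

The resulting `morph1.saf.idx` would then be passed to `realSFS` to estimate the spectrum.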

z0on commented 4 years ago

...otherwise you would need to align the two genomes first (as Tony suggested). This is a lot of work and I don't quite see the advantage (I may be wrong though, would love to hear alternative opinion!) Misha

HanXiaoEvo commented 4 years ago

Hi Misha, thanks a lot! I already have the bam files aligned to a salmon genome, which I created earlier to compare the PCA results. The alignment of charr to salmon still gives me around 70% to 80% mapping success, so I think it should be fine (but I guess not as high as human to chimp).

Thank you very much again for all the valuable discussion, as people normally don't mention anything about SFS generation in their papers!

Cheers, Han

z0on commented 4 years ago

:) happy to help! On advice about handling genomics of unusual creatures : shameless plug - please have a look at https://matzlab.weebly.com/uploads/7/6/2/2/76229469/fantasticbeastssequence.pdf

HanXiaoEvo commented 4 years ago

Hi Misha, I checked your lab page haha and the paper looks cool! Will check it in more details this weekend as doing 6 hours-zoom per day makes my eyes quite unhappy :P Just to report, that I made my first (finally) ANGSD run for SFS estimation using the salmon genome as the ancestral state for three morphs a few seconds ago! (screen*3 overnight)

Back to the HWE question: as I removed the hybrids (as well as bad-quality samples and messy PCA samples), is it OK to just do -doSaf 1? The help for -doSaf 2 says that its outputs are not sample allele frequency likelihoods but sample allele posteriors. What should I take into account there? Thank you!

Cheers, Han

z0on commented 4 years ago

(I think you mean -doSaf 3 as alternative to -doSaf 1) Just do -doSaf 1 - unless you suspect variation in per-individual heterozygosity (inbreeding), in which case it is a bit more involved - but I don't think it is your case. Misha
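
To see why inbreeding matters here, this is a minimal sketch of the textbook genotype frequencies with an inbreeding coefficient F. It is not ANGSD code, and the function name is hypothetical; F = 0 recovers the Hardy-Weinberg proportions assumed in the simple case:

```python
def genotype_freqs(p, F=0.0):
    """Expected genotype frequencies (AA, Aa, aa) for allele frequency p.

    F is the inbreeding coefficient: F = 0 gives Hardy-Weinberg
    proportions, F > 0 shifts heterozygotes into both homozygote classes.
    """
    q = 1.0 - p
    return (p * p + F * p * q, 2 * p * q * (1 - F), q * q + F * p * q)


print(genotype_freqs(0.5))         # (0.25, 0.5, 0.25)
print(genotype_freqs(0.5, F=1.0))  # (0.5, 0.0, 0.5)
```

If individuals differ in F, a model that assumes F = 0 expects more heterozygotes than are actually present, which is the situation where the more involved options become relevant.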

TonyKess commented 4 years ago

Another question on this thread, as it seems relevant to discussing how to use ancestral state information. If I'm interested in using a program like VolcanoFinder (http://degiorgiogroup.fau.edu/Manual_VolcanoFinder_v1.0.pdf), which requires both spatial information and polarized alleles, is there an easy way to match spatial information from the reference genome of our species of interest with the ancestral genome? Tony

z0on commented 4 years ago

Hi Tony - Very cool method! Thanks for sharing - I was not aware of it. I am not sure how to properly align genomes (feels like it might be important in this case, but I never faced this challenge), but just to try VolcanoFinder and see if there is anything to catch, I would forge ahead with both -ref and -anc set to the sister species genome. Misha

HanXiaoEvo commented 4 years ago

Hi Misha, I still have one question regarding the use of an ancestral state. I am checking demographic papers on salmonids and found that almost no one ever used the unfolded SFS. The only reason I could find is: "The minor folded site frequency spectrum was used due to the lack of a trinucleotide substitution matrix for salmonids and sequencing data for outgroup species."

How should I deal with it? I am not quite sure whether it matters as most people used unfolded sfs without saying anything. Thank you very much!

Best regards, Han

z0on commented 4 years ago

Hi Han - folded is definitely more conservative and would likely show broadly similar results. On the other hand, unfolded AFS models can account for the proportion of misidentified ancestral states, so an “imperfect” outgroup (i.e., too close to the focal species) is not a deal-breaker. I am not really sure what the “trinucleotide substitution matrix” has to do with this issue, to be honest.
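
A minimal sketch of how misidentification can be modelled, assuming a single proportion `p_misid` of sites whose ancestral state is flipped. The function name and the toy spectrum are hypothetical, and real inference tools fit this proportion together with the demographic parameters:

```python
def apply_misid(true_sfs, p_misid):
    """Mix an unfolded SFS with its mirror image.

    A site with k derived copies out of n whose ancestral state is
    misidentified shows up in class n - k, so the observed class k is a
    weighted sum of the true classes k and n - k.
    """
    n = len(true_sfs) - 1
    return [
        (1 - p_misid) * true_sfs[k] + p_misid * true_sfs[n - k]
        for k in range(n + 1)
    ]


# Toy spectrum for n = 4 chromosomes with 10% misidentified sites.
observed = apply_misid([0, 100, 40, 10, 0], 0.1)
print(observed)
```

The signature of misidentification is an inflated high-frequency derived class (here class 3 gains counts from class 1), which is what the extra model parameter absorbs.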

So I would try both folded and unfolded, see if you can already make your point with folded AFS.

Have a look at this repo, maybe it is helpful (warning: shameless plug): https://github.com/z0on/AFS-analysis-with-moments

Cheers Misha

HanXiaoEvo commented 4 years ago

Hi Misha,

Never feel ashamed to share your wisdom! I am mostly working with popG and am new to demographic analysis. What made me hesitate to run the models is that I am not sure whether the way I generate the SFS is proper enough NOT to confound the downstream model.

As I am working with ddRADseq, I have seen people using one SNP per locus, but then the ratio of SNPs to invariant sites in the spectrum is skewed, which left me unsure whether I should follow that approach or use all SNPs (with filtering). I have also seen some limited discussion of missingness, but I am not sure how it affects the SFS (although I guess I should be able to control missingness by filtering).

Do you have any ideas or recommended readings regarding these kinds of details? As the demographic models are really sensitive and errors can be difficult to detect or interpret, I think I should be more cautious than simply running them.

Thank you very much!

Cheers, Han

z0on commented 4 years ago

Hi Han - I am very glad you are moving into a new area and want to explore! People are only starting using SFS broadly so you will be one of the leaders in the field.

Have a look at the Fantastic Beasts paper to get some broader perspective (hopefully). Main thing is, with popgen every study system is different and the data must first be explored and plotted every which way, with different filtering parameters, to see what is going on and to make a decision about proper analysis. For pop structure (PCA, ADMIXTURE) start using only high-frequency SNPs (maf>=0.05) with genotyping rate 0.8 (i.e., the site must be genotyped in 80% of all samples). For SFS, remove the maf filter (and also the snp_pval filter if using angsd) and look at 2dSFS plots - do they look reasonable or weird? Then try some models, folded or unfolded. In my experience, the inference of major aspects of the model is pretty robust to filtering and folding-unfolding, so that's good news - but you will have to confirm this with your own data.
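
The two filter sets can be summarized in a short sketch. The helper names are hypothetical, and the sketch assumes hard per-site summaries (number of genotyped samples, minor allele frequency) rather than ANGSD's likelihood-based statistics:

```python
import math


def min_ind(n_samples, genotyping_rate=0.8):
    """Smallest number of genotyped samples that meets the genotyping rate."""
    # round() guards against floating-point noise before taking the ceiling.
    return math.ceil(round(n_samples * genotyping_rate, 9))


def keep_site(n_genotyped, maf, n_samples, for_sfs):
    """Apply the genotyping-rate filter always, the maf filter only for structure."""
    if n_genotyped < min_ind(n_samples):
        return False
    if for_sfs:
        return True  # no maf filter: rare variants carry the SFS signal
    return maf >= 0.05


print(min_ind(20))                                # 16
print(keep_site(17, 0.01, 20, for_sfs=False))     # False: too rare for PCA
print(keep_site(17, 0.01, 20, for_sfs=True))      # True: kept for the SFS
```

The value returned by `min_ind` corresponds to what would be passed to ANGSD as `-minInd` for a population of that size.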

For super-expert advice on fish SFS analysis, try contacting Pierre-Alexandre Gagnaire (https://scholar.google.ca/citations?user=orGqHhAAAAAJ&hl=en), he is the boss in fish demographics.

we should probably take this off the ANGSD site - please feel free to email me at matz@utexas.edu

cheers Misha

HanXiaoEvo commented 4 years ago

Thank you Misha,

Yeah we should close this issue and keep the discussions via email!

Cheers, Han