auton1 / LDhat

Estimate recombination rates from population genetic data
58 stars 20 forks source link

non-model species, unphased genotype data from whole-genome sequencing #5

Open vicenciooostra opened 6 years ago

vicenciooostra commented 6 years ago

Dear Adam, Gil, and other developers,

Thank you very much for developing and maintaining LDhat. I was hoping you could help me with a couple of questions, and I hope this is the right place for that.

I am using LDhat interval to estimate recombination rates along the genome in different fish populations. We are using SNP data from 30X whole-genome sequences for 4 to 5 individuals per population. I initially tried phasing with SHAPEIT and then using LDHelmet, but I abandoned phasing because I suspect it is not very accurate given our per-population sample sizes (although total N across all populations is 40), and we have no experimental crosses to estimate switch error rates (or any reference panel).

I am now using LDhat interval on unphased genotype data (converted from vcf into LDhat format using vcf-tools), running with 15 million iterations, a burn-in of 150,000, and a block penalty of 50 (after evaluating 0, 5, 10 and 20). I computed my own likelihood tables and used a single theta estimated for the whole metapopulation (as the point is to compare across populations, so I wanted to treat them equally). My first question is, do these parameters make sense?

Looking at the MCMC chains, the runs converge, in the sense that the likelihoods and the number of change points stabilise within a million iterations and then stay the same (although in one case there is a jump to a different value around the 10 millionth iteration). However the map lengths vary more between different iterations, in some cases a lot more where there is a lot of spread even in the final iterations. In some cases, but certainly not always, this occurs on chromosomes and populations that have few SNPs (e.g. 10,000 or so). Why would this be? I suspect this might give biased recombination estimates. Should I change my parameters, e.g. increase the burn-in or numbers of runs? (here are some plots to illustrate: https://www.dropbox.com/s/lona2w9egwtqcjt/ldhat_convergence_illustrative_plots.pdf?dl=0 mainly from cases that failed to converge, of the likelihoods, number of change points, and map lengths plotted against the iteration / run).

After running stats to get the actual recombination rates across the genome, I am getting sensible results overall when comparing populations. However when comparing in detail with LD calculated independently (using vcftools geno-r2) I find some regions of supposedly very low recombination where LD is also low (while it should be high if recombination is low). Averaging both measures in windows of 100 kb, the correlations between them are around -0.30. Given that LDhat uses LD to estimate rates of recombination I would’ve expected a higher (absolute) correlation between these two measures. Why do you think this is so low? Calculating LD based on genotype correlations (as vcftools does) with only 4 or 5 individuals per populations is obviously not extremely precise.

Finally, although I am not specifically looking for hotspots, do you think I should model them anyway? I am now summarising rho in 100 kb windows with at least 20 SNPs. I wonder how much my results will be biased if hotspots turn out to be very important, as I am not sure how to assess (lack of) fit of the non-hotspot model.

Thanks very much for any input you’d be able to provide. Best regards, Vicencio

auton1 commented 6 years ago

Vicencio,

A few thoughts:

Good luck!

Adam

On 20 November 2017 at 07:40, vicenciooostra notifications@github.com wrote:

Dear Adam, Gil, and other developers,

Thank you very much for developing and maintaining LDhat. I was hoping you could help me with a couple of questions, and I hope this is the right place for that.

I am using LDhat interval to estimate recombination rates along the genome in different fish populations. We are using SNP data from 30X whole-genome sequences for 4 to 5 individuals per population. I initially tried phasing with SHAPEIT and then using LDHelmet, but I abandoned phasing because I suspect it is not very accurate given our per-population sample sizes (although total N across all populations is 40), and we have no experimental crosses to estimate switch error rates (or any reference panel).

I am now using LDhat interval on unphased genotype data (converted from vcf into LDhat format using vcf-tools), running with 15 million iterations, a burn-in of 150,000, and a block penalty of 50 (after evaluating 0, 5, 10 and 20). I computed my own likelihood tables and used a single theta estimated for the whole metapopulation (as the point is to compare across populations, so I wanted to treat them equally). My first question is, do these parameters make sense?

Looking at the MCMC chains, the runs converge, in the sense that the likelihoods and the number of change points stabilise within a million iterations and then stay the same (although in one case there is a jump to a different value around the 10 millionth iteration). However the map lengths vary more between different iterations, in some cases a lot more where there is a lot of spread even in the final iterations. In some cases, but certainly not always, this occurs on chromosomes and populations that have few SNPs (e.g. 10,000 or so). Why would this be? I suspect this might give biased recombination estimates. Should I change my parameters, e.g. increase the burn-in or numbers of runs? (here are some plots to illustrate: https://www.dropbox.com/s/lona2w9egwtqcjt/ldhat_convergence_illustrative_ plots.pdf?dl=0 mainly from cases that failed to converge, of the likelihoods, number of change points, and map lengths plotted against the iteration / run).

After running stats to get the actual recombination rates across the genome, I am getting sensible results overall when comparing populations. However when comparing in detail with LD calculated independently (using vcftools geno-r2) I find some regions of supposedly very low recombination where LD is also low (while it should be high if recombination is low). Averaging both measures in windows of 100 kb, the correlations between them are around -0.30. Given that LDhat uses LD to estimate rates of recombination I would’ve expected a higher (absolute) correlation between these two measures. Why do you think this is so low? Calculating LD based on genotype correlations (as vcftools does) with only 4 or 5 individuals per populations is obviously not extremely precise.

Finally, although I am not specifically looking for hotspots, do you think I should model them anyway? I am now summarising rho in 100 kb windows with at least 20 SNPs. I wonder how much my results will be biased if hotspots turn out to be very important, as I am not sure how to assess (lack of) fit of the non-hotspot model.

Thanks very much for any input you’d be able to provide. Best regards, Vicencio

— You are receiving this because you are subscribed to this thread. Reply to this email directly, view it on GitHub https://github.com/auton1/LDhat/issues/5, or mute the thread https://github.com/notifications/unsubscribe-auth/AElWYyRJJvpNcgB9Tno517za0iqHtZ2wks5s4Z1ogaJpZM4Qkc8O .

-- Adam Auton

vicenciooostra commented 6 years ago

Hi Adam,

thank you so much for your helpful suggestions, including not using the hotspot model.

I am not running LDhat across populations; each of my runs is within a single population. After the LDhat runs I then compare rates of recombination, and in particular the spatial distribution of high and low recombination regions across the chromosome, between different populations.

Yes, I am currently running on whole chromosomes. Thank you for the excellent suggestion to split the chromosomes into chunks of 1000-2000 SNPs, I will try that and see if I get better convergence. Just to clarify, I am using interval, not rhomap, to assess variability in recombination rates across the chromosome.

One final question: I am running LDhat separately for each population, with the goal of comparing different populations. Specifically, I am comparing them in a pairwise manner, i.e. pop A1 with pop A2, pop B1 with pop B2, etc. Now, in many cases the populations that I am comparing differ a lot in number of SNPs. In order to make a “fair” comparison between pop A1 and A2 if they differ a lot in numbers of SNPs, I have used only the SNPs that are shared between pop A1 and pop A2. In other words, when preparing my input files for pop A1 I have removed any SNPs that are not also found in pop A2, and when preparing my input files for pop A2 I have removed any SNPs that are not also found in pop A1. The reason is that pop A1 and A2 differ a lot in diversity, and also in recombination, and I want to be sure that the difference in recombination that I measure between them is not due to simply differing in diversity. Do you think this is indeed a potentially confounding factor, and do you think my solution is reasonable?

All the best, Vicencio