fgvieira / ngsLD

Calculation of pairwise Linkage Disequilibrium (LD) under a probabilistic framework
GNU General Public License v2.0
43 stars 7 forks source link

Determining --max_kb_dist #57

Open baileyjc opened 3 weeks ago

baileyjc commented 3 weeks ago

Hello, thank you for providing a great package and for being willing to address issues on Git Hub. It has been very helpful to me in refining my code. I am running into an issue that I hesitate to bring up because I keep wondering if I have not looked hard enough for the information so if that is the answer to my question please let me know and my apologies for the open issue.

That being said, my issue is deciding on what number to give --max_kb_dist. I have not found a similar issue but my apologies if one has already been opened. I ran ngsLD over a single chromosome ~30 million base pairs in length and ~80,000 SNPs present between the 154 individuals in my dataset with --max_kb_dist set to 0. I tried to run the LD_decay script with the file but I believe the ngsLD output file is too large (unzipped ~220 Gb). I used a smaller file (~20Gb) setting --max_kb_dist to 1000 and I did not run into issues with the LD_decay script until I used the larger file. What I found from the smaller file was LD decay decreased to a more or less asymptote at 200kb (x-axis) at 0.05 LD (I think that is the y-axis label?). However, when I filter my file from the --max_kb_dist 0, keeping only linked sites with an R-squared > 0.5 (column 7) the mean value in column 3 of the output file is 6 million. I wanted to see the sites that were in linkage and the distance between them on average. I also get a max distance of ~30 million, the actual value is slightly less than the full chromosome length. When I only include linked sites that have >20 million base pairs between them and have an R-squared > 0.5, I get 200k linked sites. For a chromosome that is only 30 million base pairs, it is difficult to believe they are linked via recombination but instead by some other evolutionary influence.

However, this brings me to my main question, how are people determining --max_kb_dist given the challenges of plotting LD_decay for the full chromosome because of its size and my other larger chromosomes with more SNPs won't finish running in ngsLD given my cluster's 3 day limit? I have seen values from 1 to 10 to 50 for --max_kb_dist (but no real evidence for why they chose the value) and with regard to the LD_decay plot people determine the distance once the exponential line passes 0.1 (is that right?). At what distance are SNPs considered linked versus maybe due to selection or another explanation? I have read through the ngsLD github page and your ngsLD manuscript and I have not been able to find an answer to fully explain what is the appropriate value to use with --max_kb_dist. I have also been trying to find common linkage distances for fish but have not found this information so far. Thank you for your time and again for providing this useful package!

fgvieira commented 6 days ago

Hi @baileyjc,thanks for your interest in ngsLD!

Before going into your questions, I'll just mention that the --max-kb-distoption is meant to speed-up analysis since (as you notice) an LD analyses on a full chromosome can be quite slow. Depending on the objective of the analyses, it might not make sense to use it (more on that later).

Just to make sure I understood your question:

Going into your questions, and even though it is a bit hard without looking at the data:

What I found from the smaller file was LD decay decreased to a more or less asymptote at 200kb (x-axis) at 0.05 LD (I think that is the y-axis label?). However, when I filter my file from the --max_kb_dist 0, keeping only linked sites with an R-squared > 0.5 (column 7) the mean value in column 3 of the output file is 6 million. I wanted to see the sites that were in linkage and the distance between them on average. I also get a max distance of ~30 million, the actual value is slightly less than the full chromosome length. When I only include linked sites that have >20 million base pairs between them and have an R-squared > 0.5, I get 200k linked sites. For a chromosome that is only 30 million base pairs, it is difficult to believe they are linked via recombination but instead by some other evolutionary influence.

Two sites can be in LD for several reasons: (e.g.) selection, mutation rate, genetic drift, or gene interactions. In your case, I'd say (but please check) that most of your sites in LD are at relatively short distances (hence the 200kb asymptote), but that you also have some very distant ones. These sites seem to be LRLD and can reflect (e.g.) drift or gene interactions (these could even be between different chromosomes!).

How to define --max-kb-dist

It depends on the goal of your study. If you want to study short range LD to look for effects of (e.g.) demography or selection, then your threshold of 200kb sounds reasonable. However, if you want to study LRLD, then I'd use --max-kb-dist 0 (prob with all chromosomes at the same time) and then filter the output for long range interactions.

Hope this makes sense! Best,

PS - You can also calculate LD on your entire dataset, but perform LD_decay analyses on a random subset of interactions. You should get an LD_decay plot similar to when you used --max_kb_dist 1000.

baileyjc commented 2 days ago

Hi @fgvieira,

Thank you for your reply and for taking the time to read through my issue!

I can provide data if need be but I think you have answered my questions. The main issue I was having was deciding on the --max_kb_dist filter and not knowing what downstream analyses may or may not affected based on the number I chose to give --max_kb_dist.

For some of my samples, I agree that the linkage should be short distances because the population is large and the gene flow is high, however, some of my samples are from island-like environments. Preliminary analyses seem to support minimal if any gene flow which leads to me think that the founder effect might be at work resetting recombination in the now-isolated populations over the last ~10k years. I plan to run each population through ngsLD and generate a corresponding LD decay plot to see if this is the case.

Given your suggestions, I think I will set --max_kb_dist to 200k for analyses including all populations and for those populations that might have long-distance linkage those sites won't be identified and filtered out. They will just increase the differentiation with other populations. I'm not sure there is really any way around that. I guess I could run my SNP files through ngsLD using a --max_kb_dist filter to get rid of most of the linked sites and then run again with --max_kb_dist set to 0 to identify those linked sites.

If I use --max_kb_dist with the fit_LDdecay.R script. Then I should get the same plot as if I filtered with --max_kb_dist in ngsLD, right? For example, if I ran ngsLD with --max_kb_dist 100 and then plot the LD decay this would produce the same plot as running ngsLD with --max_kb_dist set to 0 and then plotting LD decay but using the --max_kb_dist set to 100. My apologies if that's not clear.

Thank you so much for the support @fgvieira!

fgvieira commented 2 days ago

If I use --max_kb_dist with the fit_LDdecay.R script. Then I should get the same plot as if I filtered with --max_kb_dist in ngsLD, right? For example, if I ran ngsLD with --max_kb_dist 100 and then plot the LD decay this would produce the same plot as running ngsLD with --max_kb_dist set to 0 and then plotting LD decay but using the --max_kb_dist set to 100. My apologies if that's not clear.

Yes, you should get similar results (at least that was the idea 😄).