alachins / raisd

RAiSD: software to detect positive selection based on multiple signatures of a selective sweep and SNP vectors
33 stars 13 forks source link

How are start and end positions of a window determined? #7

Closed twooldridge closed 4 years ago

twooldridge commented 5 years ago

Hello,

I was playing around with some results from both RAiSD and SweeD, and I became confused by the start and end positions output by both programs. How exactly are these points determined? It appears to me that the size of the window can seriously influence the computed statistics, for example having a large window with few SNPs will raise VAR in RAiSD (pointed out by me here https://github.com/alachins/raisd/issues/6). Looking at this extreme example from a SweeD run, you can see a major change in the window size (indicated by the final column)

chr start end pos Likelihood Alpha size
chr3 84259599 84259618 84259618 0.00000000 6.000001e+00 19
chr3 84260581 84260646 84260618 0.00000000 4.000001e-01 65
chr3 84261616 84261617 84261618 0.08344922 1.200000e+01 1
chr3 84186720 84338351 84262618 24.11272000 1.579946e-04 151631
chr3 84263616 84263617 84263618 0.16876960 1.200000e+01 1
chr3 84264511 84264618 84264618 0.00000000 6.000001e+00 107
chr3 84265600 84265631 84265618 0.00000000 8.000001e-01 31

And this data has already been masked for low coverage, low mappability, and generally low-quality regions that I might expect to produce this sort of result. I see similar effects with RAiSD, again following masking. Because of this I would like to understand how these boundaries are determined, but looking through the publications and source code for both programs I'm having trouble identifying where this is discussed. Please excuse me if I'm missing something obvious here.

As a side note, my input files for SweeD are in the SweepFinder format, and I exclude monomorphic sites. For RAiSD, my input file is simply a VCF.

Thank you for your help, and please let me know if I can provide any more information.

Best, Brock

idaios commented 5 years ago

Dear Brock, The two programs, RAiSD and SweeD treat windows in a different way.

SweeD: The reported window is determined dynamically while SweeD runs. In detail, the window that maximizes CLR is reported. Assume that you want to calculate CLR at a position X. SweeD starts from the smallest possible window around X and expands it till a maximum threshold which has the value (alpha d) = 12. alpha d is a scaled distance (d in base pairs) with the parameter alpha. The smaller alpha the stronger the selection coefficient is. Thus, in short, for very strong selection, the window can become very large, which is biologically correct: very strong selection affects very large genomic regions. This is exactly what I see in your case. For chr3, the alpha value is small and the window large.

RAiSD: The window size in RAiSD is constant in terms of SNP number. Thus, if a region is sparse (which is a signature of selection), then the constant number of SNPs will occupy a large genomic region in terms of bp.

When you use either tool (or other tools that search for selection), you should be careful with "weird" regions, i.e. centromers. We suggest that these regions should be masked later on, when you finish the analysis. What I'm doing is to exclude centromers as well as a small region around them.

Please contact me if you have further questions pavlos & nikos

twooldridge commented 5 years ago

Hi Pavlos & Nikos,

Thank you for the very detailed response, I see the distinction between RAiSD and SweeD now! One follow up question occurs to me: for the chr3 example I gave above, why would directly adjacent positions not also report a relatively high Likelihood and large-ish window? If selection was strong enough to reduce variation such that SweeD extends out ~75000bp to each side of position 84,262,618, presumably selection has acted over a large enough region that, when starting next door at 84,263,618, SweeD would similarly extend boundaries a good ways to meet (alpha*d)=12.

It is true that for some genomic regions in my data with very high Likelihood values, I do see a gradual "peak" in the Likelihood landscape rather than a single point. This is typically over ~MB sized regions. Perhaps, the effect just needs to be very strong for this to happen? I should also qualify that LD in my populations is VERY low, so it would not be unexpected for signals to disappear quickly at nearby sites.

Hopefully this makes sense, and thanks again for your help.

Best, Brock

idaios commented 5 years ago

Hi Brock,

"Thank you for the very detailed response, I see the distinction between RAiSD and SweeD now! One follow up question occurs to me: for the chr3 example I gave above, why would directly adjacent positions not also report a relatively high Likelihood and large-ish window? If selection was strong enough to reduce variation such that SweeD extends out ~75000bp to each side of position 84,262,618, presumably selection has acted over a large enough region that, when starting next door at 84,263,618, SweeD would similarly extend boundaries a good ways to meet (alpha*d)=12."

Not necessarily because the distance between the hypothetical sweep position and the SNP plays an important role. So for example, it is possible that there are some doupletons near 84,263,618 (or one tripleton etc) and many n-1's near 84,262,618. The likelihood for the 84,262,618 position is high, however the likelihood for the 84,263,618 position is low.

Τhe reported positions are a bit strange. We will check it, if there is some bug there.

best pavlos

idaios commented 5 years ago

Dear Brock, could you please send me the dataset that you generated these results? We'd like to check it usng some printouts.

best pavlos

twooldridge commented 5 years ago

Hi Pavlos,

I've attached the gzipped Sweepfinder-formatted input file for chr3 and that population. You'll see that I keep monomorphic sites in the file if the ancestral state is unknown, per recommendations for using SweepFinder2 (which I've also played around with). I exclude these anyway when I run SweeD, using the default parameters. Please let me know if the file doesn't work, or if I can provide more information. Thank you!

Best, Brock

NUB.chr3.SF.txt.gz

alachins commented 5 years ago

Hi Brock, Can you also provide the command line you used? Best regards, Nikos

twooldridge commented 5 years ago

Hi Nikos,

Of course, I should have included that earlier. The command was:

/n/home11/twooldridge/Software/sweed/SweeD-P -threads 16 -grid 161151 -input sweepfinder_files/NUB.chr3.SF -name NUB.chr3.SweeD

With the grid size chosen to approximate 1kb windows.

Thanks, Brock

On May 2, 2019, at 3:28 AM, alachins notifications@github.com wrote:

Hi Brock, Can you also provide the command line you used? Best regards, Nikos

— You are receiving this because you authored the thread. Reply to this email directly, view it on GitHub https://github.com/alachins/raisd/issues/7#issuecomment-488576359, or mute the thread https://github.com/notifications/unsubscribe-auth/AEACLKCFZYEGHVRLR24LWKDPTKJYRANCNFSM4HELUSTA.