alachins / sweed

Likelihood-based Selective Sweep Detection
GNU General Public License v3.0
35 stars 6 forks source link

Understanding SweeD Results #10

Open noahaus opened 4 years ago

noahaus commented 4 years ago

Hi all,

( I know that this question has been asked and answered, but I found it very tough to follow, so I want to ask a more strategic question).

I am a researcher that is looking to use SweeD as a result for my upcoming paper, and I love that it is straight forward to use. However, I find that I cannot interpret the results easily, and this makes me uncomfortable to say that these results look "good".

I am using SweeD on 700 samples of bacterial isolates that is roughly 3 million bps and 1000 bps per gene. I used a grid value of 100 (which isn't informed by anything, just chose a number that was relatively large) Here is an example of my output:

//1
Position        Likelihood      Alpha   StartPos        EndPos
1.0000  0.000000e+00    1.200000e+03    1.0000  2.0000
33452.9805      0.000000e+00    1.200000e+03    33452.0000      33453.0000
66904.9609      0.000000e+00    3.166228e-02    66525.0000      67886.0000
100356.9414     0.000000e+00    1.200000e+03    100356.0000     100357.0000
133808.9219     0.000000e+00    1.200000e+03    133808.0000     133809.0000
...

yet, I am unsure what is being calculated here. I understand that Likelihood is a CLR, but all my results come out as zero. I also don't quite understand what the position means, especially when it has a decimal next to what I assume is the position in the alignment I use as input.

Based on what I provided in this issue, could you please tell me if my fears are justified? How do I interpret these results? And how can I shift my command to get better results if needed?

Any help would be greatly appreciated!

Noah A. Legall

j-es-craig commented 3 years ago

@noahaus Have you made any progress with this? I have similar questions. While I do get a proper CLR output, I am unsure what the start and end positions mean and I cannot find mention of them in the manual.

Re: the decimal places, I believe they have to do with the alignment.

noahaus commented 3 years ago

@j-es-craig Hey! So I never figured out the questions I posed here, which is sad because the tool would be perfect if the documentation was still maintained.

I ended up just developing my own script to calculate Tajima's D for the isolates, and I'm nearly finished with the program (written in Rust). I found that this was better because Tajima's D can be interpreted. I'll email you once I'm done (maybe within the week).

alachins commented 3 years ago

@noahaus This somehow escaped our noticed. Sorry. Your results, most probably came out as 0.0 because of the small grid value. You could use a value larger than 20,000. The decimal appears because of the way we calculate the positions (equidistantly in-between the first and last SNPs in your data).

@j-es-craig The start and end positions define the extent of the region that was used (the SNPs in that region) for the calculation of the reported score. The start position is the first SNP and the end position is the last SNP, and the reported position with the CLR score is somewhere in-between. We added the start/end position to be able to compare with the respective regions reported by OmegaPlus.

j-es-craig commented 3 years ago

@alachins Thank you for the response.

I just wanted to confirm this interpretation then. Does the position at which the CLR is reported represent the range between the Start and End position? Additionally, the number of SNPs used seems to fluctuate (range between StartPos and EndPos) seems to vary per reported CLR, I was wondering why this is?

alachins commented 3 years ago

Yes, the reported CLR position represents the region [Start, End] because the SNPs in that region were used for the calculation. The general idea is the following. First, a number of predefined positions are found (this depends on the grid parameter). For every position, the CLR score is computed using an algorithm that incrementally includes more SNPs into the calculation. There is a stopping criterion that terminates this algorithm based on the distance between the most distant SNP from the CLR position and the CLR position itself. For one CLR position, this is done for both the region preceding the CLR position and for the region following the CLR position. Thus, the number of SNPs in the region [Start, End] is not fixed and depends on the stopping criterion, which is a function of distance.

hungweichen0327 commented 3 years ago

Dear community,

Scientists usually draw the manhattan plot of likelihood across the chromosome and set the threshold to detect selective regions. But I am not sure the selective regions where they described the candidate genes are used? I would like to use the CLR results by SweeD as an example. SweeD_example

In this simple example above, The highest likelihood value is at physical position 13105.4385. If I want to describe the candidate genes in this selective region, which region I used is correct? The region between 13105.4385-23104.8770? Or the region between 3106-26151?

I am greatly appreciated for anyone's help! Thank you.

Hung-Wei Chen

Yes, the reported CLR position represents the region [Start, End] because the SNPs in that region were used for the calculation. The general idea is the following. First, a number of predefined positions are found (this depends on the grid parameter). For every position, the CLR score is computed using an algorithm that incrementally includes more SNPs into the calculation. There is a stopping criterion that terminates this algorithm based on the distance between the most distant SNP from the CLR position and the CLR position itself. For one CLR position, this is done for both the region preceding the CLR position and for the region following the CLR position. Thus, the number of SNPs in the region [Start, End] is not fixed and depends on the stopping criterion, which is a function of distance.

alachins commented 3 years ago

Hello Hung-Wei, In your example, the selective sweep region corresponding to the highest likelihood is 3106-26151. Best regards, Nikos A.