Open noahaus opened 4 years ago
@noahaus Have you made any progress with this? I have similar questions. While I do get a proper CLR output, I am unsure what the start and end positions mean and I cannot find mention of them in the manual.
Re: the decimal places, I believe they have to do with the alignment.
@j-es-craig Hey! So I never figured out the questions I posed here, which is sad because the tool would be perfect if the documentation was still maintained.
I ended up just developing my own script to calculate Tajima's D for the isolates, and I'm nearly finished with the program (written in Rust). I found that this was better because Tajima's D can be interpreted. I'll email you once I'm done (maybe within the week).
@noahaus This somehow escaped our noticed. Sorry. Your results, most probably came out as 0.0 because of the small grid value. You could use a value larger than 20,000. The decimal appears because of the way we calculate the positions (equidistantly in-between the first and last SNPs in your data).
@j-es-craig The start and end positions define the extent of the region that was used (the SNPs in that region) for the calculation of the reported score. The start position is the first SNP and the end position is the last SNP, and the reported position with the CLR score is somewhere in-between. We added the start/end position to be able to compare with the respective regions reported by OmegaPlus.
@alachins Thank you for the response.
I just wanted to confirm this interpretation then. Does the position at which the CLR is reported represent the range between the Start and End position? Additionally, the number of SNPs used seems to fluctuate (range between StartPos and EndPos) seems to vary per reported CLR, I was wondering why this is?
Yes, the reported CLR position represents the region [Start, End] because the SNPs in that region were used for the calculation. The general idea is the following. First, a number of predefined positions are found (this depends on the grid parameter). For every position, the CLR score is computed using an algorithm that incrementally includes more SNPs into the calculation. There is a stopping criterion that terminates this algorithm based on the distance between the most distant SNP from the CLR position and the CLR position itself. For one CLR position, this is done for both the region preceding the CLR position and for the region following the CLR position. Thus, the number of SNPs in the region [Start, End] is not fixed and depends on the stopping criterion, which is a function of distance.
Dear community,
Scientists usually draw the manhattan plot of likelihood across the chromosome and set the threshold to detect selective regions. But I am not sure the selective regions where they described the candidate genes are used? I would like to use the CLR results by SweeD as an example.
In this simple example above, The highest likelihood value is at physical position 13105.4385. If I want to describe the candidate genes in this selective region, which region I used is correct? The region between 13105.4385-23104.8770? Or the region between 3106-26151?
I am greatly appreciated for anyone's help! Thank you.
Hung-Wei Chen
Yes, the reported CLR position represents the region [Start, End] because the SNPs in that region were used for the calculation. The general idea is the following. First, a number of predefined positions are found (this depends on the grid parameter). For every position, the CLR score is computed using an algorithm that incrementally includes more SNPs into the calculation. There is a stopping criterion that terminates this algorithm based on the distance between the most distant SNP from the CLR position and the CLR position itself. For one CLR position, this is done for both the region preceding the CLR position and for the region following the CLR position. Thus, the number of SNPs in the region [Start, End] is not fixed and depends on the stopping criterion, which is a function of distance.
Hello Hung-Wei, In your example, the selective sweep region corresponding to the highest likelihood is 3106-26151. Best regards, Nikos A.
Hi all,
( I know that this question has been asked and answered, but I found it very tough to follow, so I want to ask a more strategic question).
I am a researcher that is looking to use SweeD as a result for my upcoming paper, and I love that it is straight forward to use. However, I find that I cannot interpret the results easily, and this makes me uncomfortable to say that these results look "good".
I am using SweeD on 700 samples of bacterial isolates that is roughly 3 million bps and 1000 bps per gene. I used a grid value of 100 (which isn't informed by anything, just chose a number that was relatively large) Here is an example of my output:
yet, I am unsure what is being calculated here. I understand that Likelihood is a CLR, but all my results come out as zero. I also don't quite understand what the position means, especially when it has a decimal next to what I assume is the position in the alignment I use as input.
Based on what I provided in this issue, could you please tell me if my fears are justified? How do I interpret these results? And how can I shift my command to get better results if needed?
Any help would be greatly appreciated!
Noah A. Legall