alachins / raisd

RAiSD: software to detect positive selection based on multiple signatures of a selective sweep and SNP vectors
33 stars 13 forks source link

Window size in RAiSD to compare results from SweeD #37

Open DuttaAnik opened 2 years ago

DuttaAnik commented 2 years ago

Hi, I am currently doing a selection scan using SweeD and RAiSD. I would like to merge both outputs to find overlapping regions under selection. But I am having a confusion with the window-based output. For example, I ran SweeD using the following code: SweeD -input Scaffold_1__2_contigs__length_78070536.vcf -grid 7807 -folded -noSeparator -name 1963_Chr_1 -sampleList 1963_pop.txt The contig length is 78070536 and I used the grid size of 7807 to obtain a window of 10 kb.

However, if I want to compare the result with RAiSD, then do I also need to set a window size of 10 kb using -w 10 in RAiSD? Because the default is 50 kb if I am not mistaken.

Also, is it recommended to use individual VCF for each chromosome as an input in RAiSD or all the chromosomes in one VCF for a single population would be suitable? Because I am working with a non-model organism whose reference genome is not at a chromosome scale yet. It has ~100 scaffolds.

Thank you.

alachins commented 2 years ago

In SweeD, the grid parameter defines the number of evaluation points, not evaluation windows. So "The contig length is 78070536 and I used the grid size of 7807" will give you one evaluation point every 10kb. SweeD is performing a MLE. The report file includes the window size for each evaluation point.

In RAiSD, the window width is not kb but the number of SNPs. To perform the comparison you want, both tools need to be executed with the same grid value to generate compatible reports. You can provide one large VCF with many chromosomes. Each chromosome will be processed independently of the others.

On Fri, Jul 1, 2022 at 1:44 PM Anik Dutta @.***> wrote:

Hi, I am currently doing a selection scan using SweeD and RAiSD. I would like to merge both outputs to find overlapping regions under selection. But I am having a confusion with the window-based output. For example, I ran SweeD using the following code: SweeD -input Scaffold_1__2_contigs__length_78070536.vcf -grid 7807 -folded -noSeparator -name 1963_Chr_1 -sampleList 1963_pop.txt The contig length is 78070536 and I used the grid size of 7807 to obtain a window of 10 kb.

However, if I want to compare the result with RAiSD, then do I also need to set a window size of 10 kb using -w 10 in RAiSD? Because the default is 50 kb if I am not mistaken.

Also, is it recommended to use individual VCF for each chromosome as an input in RAiSD or all the chromosomes in one VCF for a single population would be suitable? Because I am working with a non-model organism whose reference genome is not at a chromosome scale yet. It has ~100 scaffolds.

Thank you.

— Reply to this email directly, view it on GitHub https://github.com/alachins/raisd/issues/37, or unsubscribe https://github.com/notifications/unsubscribe-auth/AALKWCXVZAK7HDDG2NBHKJ3VR3DY7ANCNFSM52MJYIRA . You are receiving this because you are subscribed to this thread.Message ID: @.***>

-- Nikolaos Alachiotis

DuttaAnik commented 2 years ago

Thank you for the reply. But there is a problem. I am using the SweeD version 3.3.2, not the latest one where you get the start and stop position. The reason for not using the latest version is that I do not get any likelihood values in the output. With the same command using SweeD 3.3.2, I get results with different likelihood values. But with the latest SweeD version, all the likelihood values are Zero. Do not know what is causing this.

But coming back to the point, for example, I have an output like this from SweeD:

Position    Likelihood  Alpha
53744.0000  2.599466e-01    1.939265e+01
63731.3682  6.088593e-01    7.807052e-05
73718.7363  1.341805e+00    8.917006e-05
83706.1045  3.494308e-01    5.031090e-03
93693.4727  3.847188e+00    1.736058e-04
103680.8408 4.946577e-08    3.076922e-01
113668.2090 9.224968e-08    8.163262e-02
123655.5771 4.783268e-08    3.000000e+00

Here, you said that the report file includes the window size for each evaluation point which is I guess the "Position"? But how can I obtain the window size from such output? Should I add 5 kb up and down from each of the values under the "Position" column, which is above the significance threshold?

alachins commented 2 years ago

I think that adding 5kb on each side will not work. This way you will either overestimate (in the case of small grid sizes) or underestimate (in the case of large grid sizes) the window per evaluation position. You can obtain a much better estimation of the window per position by setting a very large grid size and then cluster together consecutive evaluation positions with the same scores. This should give you a much more accurate estimation of the window per position but it will take longer to execute and the clustering is not automatically done by SweeD.

On Sun, Jul 3, 2022 at 1:21 PM Anik Dutta @.***> wrote:

Thank you for the reply. But there is a problem. I am using the SweeD version 3.3.2, not the latest one where you get the start and stop position. The reason for not using the latest version is that I do not get any likelihood values in the output. With the same command using SweeD 3.3.2, I get results with different likelihood values. But with the latest SweeD version, all the likelihood values are Zero. Do not know what is causing this.

But coming back to the point, for example, I have an output like this from SweeD:

Position Likelihood Alpha 53744.0000 2.599466e-01 1.939265e+01 63731.3682 6.088593e-01 7.807052e-05 73718.7363 1.341805e+00 8.917006e-05 83706.1045 3.494308e-01 5.031090e-03 93693.4727 3.847188e+00 1.736058e-04 103680.8408 4.946577e-08 3.076922e-01 113668.2090 9.224968e-08 8.163262e-02 123655.5771 4.783268e-08 3.000000e+00

Here, you said that the report file includes the window size for each evaluation point which is I guess the "Position"? But how can I obtain the window size from such output? Should I add 5 kb up and down from each of the values under the "Position" column, which is above the significance threshold?

— Reply to this email directly, view it on GitHub https://github.com/alachins/raisd/issues/37#issuecomment-1173053969, or unsubscribe https://github.com/notifications/unsubscribe-auth/AALKWCU2F26PI6OLIMISVKTVSFSSNANCNFSM52MJYIRA . You are receiving this because you commented.Message ID: @.***>

-- Nikolaos Alachiotis

DuttaAnik commented 2 years ago

Sorry, this is not a SweeD platform, but still, I want to further ask a question since I am new to this kind of analysis.

Ok, but then how can I conclude for example a region that is under selection? or what is the length of that selection sweep region from those points under the "Position" column? Because as you see those numbers are in Decimal and cannot be an SNP for sure. Let's say I have the following positions on a chromosome for one population which are above the cutoff threshold at 99.5%

Contig  Position    Likelihood  Alpha
Chr3_1963   39556821.3389   906.004 9.723529e-06
Chr3_1963   39566819.457    1083.519    1.166873e-05
Chr3_1963   39576817.5752   1448.674    1.534434e-05
Chr3_1963   39586815.6934   1717.467    1.709699e-05
Chr3_1963   39596813.8115   1633.023    1.516612e-05
Chr3_1963   39606811.9297   1628.539    1.39571e-05
Chr3_1963   39616810.0479   1672.212    1.335288e-05
Chr3_1963   39626808.166    1687.752    1.244344e-05
Chr3_1963   39636806.2842   979.9007    1.267864e-05
Chr3_1963   39646804.4023   758.3056    1.301793e-05
Chr3_1963   39656802.5205   489.5013    1.211109e-05

And for another population, I have only one CLR value above the cutoff like:

Contig  Position    Likelihood  Alpha
Chr3_1958   104724555.5322  328.3317    7.988635e-05

For the first case, should I round up the Position 39556821.3389 to a discrete number and then combine the subsequent positions until 39656802.5205 to define a region under selection?

If that is the case, then how can I determine the region of selective sweep for the second case where I have only one position?

It would really great if you can help me to define the selective sweep region here. Thanks.