alachins / raisd

RAiSD: software to detect positive selection based on multiple signatures of a selective sweep and SNP vectors
33 stars 13 forks source link

very high FPR and very low TPR #14

Closed palomo11 closed 4 years ago

palomo11 commented 4 years ago

Hi,

I have run RAiSD in several vcf generated from bacterial species (each of them representing 24 individuals).

Here I describe one example (but similar results were obtained for the other species:

./RAiSD -n Species1 -I Species1.vcf -f -y 1 -P -k 0.05 

output:

 SORTED DATA (FPR 0.050000)
 Size           19
 Highest Score      7.753684998
 Lowest Score       0.000000000
 FPR Threshold      7.753684998
 Threshold Location 0
./RAiSD -n Species1_TPR -I Species1.vcf -f -y 1 -l 7.753684998

output:

 SCORE COUNT (Threshold 7.753685)
 TPR    0.052632

I'm able to get the Manhattan plot and 3 regions pass the threshold set at 0.999.

These regions contain 122, 73 and 47 genes, respectively.

Do you know if there is any issue going on? Do these results mean that there's no clear selective sweep going on?

alachins commented 4 years ago

The -k and -l options are not to be used on the same dataset. You need to use a neutral dataset with -k to have RAiSD give you the cutoff threshold for FPR 0.05 for instance. You can then use the reported threshold on a dataset with selection via -l to get the TPR. By using the same dataset, the reported TPR you see (TPR=0.052632) is essentially the FPR you used (-k 0.05).

palomo11 commented 4 years ago

Thanks for the answer. As I don't have a neutral dataset, can I use d1/msneutral1.out with -k 0.05 and then use the reported FRP threshold (i.e. 0.000221867) as -l for my dataset?

alachins commented 4 years ago

You need to use a neutral dataset to calculate the FPR, and a dataset with selection to calculate the TPR. If I understand correctly, your Species1.vcf is real data, correct? In this case, you need to calculate a cutoff threshold. One way would be through simulated data using the same demographic model as your real data. An easier way, and also a widely used one, is to assume your entire dataset is neutral, except for the highest-score regions. This is practically done when you set the threshold for a Manhattan plot.

palomo11 commented 4 years ago

OK, I understand. Then, I can assume that those regions which passed the threshold set at 0.999 for the Manhattan plot are with high probabilities positively selected?

alachins commented 4 years ago

Thats right!