alachins / raisd

RAiSD: software to detect positive selection based on multiple signatures of a selective sweep and SNP vectors
33 stars 13 forks source link

Inflated mu-statistic due to gaps in assembly (and no variants) #6

Closed twooldridge closed 4 years ago

twooldridge commented 5 years ago

Hello,

Thank you for developing this tool; its speed is very impressive and makes it much more convenient than SweeD, SweepFinder, etc. I was exploring some results today and noticed an odd pattern in a specific region of the genome:

image

It seemed that there were gaps on either side of the mu-statistic "peak" where no stats had been reported. Upon inspecting the Report file more closely, I can see that this is the case:

129555088 | 129552920 | 129557248 | 5.595e-07 | 0.34 | 1.842 | 3.503e-07 | chr4 129555088 | 129552936 | 129557248 | 5.574e-07 | 0.36 | 1.872 | 3.756e-07 | chr4 129567032 | 129552984 | 129581080 | 3.632e-06 | 0.34 | 1.865 | 2.304e-06 | chr4 129567088 | 129553072 | 129581096 | 3.623e-06 | 0.34 | 1.865 | 2.298e-06 | chr4

You can see how, in the first column, the middle position makes a large ~10kb jump from row2 to row3, and no stats are reported for that region. The new start and end positions for row3, and subsequent rows, indicate a much larger window than is typical of the rest of the file. I see the same phenomenon up until the end of this region, where the mu-stats drop back below 1e-06.

Thinking that this might be due to a decrease in the number of SNPS in this window (suggested by the increase in VAR, column 4), I examined my VCF file in the region from row3 and found that there were next to no SNPs here. However, when I look at my reference genome, I find that nearly this entire region is composed of N's, and is flanked only in the outside ~1kb by actual bases. Essentially, the points with high mu values in the plot above are sitting directly over a gap in our genome assembly, and appear to be an artifact. Am I correct in thinking that this gap, where I'm unable to even call variants, is interpreted by RAiSD as a region of extremely low diversity, thereby inflating the mu-statistic? Do you have any suggestions for how to deal with this? Is there some way I could mask this region of the genome in the input? Or would it be best to remove regions with some proportion of N's (or abnormal window size??) after the analysis has completed?

Thanks in advance for your help. Please let me know if there's any more information I can provide.

Best, Brock

alachins commented 5 years ago

Hi Brock, Unlike other tools that determine which locations to evaluate regardless of where SNPs are, RAiSD computes scores only in regions with SNPs, with more scores reported in SNP-dense regions and less scores reported in SNP-sparse regions. This is one of the reasons for higher accuracy and speed. In your run, the peak over the region with gaps is due to muVAR as you pointed out. You need to remove this region from your file. We are going to introduce an option for this in one of the upcoming releases. Note that there are 4 alternative strategies to deal with missing data implemented in RAiSD at the moment, which you can try. Best regards, Nikos

twooldridge commented 5 years ago

Hi Nikos,

Thank you for that explanation, that makes sense. For now, I'll remove these regions from the output file. Looking at the 4 strategies for dealing with missing data with the -M flag, am I correct in thinking that none of those will solve this current problem I'm dealing with? If I understand it correctly, this is because all options deal with missing data at the individual level, not missing SNPs due to assembly/masking issues?

Best, Brock

alachins commented 5 years ago

Thats right. The currently implemented strategies deal with missing data in a SNP, given that there is a SNP entry in the VCF file.

Nikos

alachins commented 4 years ago

The option to exclude any number of regions per chromosome is now fully implemented, as of version 2.3.