alachins / raisd

RAiSD: software to detect positive selection based on multiple signatures of a selective sweep and SNP vectors

Most of the sites are discarded #15

Closed palomo11 closed 4 years ago

palomo11 commented 4 years ago

Hi,

I have analysed 12 bacterial populations. I was able to get the µ statistics and the Manhattan plot, but when I look at the sites and SNPs retained, I can see that most of the sites are discarded. A couple of examples are below:

Command: RAiSD -n Genome1_D -I Genome1.vcf -f -y 1 -P -D

 Index: Name | Sites = SNPs + Discarded | Discarded = HeaderCheckFailed + MAFCheckFailed + WithMissing + Monomorphic

 0: Genome1_1 | 9489 = 260 + 9229 | 9229 = 1494 + 0 + 7438 + 297
 1: Genome1_2 | 37578 = 1479 + 36099 | 36099 = 5846 + 0 + 28710 + 1543
 2: Genome1_3 | 9806 = 219 + 9587 | 9587 = 1644 + 0 + 7743 + 200
 3: Genome1_4 | 685 = 21 + 664 | 664 = 127 + 0 + 502 + 35
 4: Genome1_5 | 115 = 0 + 115 | 115 = 28 + 0 + 87 + 0
 5: Genome1_6 | 278 = 16 + 262 | 262 = 71 + 0 + 172 + 19
 6: Genome1_7 | 9407 = 261 + 9146 | 9146 = 1473 + 0 + 7384 + 289
 7: Genome1_8 | 1309 = 40 + 1269 | 1269 = 204 + 0 + 1044 + 21
 8: Genome1_9 | 29073 = 866 + 28207 | 28207 = 4618 + 0 + 22691 + 898
 9: Genome1_10 | 184 = 12 + 172 | 172 = 40 + 0 + 127 + 5
 10: Genome1_11 | 337 = 8 + 329 | 329 = 67 + 0 + 253 + 9
 11: Genome1_12 | 10365 = 270 + 10095 | 10095 = 1641 + 0 + 8200 + 254
 12: Genome1_13 | 17598 = 508 + 17090 | 17090 = 2773 + 0 + 13780 + 537
 ...
 26: Genome1_27 | 95 = 8 + 87 | 87 = 16 + 0 + 70 + 1

Another example:

Command: RAiSD -n Genome13_D -I Genome13.vcf -f -y 1 -P -D

 Index: Name | Sites = SNPs + Discarded | Discarded = HeaderCheckFailed + MAFCheckFailed + WithMissing + Monomorphic

 0: Genome13_1 | 3381 = 18 + 3363 | 3363 = 434 + 0 + 2904 + 25
 1: Genome13_2 | 3683 = 35 + 3648 | 3648 = 599 + 0 + 3012 + 37
 2: Genome13_3 | 2803 = 10 + 2793 | 2793 = 432 + 0 + 2351 + 10
 3: Genome13_4 | 3169 = 33 + 3136 | 3136 = 435 + 0 + 2653 + 48
 4: Genome13_5 | 20165 = 183 + 19982 | 19982 = 2796 + 0 + 16918 + 268
 5: Genome13_6 | 926 = 2 + 924 | 924 = 150 + 0 + 774 + 0
 6: Genome13_7 | 13970 = 112 + 13858 | 13858 = 2066 + 0 + 11633 + 159
 7: Genome13_8 | 1018 = 0 + 1018 | 1018 = 182 + 0 + 836 + 0
 8: Genome13_9 | 15093 = 96 + 14997 | 14997 = 2095 + 0 + 12754 + 148
...
 56: Genome13_57 | 41 = 0 + 41 | 41 = 12 + 0 + 26 + 3

Do you know why most of the sites are discarded? Why would the "header" check fail? And what exactly does "sites with missing data" mean?

Thanks in advance.

alachins commented 4 years ago

Hello,

The header check fails if something is not right in the REF, ALT, or GT fields of a VCF line. Sites with missing data are VCF lines that have missing genotypes, i.e., entries such as "./." or ".". You can use one of the missing-data strategies implemented in RAiSD to include such sites in the analysis.
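To see which VCF lines are likely to land in each discard category before running RAiSD, a small pre-check along these lines may help. It is only a sketch and not RAiSD's own code: the function names `classify_site` and `summarize` are made up for illustration, and the rules it applies (single-base A/C/G/T alleles in REF and ALT, a `GT` key present in FORMAT) are assumptions about what a header check might look for, not the tool's actual criteria.

```python
from collections import Counter

# Assumption: only single-base A/C/G/T alleles count as a valid REF/ALT.
VALID_BASES = {"A", "C", "G", "T"}


def classify_site(vcf_line: str) -> str:
    """Assign one VCF data line to a rough category (illustrative rules only)."""
    fields = vcf_line.rstrip("\n").split("\t")
    # A data line with genotypes needs 9 fixed columns plus at least one sample.
    if len(fields) < 10:
        return "header_check_failed"

    ref, alt = fields[3].upper(), fields[4].upper()
    format_keys = fields[8].split(":")
    if ref not in VALID_BASES or alt not in VALID_BASES or "GT" not in format_keys:
        return "header_check_failed"

    # Collect every allele call across samples; "." marks a missing call.
    gt_index = format_keys.index("GT")
    alleles = []
    for sample in fields[9:]:
        parts = sample.split(":")
        genotype = parts[gt_index] if gt_index < len(parts) else "."
        alleles.extend(genotype.replace("|", "/").split("/"))

    if "." in alleles:
        return "with_missing"
    if len(set(alleles)) < 2:
        return "monomorphic"
    return "snp"


def summarize(lines) -> Counter:
    """Count categories over an iterable of VCF lines, skipping header lines."""
    counts = Counter()
    for line in lines:
        if line.startswith("#") or not line.strip():
            continue
        counts[classify_site(line)] += 1
    return counts
```

Running `summarize(open("Genome1.vcf"))` on the file from the question would give per-category counts for the whole file. These will not necessarily match the per-region numbers in the RAiSD report, since the real checks may differ from the assumptions above, but a large `with_missing` count would point to "./." or "." genotypes as the main cause.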