cpockrandt / genmap

GenMap - Fast and Exact Computation of Genome Mappability

interpretation of output values #24

Open edg1983 opened 2 years ago

edg1983 commented 2 years ago

Hi,

I've used your pre-compiled index files to compute mappability with -K 150, assuming this is a good approach to compute the expected mappability for 150 bp read sequencing (I also tried -K 100 and -K 75, and the considerations below are still valid).

In the resulting BED file, I see that the computed values are either in the range 0-0.5 or exactly 1, with no values between 0.5 and 1. Is this expected? Are the output values actual mappability values, so that lower values correspond to regions that are difficult to map? If so, why are there no values between 0.5 and 1?

If low values are associated with mapping problems and the computed values are correct (thus most values are < 0.5), do you have any suggestions for a threshold to define difficult-to-map regions for variant filtering?

Thanks!

Edoardo

cpockrandt commented 2 years ago

Hi @edg1983,

Yes, it is correct that there are no values between 0.5 and 1.0. The mappability value is the multiplicative inverse of the number of occurrences of a k-mer: a value of 1.0 means the k-mer is unique in the genome, 0.5 means it occurs twice, and 0.33 means it occurs three times. Since the number of occurrences is a whole number, the next possible value below 1.0 is 0.5.

So your assumption is correct: lower values represent regions that are more repetitive, hence more difficult to map.
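To make the inverse relationship concrete, here is a minimal Python sketch on a toy sequence. It is not how GenMap computes its values: it counts exact matches on the forward strand only, ignores any error tolerance or strand handling the tool may apply, and the names `kmer_mappability`, `occurrences_from_value`, and `toy` are made up for this illustration.

```python
from collections import Counter
from typing import List


def kmer_mappability(sequence: str, k: int) -> List[float]:
    """Return 1 / (exact occurrence count) for the k-mer starting at each position.

    Simplifying assumptions: exact matches only, forward strand only.
    """
    kmers = [sequence[i:i + k] for i in range(len(sequence) - k + 1)]
    counts = Counter(kmers)
    return [1.0 / counts[kmer] for kmer in kmers]


def occurrences_from_value(value: float) -> int:
    """Recover the occurrence count from a (possibly rounded) mappability value > 0."""
    return round(1.0 / value)


# Hypothetical toy sequence: the 4-mer "ACGT" occurs twice, every other 4-mer once.
toy = "ACGTACGTTTGCA"
values = kmer_mappability(toy, k=4)

for position, value in enumerate(values):
    print(position, toy[position:position + 4], value)
```

Because the count is always a whole number, every value produced this way is 1, 1/2, 1/3, and so on, which is why nothing can fall strictly between 0.5 and 1. Going the other way, `occurrences_from_value(0.33)` gives back 3 occurrences.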

I don't have a magic threshold number, but the section on Mappability and SNP calling might be of interest to you.

Christopher