DCGenomics / DangerTrack

MIT License
13 stars 3 forks source link

How to interpret bedGraph files after conversion from bigWig #4

Open sean-la opened 7 years ago

sean-la commented 7 years ago

Hi all,

I've converted the mapability bigWig files to bedGraph format to prepare them for liftover to h38 and I'm doing some manual sanity checks to make sure I converted them properly. They don't look right, however. For example, this is an excerpt of the 50mer mapability data for chromosome 17 of the human genome.

chr17   0       145     1
chr17   145     153     0.5
chr17   153     200     1
chr17   200     208     0.5
chr17   208     881     1
chr17   881     882     0.166667
chr17   882     883     0.333333
chr17   883     884     0.5
chr17   884     886     0.25
chr17   886     887     0.1
chr17   887     890     0.333333
chr17   890     893     0.25
chr17   893     894     0.125
chr17   894     895     0.166667
chr17   895     896     0.125
chr17   896     898     0.166667
chr17   898     900     0.1

If I understand the correctly, the start position and stop positions should have a difference of 50 bp since this from the 50mer mapability file. However, as you can see, many of the start and stop positions have differences that are not equal to 50. Am I interpretting this correctly?

fritzsedlazeck commented 7 years ago

Hi, 50mer mappability means that the kmer is 50bp. So these scores are with respect to the uniqueness of 50bp sequences. Cheers Fritz

sean-la commented 7 years ago

Could you explain what you mean by "these scores are with respect to the uniqueness of the 50bp sequences"?

fritzsedlazeck commented 7 years ago

Sure sorry for the short reply. So mappability is basically coming from the mapping world, where it represents a measurement of how uniquely a read can be placed. For simplicity think about the distance of the best scoring and the second best scoring alignment of a given read in comparison to the reference genome. If that distance is small the mappability will be small, which means its not clear from which of these regions the read is coming from. Now it is more likely to find a 50bp or smaller stretch of sequence identical compared to a e.g. 100bp stretch. From a naive point: 4^50 << 4^100 which is the probability to see the exact sequence (given some naive assumptions)

To make this more comparable and identify hard to map to regions, people computed such mappability tracks. Now that must always be with respect of some read or sequence length. In this case the 50mer mappability means it is computed with respect to a sequence length of 50bp.