arq5x / bedtools

A powerful toolset for genome arithmetic.
http://code.google.com/p/bedtools/
GNU General Public License v2.0
140 stars 85 forks source link

coverageBed coverage/number of features discrepancy #61

Open ghost opened 11 years ago

ghost commented 11 years ago

Hello,

I am running coverageBed to determine coverage of a specific regions of a genome by RAD tag data (restriction-site associated DNA reduced representation sequences). I am interested in the coverage for specific regions on 5 chromosomes. The default coverageBed output for this sample (see below) shows that, for the region on chr1, there were 973 features in A that overlapped B. My 'features' are 75 bp long, so if the depth was one this would be a maximum total of 72,975 bp of B covered by A. However, the column for number of bases in B with non-zero coverage by features in A shows 124,757. Why the discrepancy? I would expect the number of bases with non-zero coverage to be less than 72,975.

coverageBed output for my sample:

chr1 12700001 21700000 973 124757 8999999 0.0138619 chr11 12800001 16700000 591 73123 3899999 0.0187495 chr5 4800001 6800000 293 37381 1999999 0.0186905 chr6 13036316 30875458 2514 322205 17839142 0.0180617 chr8 20500001 23600000 391 49935 3099999 0.0161081

My command is like this:

$ coverageBed -a sample.bed -b chr_regions.bed > sample_cov_output.tsv

sample.bed is created from a BAM file using this command:

$ bamToBed -i sample.bam > sample.bed

arq5x commented 11 years ago

Have you checked to see if, by any chance, some of your BAM alignments are spliced? This could lead to an artificial McLain of the coverage if you don't use the -split option when you run bamToBed. Have you looked at the region in IGV or UCSC to see if you can make any sense of why the numbers may be different than you expect? Have you looks at the sample.bed file to make sure that all of the intervals are 75bp?

$ awk '($3 - $2) > 75' sample.bed