brentp / mosdepth

fast BAM/CRAM depth calculation for WGS, exome, or targeted sequencing
MIT License
702 stars 100 forks source link

misleading output in `summary.txt` file #198

Open nservant opened 1 year ago

nservant commented 1 year ago

Hi,

We had an issue which might be interesting for you (and other users). Whe have a WES sample, and after a bug on a storage system, we only get reads on the chr1 after doing an intersectBed of the bam with the bed target file. Nothing related to mostdepth so far.

However, when we compute the coverage of this bam file with ;

mosdepth -t 4 -n --quantize 0:1:10:50:100: --by Twist_Exome_Core_Covered_Targets_hg38.bed PREFIX onTarget.bam

It reports a very high total mean coverage (181.34X) in the summary file (grep total_region)

chrom   length  bases   mean    min max
chr1    248956422   937505185   3.77    0   2887
chr1_region 3358574 609028999   181.34  0   2887
total   248956422   937505185   3.77    0   2887
total_region    3358574 609028999   181.34  0   2887

So to me this is misleading, as the total_region does not seem to take into account the full bed, but only the non-zero chromosomes.

Of note, this is confirmed using the regions.bed.gz file but the 'real' mean coverage should be 18.36X.

>>zcat regions.bed.gz | awk '$1=="chr1"{ z=($3-$2); s=s+z*$4; c=c+z}END{print s/c}'
181.336
>>zcat test.regions.bed.gz | awk '{ z=($3-$2); s=s+z*$4; c=c+z}END{print s/c}'
18.3644
nservant commented 1 year ago

the issue is also true for the file mosdepth.region.dist.txt

brentp commented 1 year ago

Thanks for the clear report. I see what you mean. Due to the way I wrote that part, it's not an easy fix. I'll have a closer look next week but it might be that this will remain a (now) known issue.

nmflack commented 10 months ago

Thank you @nservant for documenting this issue, I've run into the same unfortunately.

Any progress on a fix? If not, can you note this behavior more prominentely in the README as a known issue? It's not a "known" issue otherwise.

I'm using mosdepth to evaluate regional enrichment from Oxford Nanopore adaptive sampling. The difference between total_region and manual calculation from region.bed.gz changed our wet bench plans for a re-run of valuable samples. I sent this issue to colleagues who have been relying on tailing summary.txt as well.

We're running version 0.3.6 from bioconda. Thank you, really appreciate the speed and ease of use of the package for this application.