Closed peterjc closed 8 years ago
Looking back, samtools idxstats
was added in v0.1.8, absent up to and including v0.1.7
$ rm test_chrB_1-100_subset.bam.bai && ./samtools-0.1.8 index test_chrB_1-100_subset.bam && ./samtools-0.1.8 idxstats test_chrB_1-100_subset.bam
chrA:B:C:D 300 0 0
chrA:B 10000 0 0
chrB 80001 24 0
* 0 0 0
Versus with an old BAI file without the magic values,
$ rm test_chrB_1-100_subset.bam.bai && ./samtools-0.1.7 index test_chrB_1-100_subset.bam && ./samtools-0.1.8 idxstats test_chrB_1-100_subset.bam
chrA:B:C:D 300 0 0
chrA:B 10000 0 0
chrB 80001 0 0
* 0 0 0
i.e. samtools idxstats
appears to silently report zeros if the BAI file is missing the counts (rather than warning the index is missing these values).
The number of mapped and unmapped reads are stored in the metabin and are computed from the samtools index
command when it has all the reads available.
The output chopBai essentially provides a view into the bam file using only information from the original index. From this information alone it is not possible to compute the number of reads within the region.
There are two ways of working with idxstats. The first one is to copy the info, this is bad because it won't match the number of reads you would get with samtools view -c
. The second option is to report 0
, and explicitly document the fact that idxstats on the chopped index is not correct.
We will most likely go the second route. As for the metabin we are planning to add *
as a special region for unmapped reads. This would match the (undocumented) behaviour of samtools to fetch those reads, e.g.
samtools view "*"
returns all unmapped pairs and uses the BAI index., https://twitter.com/mcshane/status/665025704141172736
For the special case where chopBAI was asked to do an entire reference you could copy the numbers. Is this case common enough to make this exception?
In general, I agree, you can't calculate sensible numbers without scanning the BAM file. Therefore I would omit the magic-bin and say nothing.
(I think samtools ought to say when this data is missing, rather than report zero, but that's not a chopBAI problem).
Yes, it makes sense to copy the magic bin if you ask for an entire reference. We'll add it for this case and omit the magic bin otherwise.
chopBAI fails to populate the BAI psuedo-bins used to record the total number of reads mapped to or placed against each reference, as used by
samtools idxstats
. These values appears in the references' lists of distinct bins as bin number 37450 (which is beyond the normal range), see https://github.com/samtools/hts-specs/blob/master/SAMv1.tex#L1071To reproduce, using 2df17217b0ed4296858d33de966b76f267eac8d4 (current master branch tip)
Note I am using Mac OS X and samtools 1.3 but any recent version should work.
Now for the bug, first let's look at the reads mapped to or placed on each reference:
Note that this sample test file has no placed but unmapped reads (i.e. unmapped partners of paired reads which are placed at the same POS/RNAME as their mapped partner for sorting purposes), so the final column is all zeros.
Now looking at one of the windowed BAI files and its BAM symlink,
Expected result would be to have zeros for all the non-selected regions, but the full count for
chrB
, i.e.Likewise for a region of a reference,
Here I would expect the same numbers as if I have extracted reads in this region via
samtools view
and indexed the sub-file:I have tried using
chopBAI -l
just in case that made any difference, it did not.