DecodeGenetics / chopBAI

Chops a bam index file into pieces
GNU General Public License v2.0
5 stars 1 forks source link

Mapped and placed counts in BAI not reported (samtools idxstats) #11

Closed peterjc closed 8 years ago

peterjc commented 8 years ago

chopBAI fails to populate the BAI psuedo-bins used to record the total number of reads mapped to or placed against each reference, as used by samtools idxstats. These values appears in the references' lists of distinct bins as bin number 37450 (which is beyond the normal range), see https://github.com/samtools/hts-specs/blob/master/SAMv1.tex#L1071

To reproduce, using 2df17217b0ed4296858d33de966b76f267eac8d4 (current master branch tip)

$ git reset --hard
$ git checkout 2df17217b0ed4296858d33de966b76f267eac8d4
$ make
$ make test

Note I am using Mac OS X and samtools 1.3 but any recent version should work.

$ samtools --version
samtools 1.3
Using htslib 1.3
Copyright (C) 2015 Genome Research Ltd.

Now for the bug, first let's look at the reads mapped to or placed on each reference:

$ samtools idxstats tests/test.sorted.bam
chrA:B:C:D  300 44  0
chrA:B  10000   2236    0
chrB    80001   17718   0
*   0   0   0

Note that this sample test file has no placed but unmapped reads (i.e. unmapped partners of paired reads which are placed at the same POS/RNAME as their mapped partner for sorting purposes), so the final column is all zeros.

Now looking at one of the windowed BAI files and its BAM symlink,

$ samtools idxstats tests/chrB/test.sorted.bam
chrA:B:C:D  300 0   0
chrA:B  10000   0   0
chrB    80001   0   0
*   0   0   0

Expected result would be to have zeros for all the non-selected regions, but the full count for chrB, i.e.

chrA:B:C:D  300 0   0
chrA:B  10000   0   0
chrB    80001   17718   0
*   0   0   0

Likewise for a region of a reference,

$ samtools idxstats "tests/chrB:1-100/test.sorted.bam"
chrA:B:C:D  300 0   0
chrA:B  10000   0   0
chrB    80001   0   0
*   0   0   0

Here I would expect the same numbers as if I have extracted reads in this region via samtools view and indexed the sub-file:

$ samtools view -b -o test_chrB_1-100_subset.bam tests/test.sorted.bam "chrB:1-100"
$ samtools index test_chrB_1-100_subset.bam 
$ samtools idxstats test_chrB_1-100_subset.bam 
chrA:B:C:D  300 0   0
chrA:B  10000   0   0
chrB    80001   24  0
*   0   0   0

I have tried using chopBAI -l just in case that made any difference, it did not.

peterjc commented 8 years ago

Looking back, samtools idxstats was added in v0.1.8, absent up to and including v0.1.7

$ rm test_chrB_1-100_subset.bam.bai && ./samtools-0.1.8 index test_chrB_1-100_subset.bam && ./samtools-0.1.8 idxstats test_chrB_1-100_subset.bam
chrA:B:C:D  300 0   0
chrA:B  10000   0   0
chrB    80001   24  0
*   0   0   0

Versus with an old BAI file without the magic values,

$ rm test_chrB_1-100_subset.bam.bai && ./samtools-0.1.7 index test_chrB_1-100_subset.bam && ./samtools-0.1.8 idxstats test_chrB_1-100_subset.bam
chrA:B:C:D  300 0   0
chrA:B  10000   0   0
chrB    80001   0   0
*   0   0   0

i.e. samtools idxstats appears to silently report zeros if the BAI file is missing the counts (rather than warning the index is missing these values).

pmelsted commented 8 years ago

The number of mapped and unmapped reads are stored in the metabin and are computed from the samtools index command when it has all the reads available.

The output chopBai essentially provides a view into the bam file using only information from the original index. From this information alone it is not possible to compute the number of reads within the region.

There are two ways of working with idxstats. The first one is to copy the info, this is bad because it won't match the number of reads you would get with samtools view -c. The second option is to report 0, and explicitly document the fact that idxstats on the chopped index is not correct.

We will most likely go the second route. As for the metabin we are planning to add * as a special region for unmapped reads. This would match the (undocumented) behaviour of samtools to fetch those reads, e.g.

samtools view "*"

returns all unmapped pairs and uses the BAI index., https://twitter.com/mcshane/status/665025704141172736

peterjc commented 8 years ago

For the special case where chopBAI was asked to do an entire reference you could copy the numbers. Is this case common enough to make this exception?

In general, I agree, you can't calculate sensible numbers without scanning the BAM file. Therefore I would omit the magic-bin and say nothing.

(I think samtools ought to say when this data is missing, rather than report zero, but that's not a chopBAI problem).

bkehr commented 8 years ago

Yes, it makes sense to copy the magic bin if you ask for an entire reference. We'll add it for this case and omit the magic bin otherwise.