biod / sambamba

Tools for working with SAM/BAM data
http://thebird.nl/blog/D_Dragon.html
GNU General Public License v2.0
558 stars 104 forks source link

sambamba index incompatible with htslib for HTS_IDX_NOCOOR ("*" contig) #322

Closed ghost closed 6 years ago

ghost commented 6 years ago

Problem:

How to replicate:

sambamba --version

sambamba 0.6.6

This version was built with:
    LDC 1.1.0
    using DMD v2.071.2
    using LLVM 3.9.1
    bootstrapped with LDC - the LLVM D compiler (0.17.1)

samtools --version

samtools 1.3-10-gc613147
Using htslib 1.3-22-gacc0763
Copyright (C) 2016 Genome Research Ltd.

g++ --std=c++14 read_nocoor.cpp -o rno /usr/local/lib/libhts.dylib /usr/lib/libz.dylib (You may need to adjust for your local install)

python3 sambamba_bug.py

samtools index small_test.bam

./rno
Found 1 reads in region *

sambamba index small_test.bam

./rno
Found 0 reads in region *

Possibly important clue

If there are ONLY NOCOOR reads in the BAM file then this bug does not appear.

Code

read_nocoor.cpp

/*
  Bare minimum program to open a BAM and print out the reads from the NOCOOR
  region.

  compile with

  g++ --std=c++14 read_nocoor.cpp -o rno /usr/local/lib/libhts.dylib /usr/lib/libz.dylib

  **Modify for your local install of htslib and zlib**
*/
#include <iostream>
#include <stdexcept>
#include <htslib/sam.h>

int main( int argc, char* argv[] )
{
  samFile*       fp;
  bam_hdr_t* header;
  hts_idx_t*    idx;
  hts_itr_t*   iter;

  char fname[] = "small_test.bam";
  //char fname[] = "/Users/kghose/Code/mitty3/sandbox/integration-test5-A.sorted.bam";
  char contig[] = "*";

  fp = sam_open( fname, "rb" );
  if( fp == nullptr) { std::runtime_error("Error opening BAM file"); }

  header = sam_hdr_read( fp );
  if( header == nullptr) { std::runtime_error("Error opening header"); }

  idx = sam_index_load( fp, fname );
  if( idx == nullptr) { std::runtime_error("Error opening index"); }

  iter = sam_itr_querys( idx, header, contig );    
  if( iter == nullptr) { std::runtime_error("Error creating iterator"); }

  bam1_t* this_read;
  size_t read_counts = 0;
  for(;;)
  {
    this_read = bam_init1();
    if( sam_itr_next( fp, iter, this_read ) < 0 )
    {
      bam_destroy1( this_read );
      break; 
    }

    bam_destroy1( this_read );
    read_counts++;
  }

  std::cout << "Found " << read_counts << " reads in region *" << std::endl;

  if( fp != nullptr )
  {
    hts_itr_destroy( iter );
    hts_idx_destroy( idx );
    bam_hdr_destroy( header );  
    sam_close( fp );                
  }
}

sambamba_bug.py

"""
The purpose of this script is to create a minimal test case that demonstrates
that if there are unmapped NOCOOR reads at the end of a BAM, htslib does not find them
if sambamba indexes the file, though samtools index works as expected
"""
import pysam

def create_read(tid, pos):
  a = pysam.AlignedSegment()
  a.query_name = "read_28833_29006_6945"
  a.query_sequence = "A" * 20
  a.flag = 0b1000010 | (0b0 if tid > -1 else 0b100)
  a.reference_id = tid
  a.reference_start = pos
  a.mapping_quality = 20 if tid > -1 else 0
  a.cigar = ((0, 20),)
  a.next_reference_id = 0
  a.next_reference_start = 199
  # a.template_length = 167
  a.query_qualities = pysam.qualitystring_to_array("<" * 20)
  return a

# Code lifted from pysam example in docs
header = {'HD': {'VN': '1.0'},
          'SQ': [{'LN': 1575, 'SN': 'chr1'},
                 {'LN': 1584, 'SN': 'chr2'}]}

fout = 'small_test.bam'

with pysam.AlignmentFile(fout, "wb", header=header) as outf:
  outf.write(create_read(0, 100))
  outf.write(create_read(-1, 100))

outf.close()
pjotrp commented 6 years ago

Thank you for this excellent report! @kghosesbg do you mind signing up to our mailing list if you have not done so? We will discuss sambamba development and priorities. This certainly belongs there!

pjotrp commented 6 years ago

For now, the easy fix is to use samtools. I'll add this to a milestone.

jfarek commented 6 years ago

Hi,

I believe that this issue has been resolved in release v0.6.7. This issue stems from BAI pseudo-bin calculation in indexing.d from BioD. The BioD issue page that covered an incorrect pseudo-bin calculation in indexing.d doesn't appear to exist anymore, but you can see that the latest commit to this file applied the fix that I suggested for that issue. Release v0.6.7 of sambamba appears to have folded in this fix from BioD.

I have not encountered this issue when using sambamba v0.6.7 to index BAMs. You can confirm that this is the case in general.

pjotrp commented 6 years ago

Thanks @jfarek!