StephanHolgerD / BGZblocksfrombam

0 stars 1 forks source link

Bug: does not work for chromosomes 1-3 #2

Open berntpopp opened 2 months ago

berntpopp commented 2 months ago

chromosomes 1-3 have a difefrent block structure I think and thus break the code:

  File "/mnt/c/development/varvis-download/development/bamnostic/extract_bam_subset.py", line 111, in <module>
    extract_bam_region_to_file(bam_file, bai_file, chromosome, start_coords, end_coords, output_file)
  File "/mnt/c/development/varvis-download/development/bamnostic/extract_bam_subset.py", line 80, in extract_bam_region_to_file
    first_block, middle_blocks, last_block = extract_bam_region(b_idx, bam_file, ref_id, start_coords, end_coords, padd=100000)
                                             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/mnt/c/development/varvis-download/development/bamnostic/extract_bam_subset.py", line 39, in extract_bam_region
    last_block = process_last_block(chunk2, end_startoff)
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/mnt/c/development/varvis-download/development/bamnostic/extract_bam_subset.py", line 56, in process_last_block
    frst_blck_end = chunk2[:values[0][1]]
berntpopp commented 2 months ago

https://cmdcolin.github.io/bam_index_visualizer/

berntpopp commented 2 months ago

In the BAI format, each bin may span 229, 226, 223, 220, 217 or 214 bp. Bin 0 spans a 512Mbp region, bins 1–8 span 64Mbp, 9–72 8Mbp, 73–584 1Mbp, 585–4680 128kbp, and bins 4681–37448 span 16kbp regions. This implies that this index format does not support reference chromosome sequences longer than 229 − 1.

https://samtools.github.io/hts-specs/SAMv1.pdf

StephanHolgerD commented 2 months ago

I tested with regions on chr1, chr2 and chr4. Will check you pr, can we create some test data ?