Bioconductor / Rsamtools

Binary alignment (BAM), FASTA, variant call (BCF), and tabix file import
https://bioconductor.org/packages/Rsamtools
Other
27 stars 27 forks source link

support BAMs with >65535 CIGAR operators #1

Closed lh3 closed 2 years ago

lh3 commented 7 years ago

This PR enlarges bam1_core_t::n_cigar to 32 bits. It is a necessary ABI change to hold cigars longer than 65535 in memory. For typical BAM records, the PR changes nothing. For a record with >65535 cigar operators:

To endusers and most developers, these operations are seamless. They won't notice the existence of the CG tag. All APIs stay the same, too.

I am not familiar with the bioc build system, so I have not tested the code on real data. Nonetheless, this PR is very close to the change to htslib. It should not have big issues, I believe. You can find sample files here. Your help on testing will be much appreciated. By the way, the precompiled samtools binary there already supports long cigars. It may help testing.

mtmorgan commented 7 years ago

To provide a some progress, I applied the patch but it introduced memory corruption errors, probably because the library code is ancient (pre-htslib). I'll work briefly on patching the current code, but will likely spend the time to update to a much more modern version. This will take some time.

lh3 commented 7 years ago

@mtmorgan with samtools/htslib@aea349a, htslib now formally supports long cigars in BAM. I have just added another commit to keep this PR in sync with the final solution, which is slightly different from my original PR. Let me if I can help with anything. I am not familiar with the building system of Bioconductor, but I can learn. Thanks.

hpages commented 2 years ago

This works with current Rhtslib/Rsamtools and has probably be working for a while (since Rsamtools was migrated to Rhtslib in 2018?):

## Download the sample files:
remote_sample_dir <- "http://lh3lh3.users.sourceforge.net/data/cigar-64k/"
bam_file <- "cigar-64k.bam"
bai_file <- paste0(bam_file, ".bai")

local_sample_dir <- tempdir()
url <- paste0(remote_sample_dir, "/", bam_file)
destfile <- paste0(local_sample_dir, "/", bam_file)
download.file(url, destfile)
url <- paste0(remote_sample_dir, "/", bai_file)
destfile <- paste0(local_sample_dir, "/", bai_file)
download.file(url, destfile)

## Load the file in R:
library(Rsamtools)
bam <- scanBam("cigar-64k.bam")
nchar(bam[[1]]$cigar)
# [1]  35349 161357  43346
substr(bam[[1]]$cigar, 1, 50)
# [1] "93S29M2D29M1D19M1D31M3D12M1D5M1D1M1D47M1I7M2D18M1D"
# [2] "87S13M2D4M1D21M2D2M1D6M1D17M2D9M2D36M4D16M1D9M1I4M"
# [3] "1S5M1D6M1D14M4D19M2D21M1D20M1I4M5D11M2D50M1D9M2D5M"

sessionInfo():

R Under development (unstable) (2022-03-24 r81969)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 21.10

Matrix products: default
BLAS:   /home/hpages/R/R-4.2.r81969/lib/libRblas.so
LAPACK: /home/hpages/R/R-4.2.r81969/lib/libRlapack.so

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
 [3] LC_TIME=en_GB              LC_COLLATE=en_US.UTF-8    
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] stats4    stats     graphics  grDevices utils     datasets  methods  
[8] base     

other attached packages:
[1] Rsamtools_2.12.0     Biostrings_2.64.0    XVector_0.36.0      
[4] GenomicRanges_1.48.0 GenomeInfoDb_1.32.1  IRanges_2.30.0      
[7] S4Vectors_0.34.0     BiocGenerics_0.42.0 

loaded via a namespace (and not attached):
[1] crayon_1.5.1           bitops_1.0-7           zlibbioc_1.42.0       
[4] BiocParallel_1.30.0    tools_4.2.0            RCurl_1.98-1.6        
[7] parallel_4.2.0         compiler_4.2.0         GenomeInfoDbData_1.2.8