Closed lh3 closed 2 years ago
To provide a some progress, I applied the patch but it introduced memory corruption errors, probably because the library code is ancient (pre-htslib). I'll work briefly on patching the current code, but will likely spend the time to update to a much more modern version. This will take some time.
@mtmorgan with samtools/htslib@aea349a, htslib now formally supports long cigars in BAM. I have just added another commit to keep this PR in sync with the final solution, which is slightly different from my original PR. Let me if I can help with anything. I am not familiar with the building system of Bioconductor, but I can learn. Thanks.
This works with current Rhtslib/Rsamtools and has probably be working for a while (since Rsamtools was migrated to Rhtslib in 2018?):
## Download the sample files:
remote_sample_dir <- "http://lh3lh3.users.sourceforge.net/data/cigar-64k/"
bam_file <- "cigar-64k.bam"
bai_file <- paste0(bam_file, ".bai")
local_sample_dir <- tempdir()
url <- paste0(remote_sample_dir, "/", bam_file)
destfile <- paste0(local_sample_dir, "/", bam_file)
download.file(url, destfile)
url <- paste0(remote_sample_dir, "/", bai_file)
destfile <- paste0(local_sample_dir, "/", bai_file)
download.file(url, destfile)
## Load the file in R:
library(Rsamtools)
bam <- scanBam("cigar-64k.bam")
nchar(bam[[1]]$cigar)
# [1] 35349 161357 43346
substr(bam[[1]]$cigar, 1, 50)
# [1] "93S29M2D29M1D19M1D31M3D12M1D5M1D1M1D47M1I7M2D18M1D"
# [2] "87S13M2D4M1D21M2D2M1D6M1D17M2D9M2D36M4D16M1D9M1I4M"
# [3] "1S5M1D6M1D14M4D19M2D21M1D20M1I4M5D11M2D50M1D9M2D5M"
sessionInfo():
R Under development (unstable) (2022-03-24 r81969)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 21.10
Matrix products: default
BLAS: /home/hpages/R/R-4.2.r81969/lib/libRblas.so
LAPACK: /home/hpages/R/R-4.2.r81969/lib/libRlapack.so
locale:
[1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
[3] LC_TIME=en_GB LC_COLLATE=en_US.UTF-8
[5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
[7] LC_PAPER=en_US.UTF-8 LC_NAME=C
[9] LC_ADDRESS=C LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
attached base packages:
[1] stats4 stats graphics grDevices utils datasets methods
[8] base
other attached packages:
[1] Rsamtools_2.12.0 Biostrings_2.64.0 XVector_0.36.0
[4] GenomicRanges_1.48.0 GenomeInfoDb_1.32.1 IRanges_2.30.0
[7] S4Vectors_0.34.0 BiocGenerics_0.42.0
loaded via a namespace (and not attached):
[1] crayon_1.5.1 bitops_1.0-7 zlibbioc_1.42.0
[4] BiocParallel_1.30.0 tools_4.2.0 RCurl_1.98-1.6
[7] parallel_4.2.0 compiler_4.2.0 GenomeInfoDbData_1.2.8
This PR enlarges
bam1_core_t::n_cigar
to 32 bits. It is a necessary ABI change to hold cigars longer than 65535 in memory. For typical BAM records, the PR changes nothing. For a record with >65535 cigar operators:When writing SAM, the PR has no effect. SAM naturally works with long cigars.
When writing BAM to disk, the PR writes a fake cigar
<readLength>S
at the original cigar position and puts the real cigar at theCG:B,I
(binary array type) at the end.When reading from BAM into memory, the PR calls
bam_tag2cigar()
to move the real cigar back to the right position.When reading from SAM into memory: a) if SAM uses real cigar, read normally; b) if SAM puts the real cigar at the CG tag (e.g. output by
minimap2 -L
or generated by older tools unaware of the CG tag), the PR reads normally and then moves the real cigar back withbam_tag2cigar()
.To endusers and most developers, these operations are seamless. They won't notice the existence of the CG tag. All APIs stay the same, too.
I am not familiar with the bioc build system, so I have not tested the code on real data. Nonetheless, this PR is very close to the change to htslib. It should not have big issues, I believe. You can find sample files here. Your help on testing will be much appreciated. By the way, the precompiled samtools binary there already supports long cigars. It may help testing.