Bioconductor / Rsamtools

Binary alignment (BAM), FASTA, variant call (BCF), and tabix file import
https://bioconductor.org/packages/Rsamtools
Other
27 stars 27 forks source link

request to add support for pacbio unaligned bam files? #15

Closed jayoung closed 5 years ago

jayoung commented 5 years ago

hi there,

I'm beginning to look at some PacBio CCS data. It looks like PacBio read data now come as bam files, even for unmapped reads:
https://pacbiofileformats.readthedocs.io/en/5.1/index.html https://pacbiofileformats.readthedocs.io/en/5.1/BAM.html

I'd love to be able to read in this bam file to R directly (and include some of those tags, much like scanBam does for bam files of mapped reads).

scanBam seems like an obvious choice, but it won't read the PacBio bam files because the headers don't contain @SQ lines (because the reads haven't yet been mapped to a genome). Would it be possible to remove this restriction on scanBam, that reads should have been mapped to a genome and therefore have @SQ lines in the header? The asSam function seems to have the same restriction (but command-line 'samtools view' works fine on these files).

I realize that most of the fields usually returned by scanBam are not relevant for unmapped reads, but the infrastructure it provides for reading in the bam file and parsing extra tags seems really useful here. I know as a workaround I can use samtools on the command line and then 'scan', or convert the bam to fastq format, but using the bam file directly would be great in future.

An example of what I'd like to do is below.

thanks for considering!

Janet

## get an example file
wget https://downloads.pacbcloud.com/public/dataset/HG002_SV_and_SNV_CCS/consensusreads/m64011_181218_235052.consensusreads.bam
wget https://downloads.pacbcloud.com/public/dataset/HG002_SV_and_SNV_CCS/consensusreads/m64011_181218_235052.consensusreads.bam.bai    

## start R
R
library(Rsamtools)

bam <- scanBam("m64011_181218_235052.consensusreads.bam")
[samopen] no @SQ lines in the header.
Error in value[[3L]](cond) : 
  failed to open BamFile: SAM/BAM header missing or empty
  file: 'm64011_181218_235052.consensusreads.bam'

sessionInfo()

R version 3.6.1 (2019-07-05)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 14.04.5 LTS

Matrix products: default
BLAS/LAPACK: /app/easybuild/software/OpenBLAS/0.2.18-GCC-5.4.0-2.26-LAPACK-3.6.1/lib/libopenblas_prescottp-r0.2.18.so

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] stats4    parallel  stats     graphics  grDevices utils     datasets 
[8] methods   base     

other attached packages:
[1] Rsamtools_2.2.0      Biostrings_2.54.0    XVector_0.24.0      
[4] GenomicRanges_1.36.0 GenomeInfoDb_1.20.0  IRanges_2.18.1      
[7] S4Vectors_0.22.0     BiocGenerics_0.32.0 

loaded via a namespace (and not attached):
[1] zlibbioc_1.30.0        compiler_3.6.1         tools_3.6.1           
[4] GenomeInfoDbData_1.2.1 RCurl_1.95-4.12        BiocParallel_1.18.0   
[7] bitops_1.0-6          
lawremi commented 5 years ago

Seems like they should include the sequence information given that there was an attempt to map the reads, but I guess that's irrelevant.

jayoung commented 5 years ago

Sorry - that wasn't clear. There's been no attempt to map the reads yet - they're just plain reads, and they're still stored as bam files. No reference genome involved yet. I think PacBio is choosing bam because it lets them capture more information for each read than fastq would allow.

Not sure if this is relevant, but the datatype I'm working with is PacBio's CCS consensus sequences- I believe those will be coming as bam files from now on. I'm not 100% sure, but I think the single pass reads will be bams, too.

Other example datasets, of various types, here: https://github.com/PacificBiosciences/DevNet/wiki/Datasets

I'm interested in maintaining the bam format, partly because after some processing in R, I'll use the BLASR tool to map reads to the genome (https://github.com/PacificBiosciences/blasr), and BLASR prefers to have bam files as input.

For now I'm using command-line samtools to convert bam to headerless-sam, and using sam as a tab-delimited input/output file for R processing, and then converting back to bam again on the command line. It's working fine, but it'd be nice to skip those command-line steps.

mtmorgan commented 5 years ago

FWIW I used a lighter-weight example

url = "https://downloads.pacbcloud.com/public/dataset/HG002_SV_and_SNV_CCS/consensusreads/m64011_181218_235052.consensusreads.bam"

bf = BamFile(url, yieldSize = 100)

seqinfo(bf)
open(bf)
res <- scanBam(bf)

while waiting for the full 11Gb BAM file to download...

jayoung commented 5 years ago

Thanks, Martin - that was quick! Will that be in the release branch, or just devel for now? Miss you in Seattle!

mtmorgan commented 5 years ago

I thought of this as a 'feature' rather than 'bug fix' so it's in devel.

jayoung commented 5 years ago

yes, that makes sense - thanks again