Closed by jayoung 5 years ago
Seems like they should include the sequence information given that there was an attempt to map the reads, but I guess that's irrelevant.
Sorry - that wasn't clear. There's been no attempt to map the reads yet - they're just plain reads, and they're still stored as bam files. No reference genome involved yet. I think PacBio is choosing bam because it lets them capture more information for each read than fastq would allow.
Not sure if this is relevant, but the datatype I'm working with is PacBio's CCS consensus sequences. I believe those will be coming as bam files from now on. I'm not 100% sure, but I think the single-pass reads will be bams, too.
Other example datasets, of various types, here: https://github.com/PacificBiosciences/DevNet/wiki/Datasets
I'm interested in maintaining the bam format, partly because after some processing in R, I'll use the BLASR tool to map reads to the genome (https://github.com/PacificBiosciences/blasr), and BLASR prefers to have bam files as input.
For now I'm using command-line samtools to convert bam to headerless-sam, and using sam as a tab-delimited input/output file for R processing, and then converting back to bam again on the command line. It's working fine, but it'd be nice to skip those command-line steps.
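For anyone wanting to do the same, here is a minimal sketch of the R side of that workaround, using base R only. The two SAM records are made up for illustration, and `get_tag` is a hypothetical helper, not part of any package. The tag names (`np`, `rq`, `zm`) follow the PacBio BAM format documentation; in practice `sam_lines` would come from `readLines()` on the file that samtools wrote.

```r
# Two made-up headerless SAM records for unaligned reads (flag 4, RNAME "*").
sam_lines <- c(
  "read1\t4\t*\t0\t255\t*\t*\t0\t0\tACGT\tIIII\tnp:i:12\trq:f:0.999",
  "read2\t4\t*\t0\t255\t*\t*\t0\t0\tGGCA\tIIII\tnp:i:8\trq:f:0.995\tzm:i:42"
)

# Names of the 11 mandatory SAM columns.
sam_cols <- c("qname", "flag", "rname", "pos", "mapq", "cigar",
              "rnext", "pnext", "tlen", "seq", "qual")

# Records can carry different numbers of optional tags, so split each line
# by hand instead of relying on a fixed-width table reader.
fields <- strsplit(sam_lines, "\t", fixed = TRUE)

# The first 11 columns are mandatory; everything after them is an optional tag.
core <- as.data.frame(do.call(rbind, lapply(fields, `[`, 1:11)),
                      stringsAsFactors = FALSE)
names(core) <- sam_cols

# Hypothetical helper: extract the value of one TAG:TYPE:VALUE field,
# or NA if the record does not carry that tag.
get_tag <- function(x, tag) {
  hit <- grep(paste0("^", tag, ":"), x[-(1:11)], value = TRUE)
  if (length(hit) == 0) NA_character_ else sub("^[^:]+:[^:]+:", "", hit[1])
}

core$np <- as.integer(vapply(fields, get_tag, character(1), tag = "np"))
core$zm <- as.integer(vapply(fields, get_tag, character(1), tag = "zm"))
```

Splitting on tabs by hand keeps the optional tags intact even when records differ in length, which a plain `read.delim()` call does not handle without extra arguments.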
FWIW I used a lighter-weight example
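library(Rsamtools)
# Unaligned CCS reads, streamed from the remote file 100 records at a time
url <- "https://downloads.pacbcloud.com/public/dataset/HG002_SV_and_SNV_CCS/consensusreads/m64011_181218_235052.consensusreads.bam"
bf <- BamFile(url, yieldSize = 100)
seqinfo(bf)         # expected to be empty when the header has no @SQ lines
open(bf)
res <- scanBam(bf)  # reads the first chunk of up to 100 records
close(bf)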
while waiting for the full 11 GB BAM file to download...
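To pull in only some fields plus the PacBio tags, a call along these lines might work. This is a sketch, not something tested against the file above: `read_ccs_chunk` is a hypothetical helper name, the tag names (`np`, `rq`, `zm`) are taken from the PacBio BAM format documentation, and it assumes an Rsamtools version that accepts headers without @SQ lines.

```r
# Sketch only: read selected fields and PacBio tags from an unaligned BAM.
# `read_ccs_chunk` is a hypothetical helper, not part of Rsamtools.
read_ccs_chunk <- function(bam_path, n = 100) {
  stopifnot(requireNamespace("Rsamtools", quietly = TRUE))

  # yieldSize limits each scanBam() call to n records.
  bf <- Rsamtools::BamFile(bam_path, yieldSize = n)

  # Keep only read name, sequence and qualities, plus three per-read tags.
  param <- Rsamtools::ScanBamParam(
    what = c("qname", "seq", "qual"),
    tag = c("np", "rq", "zm")
  )

  open(bf)
  on.exit(close(bf))

  # scanBam() returns a list with one element per requested range;
  # with no ranges given there is a single element.
  Rsamtools::scanBam(bf, param = param)[[1]]
}
```

Usage would be something like `chunk <- read_ccs_chunk(url)`, after which the tag values should be under `chunk$tag`.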
Thanks, Martin - that was quick! Will that be in the release branch, or just devel for now? Miss you in Seattle!
I thought of this as a 'feature' rather than 'bug fix' so it's in devel.
yes, that makes sense - thanks again
hi there,
I'm beginning to look at some PacBio CCS data. It looks like PacBio read data now come as bam files, even for unmapped reads:
https://pacbiofileformats.readthedocs.io/en/5.1/index.html https://pacbiofileformats.readthedocs.io/en/5.1/BAM.html
I'd love to be able to read in this bam file to R directly (and include some of those tags, much like scanBam does for bam files of mapped reads).
scanBam seems like an obvious choice, but it won't read the PacBio bam files because the headers don't contain @SQ lines (because the reads haven't yet been mapped to a genome). Would it be possible to remove this restriction on scanBam, that reads should have been mapped to a genome and therefore have @SQ lines in the header? The asSam function seems to have the same restriction (but command-line 'samtools view' works fine on these files).
I realize that most of the fields usually returned by scanBam are not relevant for unmapped reads, but the infrastructure it provides for reading in the bam file and parsing extra tags seems really useful here. I know as a workaround I can use samtools on the command line and then 'scan', or convert the bam to fastq format, but using the bam file directly would be great in future.
An example of what I'd like to do is below.
thanks for considering!
Janet