Bioconductor / Rsamtools

Binary alignment (BAM), FASTA, variant call (BCF), and tabix file import
https://bioconductor.org/packages/Rsamtools
Other
27 stars 27 forks source link

What pileup options to use for nanopore sequencing data? #18

Closed kac2053 closed 3 years ago

kac2053 commented 3 years ago

Questions: Is there any way to index which reads contribute to pileup count data? What options should we use to run pileup on Nanopore sequencing data (particularly for quality score option)? Are there any other filters that pileup implements by default (other than minimum base quality [13], minimum mapping quality [0], and max depth [8000])?

Description: We have previously published data using Rsamtools pileup to generate read counts overlapping mutation sites. We are now using a tool called sam2tsv to index which read IDs belong to which counts. However when I count up the number of reads to the mutation in the sam2tsv data, the counts are not matching up with pileup counts despite filtering sam2tsv data based on pileup default filters and additional filters I applied in the command.

The quality scores in Nanopore do not fully match the ASCII_BASE 33 and 64 scores. Quality score table: https://www.drive5.com/usearch/manual/quality_score.html Our data quality scores partially follow ASCII 33 but you can see in the table that the score stops at "K = 42" being the highest score. However, our data continues all the way down to Z, with 57 being the highest score.

What options should we use in pileup to take into account Nanopore data?

Commands:

pileup( bf, a2g.snp,
                   scanBamParam=ScanBamParam( flag = scanBamFlag( isDuplicate=F ), 
                                              which = a2g.snp ),
                   pileupParam=PileupParam( distinguish_strands=F) )
pileup( bf, a2g.snp,
                   scanBamParam=ScanBamParam( flag = scanBamFlag( isDuplicate=F ),
                                              which = a2g.snp ),
                   pileupParam=PileupParam( distinguish_strands=F),
                   phred2ASCIIOffset(scheme= c("Sanger")))

The second command was based on reading in a forum that the default is ASCII 64, so I tried to change it to 33 by writing phred2ASCIIOffset(scheme= c("Sanger")) even though Nanopore only partially matches ASCII 33. Both commands seem to give the same output.

My expertise is not in bioinformatics, so your feedback for how to use pileup on Nanopore data would be greatly appreciated.

lshep commented 3 years ago

Could you please ask these types of questions on the support.bioconductor.org site. You will get more answers there.