Rsamtools pileup and Pysam pileup output discrepancy

XqKang commented 1 year ago

Describe the bug I have mRNA sequencing data that I've aligned to a genome. Specifically, I am interested in determining the total number of reads at each base and the percentage of occurrences of A, T, G, C, deletions, and insertions at these bases. I have utilized both Rsamtools pileup and Pysam pileup to achieve this. However, I noticed that their outputs are inconsistent, even though I used the same BAM file as input for both tools. To double check which one is correct. I use samtools on linux to slice out specific one base and found out the totaly number of read align to that base is the same as Pysam output. Not the Rsamtools output. Below is the code I implemented:

To Reproduce Rsamtools: param <- ScanBamParam(which=GRanges(strand = "-", seqnames = "chr1", ranges = IRanges(start=944203, end=959309))) pilup_params = Rsamtools::PileupParam(max_depth = 102500, min_mapq = 20, distinguish_nucleotides = T, ignore_query_Ns = F, include_insertions = T, distinguish_strands = F) bam.pileup = pileup(bamfile, scanBamParam=param, pileupParam = pilup_params) bam.pileup = bam.pileup[order(bam.pileup$pos), ]

Pysam: bamfile = AlignmentFile(bam_dir, "rb") pileups = bamfile.pileup('chr1', 944202, 959308, truncate=True, min_mapping_quality=20) columns=['chr', 'position', "coverage", 'A', 'T', 'G', 'C', '-', '+'] df = pd.DataFrame(columns=columns) for pileupcolumn in pileups: data = [] for pileupread in pileupcolumn.pileups: if pileupread.is_del: data.append('-') elif pileupread.indel: data.append('+') else: data.append(pileupread.alignment.query_sequence[pileupread.query_position]) base_counts = Counter(data)
row = { 'chr': 'chr1', 'position': pileupcolumn.pos+1, 'coverage': pileupcolumn.n, 'A': base_counts['A'], 'T': base_counts['T'], 'G': base_counts['G'], 'C': base_counts['C'], '-': base_counts['-'], '+': base_counts['+'] } df = df.append(row, ignore_index=True)

Here is the samtools command I used on linux samtools view -h -F 256 -F 4 -F 2048 -bq 20 bamfile.bam chr:944202-959303 >NOC2L.bam samtools view -c NOC2L.bam

The samtools linux command output 146 total at chr:944202-959303 which is the same as Pysam. But Rsamtools output 82 reads.

And here is the screen shot of Pysam and Rsamtools output

pysamoutput Rsamtools

mtmorgan commented 1 year ago

Can you make NOC2L.bam available in some way?

XqKang commented 1 year ago

NOC2L_sorted.zip

mtmorgan commented 1 year ago

Mostly, Rsamtools is filtering out low-quality bases. I focused on the first nucleotide

param <- ScanBamParam(which = GRanges("chr1:944203"))

and changed the min_base_quality to 0

pilup_params <- PileupParam(
    min_base_quality=0,
    distinguish_strands = FALSE
)

I then get

> pileup(bamfile, scanBamParam=param, pileupParam = pilup_params)
  seqnames    pos nucleotide count        which_label
1     chr1 944203          -     1 chr1:944203-944203
2     chr1 944203          C     1 chr1:944203-944203
3     chr1 944203          T   137 chr1:944203-944203

which is much better agreement (139 reads, versus 140 for samtools / pysam)! Or more generally

pileup(bamfile, pileupParam = pilup_params) |>
    dplyr::as_tibble() |>
    dplyr::filter(pos >= 944203) |>
    tidyr::pivot_wider(names_from = "nucleotide", values_from = "count") |>
    dplyr::rowwise() |>
    dplyr::mutate(coverage = sum(`-`, C, T, A, G, na.rm = TRUE))
## # A tibble: 3,246 × 8
## # Rowwise: 
##    seqnames    pos   `-`     C     T     A     G coverage
##    <fct>     <int> <int> <int> <int> <int> <int>    <int>
##  1 chr1     944203     1     1   137    NA    NA      139
##  2 chr1     944204     5   136     4    NA    NA      145
##  3 chr1     944205     5     1     5   136    NA      147
##  4 chr1     944206    10     3     7   132    NA      152
##  5 chr1     944207    11   136     3     3    NA      153
##  6 chr1     944208     5     1    13   138    NA      157
##  7 chr1     944209     2   158     2    NA    NA      162
##  8 chr1     944210     7    NA     5   152    NA      164
##  9 chr1     944211     8    NA     2   142    13      165
## 10 chr1     944212     4     1   182    NA    NA      187
## # ℹ 3,236 more rows
## # ℹ Use `print(n = ...)` to see more rows

One read still seems to have gone missing. I notice using GenomicAlignments

galn <- readGAlignments(
    bamfile,
    param = ScanBamParam(which = GRanges("chr1:944203"), what = "qname")
)
length(galn) # 140

olaps <- countOverlaps(galn, GRanges("chr1:944203"))
sum(olaps)   # 139

galn[olaps == 0]
## GAlignments object with 1 alignment and 1 metadata column:
##       seqnames strand                   cigar    qwidth     start       end     width     njunc |                  qname
##          <Rle>  <Rle>             <character> <integer> <integer> <integer> <integer> <integer> |            <character>
##   [1]     chr1      - 6S6M3463N1...6N18M1I35M       648    940749    945109      4361         2 | a4a30d08-be53-4b4f-a..
##   -------
##   seqinfo: 555 sequences from an unspecified genome

so I will continue to investigate.

hpages commented 1 year ago

One read still seems to have gone missing. I notice using GenomicAlignments

That's because the cigar contains two N runs, which introduces two gaps.

Here are the ranges covered by galn[olaps == 0]:

> grglist(galn[olaps == 0])
GRangesList object of length 1:
[[1]]
GRanges object with 3 ranges and 0 metadata columns:
      seqnames        ranges strand
         <Rle>     <IRanges>  <Rle>
  [1]     chr1 940749-940754      -
  [2]     chr1 944218-944800      -
  [3]     chr1 945057-945109      -
  -------
  seqinfo: 555 sequences from an unspecified genome

so position chr1:944203 is not covered.

Note that findOverlaps() and family replace the GAlignments subject with grglist(subject) internally. This is consistent with coverage() on a GAlignments object, which also passes the object thru grglist() internally:

> coverage(galn)$chr1[944203]
integer-Rle of length 1 with 1 run
  Lengths:   1
  Values : 139

If one doesn't want this semantic (i.e. if they want the gaps to contribute to the coverage), then they can replace the GAlignments object with a GRanges object (using granges()):

> sum(countOverlaps(granges(galn), GRanges("chr1:944203")))
[1] 140
> coverage(granges(galn))$chr1[944203]
integer-Rle of length 1 with 1 run
  Lengths:   1
  Values : 140

So I suppose that this is the semantic used by samtools / pysam?

Bioconductor / Rsamtools

Rsamtools pileup and Pysam pileup output discrepancy #59