Closed XqKang closed 1 year ago
Can you make NOC2L.bam available in some way?
Mostly, Rsamtools is filtering out low-quality bases. I focused on the first nucleotide
param <- ScanBamParam(which = GRanges("chr1:944203"))
and changed the min_base_quality
to 0
pilup_params <- PileupParam(
min_base_quality=0,
distinguish_strands = FALSE
)
I then get
> pileup(bamfile, scanBamParam=param, pileupParam = pilup_params)
seqnames pos nucleotide count which_label
1 chr1 944203 - 1 chr1:944203-944203
2 chr1 944203 C 1 chr1:944203-944203
3 chr1 944203 T 137 chr1:944203-944203
which is much better agreement (139 reads, versus 140 for samtools / pysam)! Or more generally
pileup(bamfile, pileupParam = pilup_params) |>
dplyr::as_tibble() |>
dplyr::filter(pos >= 944203) |>
tidyr::pivot_wider(names_from = "nucleotide", values_from = "count") |>
dplyr::rowwise() |>
dplyr::mutate(coverage = sum(`-`, C, T, A, G, na.rm = TRUE))
## # A tibble: 3,246 × 8
## # Rowwise:
## seqnames pos `-` C T A G coverage
## <fct> <int> <int> <int> <int> <int> <int> <int>
## 1 chr1 944203 1 1 137 NA NA 139
## 2 chr1 944204 5 136 4 NA NA 145
## 3 chr1 944205 5 1 5 136 NA 147
## 4 chr1 944206 10 3 7 132 NA 152
## 5 chr1 944207 11 136 3 3 NA 153
## 6 chr1 944208 5 1 13 138 NA 157
## 7 chr1 944209 2 158 2 NA NA 162
## 8 chr1 944210 7 NA 5 152 NA 164
## 9 chr1 944211 8 NA 2 142 13 165
## 10 chr1 944212 4 1 182 NA NA 187
## # ℹ 3,236 more rows
## # ℹ Use `print(n = ...)` to see more rows
One read still seems to have gone missing. I notice using GenomicAlignments
galn <- readGAlignments(
bamfile,
param = ScanBamParam(which = GRanges("chr1:944203"), what = "qname")
)
length(galn) # 140
olaps <- countOverlaps(galn, GRanges("chr1:944203"))
sum(olaps) # 139
galn[olaps == 0]
## GAlignments object with 1 alignment and 1 metadata column:
## seqnames strand cigar qwidth start end width njunc | qname
## <Rle> <Rle> <character> <integer> <integer> <integer> <integer> <integer> | <character>
## [1] chr1 - 6S6M3463N1...6N18M1I35M 648 940749 945109 4361 2 | a4a30d08-be53-4b4f-a..
## -------
## seqinfo: 555 sequences from an unspecified genome
so I will continue to investigate.
One read still seems to have gone missing. I notice using GenomicAlignments
That's because the cigar contains two N runs, which introduces two gaps.
Here are the ranges covered by galn[olaps == 0]
:
> grglist(galn[olaps == 0])
GRangesList object of length 1:
[[1]]
GRanges object with 3 ranges and 0 metadata columns:
seqnames ranges strand
<Rle> <IRanges> <Rle>
[1] chr1 940749-940754 -
[2] chr1 944218-944800 -
[3] chr1 945057-945109 -
-------
seqinfo: 555 sequences from an unspecified genome
so position chr1:944203
is not covered.
Note that findOverlaps()
and family replace the GAlignments subject with grglist(subject)
internally. This is consistent with coverage()
on a GAlignments object, which also passes the object thru grglist()
internally:
> coverage(galn)$chr1[944203]
integer-Rle of length 1 with 1 run
Lengths: 1
Values : 139
If one doesn't want this semantic (i.e. if they want the gaps to contribute to the coverage), then they can replace the GAlignments object with a GRanges object (using granges()
):
> sum(countOverlaps(granges(galn), GRanges("chr1:944203")))
[1] 140
> coverage(granges(galn))$chr1[944203]
integer-Rle of length 1 with 1 run
Lengths: 1
Values : 140
So I suppose that this is the semantic used by samtools / pysam?
Describe the bug I have mRNA sequencing data that I've aligned to a genome. Specifically, I am interested in determining the total number of reads at each base and the percentage of occurrences of A, T, G, C, deletions, and insertions at these bases. I have utilized both Rsamtools pileup and Pysam pileup to achieve this. However, I noticed that their outputs are inconsistent, even though I used the same BAM file as input for both tools. To double check which one is correct. I use samtools on linux to slice out specific one base and found out the totaly number of read align to that base is the same as Pysam output. Not the Rsamtools output. Below is the code I implemented:
To Reproduce Rsamtools: param <- ScanBamParam(which=GRanges(strand = "-", seqnames = "chr1", ranges = IRanges(start=944203, end=959309))) pilup_params = Rsamtools::PileupParam(max_depth = 102500, min_mapq = 20, distinguish_nucleotides = T, ignore_query_Ns = F, include_insertions = T, distinguish_strands = F) bam.pileup = pileup(bamfile, scanBamParam=param, pileupParam = pilup_params) bam.pileup = bam.pileup[order(bam.pileup$pos), ]
Pysam: bamfile = AlignmentFile(bam_dir, "rb") pileups = bamfile.pileup('chr1', 944202, 959308, truncate=True, min_mapping_quality=20) columns=['chr', 'position', "coverage", 'A', 'T', 'G', 'C', '-', '+'] df = pd.DataFrame(columns=columns) for pileupcolumn in pileups: data = [] for pileupread in pileupcolumn.pileups: if pileupread.is_del: data.append('-') elif pileupread.indel: data.append('+') else: data.append(pileupread.alignment.query_sequence[pileupread.query_position]) base_counts = Counter(data)
row = { 'chr': 'chr1', 'position': pileupcolumn.pos+1, 'coverage': pileupcolumn.n, 'A': base_counts['A'], 'T': base_counts['T'], 'G': base_counts['G'], 'C': base_counts['C'], '-': base_counts['-'], '+': base_counts['+'] } df = df.append(row, ignore_index=True)
Here is the samtools command I used on linux samtools view -h -F 256 -F 4 -F 2048 -bq 20 bamfile.bam chr:944202-959303 >NOC2L.bam samtools view -c NOC2L.bam
The samtools linux command output 146 total at chr:944202-959303 which is the same as Pysam. But Rsamtools output 82 reads.
And here is the screen shot of Pysam and Rsamtools output