bernatgel / karyoploteR

karyoploteR - An R/Bioconductor package to plot arbitrary data along the genome
https://bernatgel.github.io/karyoploter_tutorial/
293 stars 42 forks source link

coverage looks different than IGV #62

Open slowkow opened 4 years ago

slowkow commented 4 years ago

It seems like karyoploteR is computing coverage in a different way than IGV.

I think that IGV is ignoring the reads that have pairs mapping outside the viewing region. In contrast, karyoploteR is including those reads as if they cover those bases.

I'm only interested to know about the actual sequenced bases supported by actual reads. I'm not interested in the gaps in the read alignments, since the gaps do not tell me anything about how well a particular position is covered.

Could I please ask if you might have any hints about how to get the desired behavior?

Here's what I see from karyoploteR:

image

Here's what I see in IGV:

image

Here's my code:

#!/usr/bin/env Rscript

bam_file <- "possorted_genome_bam.bam"

# BiocManager::install(
#   c("TxDb.Hsapiens.UCSC.hg38.knownGene", "GenomicRanges", "karyoploteR", "EnsDb.Hsapiens.v86")
# )

library(TxDb.Hsapiens.UCSC.hg38.knownGene)
library(GenomicRanges)
library(karyoploteR)
library(magrittr)

txdb_file <- "gencode.v32.annotation.sqlite"
if (!file.exists(txdb_file)) {
  # txdb <- makeTxDbFromGFF(file = "gencode.v32.annotation.gtf3.gz")
  txdb <- makeTxDbFromGFF(
    file = "temp2.gtf.gz",
    format = "gtf",
    organism = "Homo sapiens"
  )
  saveDb(txdb, txdb_file)
} else {
  txdb <- loadDb(txdb_file)
}

gencode_rds <- "gencode.v32.annotation.gff3.rds"
if (!file.exists(gencode_rds)) {
  gencode <- rtracklayer::import.gff3("gencode.v32.annotation.gff3.gz")
  saveRDS(gencode, gencode_rds)
} else {
  gencode <- readRDS(gencode_rds)
}

genes <- gencode[,c("gene_type", "gene_id", "gene_name")] %>% unique()
protein_coding_ids <- genes$gene_id[genes$gene_type == "protein_coding"]
x <- gencode[,c("gene_name", "gene_id")] %>% unique()
ensembl_to_symbol <- x$gene_name
names(ensembl_to_symbol) <- x$gene_id
genes[genes$gene_id == "ENSG00000227766.1",]
genes[genes$gene_id == "ENSG00000206503.13",]

pp <- getDefaultPlotParams(plot.type=1)
pp$leftmargin <- 0.15
pp$topmargin <- 15
pp$bottommargin <- 15
pp$ideogramheight <- 5
pp$data1inmargin <- 10
pp$data1outmargin <- 0

regions <- list(
  "HLA-A" = toGRanges("6:29,941,260-29,945,884")
)
kp <- plotKaryotype(genome = regions[["HLA-A"]], zoom = regions[["HLA-A"]], cex = 1, plot.params = pp)
kpAddBaseNumbers(kp, tick.dist = 10000, minor.tick.dist = 2000,
                 add.units = TRUE, cex=2, tick.len = 3)
genes.data <- makeGenesDataFromTxDb(
  txdb                       = txdb,
  karyoplot                  = kp,
  plot.transcripts           = TRUE,
  plot.transcripts.structure = TRUE
)
protein_coding_ids]
genes.data$genes$name <- ensembl_to_symbol[mcols(genes.data$genes)[,1]]
kpPlotGenes(kp, data = genes.data, gene.name.cex = 2, r0 = 0, r1 = 0.15)

# How can I make this ignore the gaps in reads?
kp <- kpPlotBAMCoverage(kp, data = bam_file, r0 = 0.35, r1 = 1)
computed.ymax <- kp$latest.plot$computed.values$max.coverage
kpAxis(kp, ymin = 0, ymax = computed.ymax, r0 = 0.35, r1 = 1)

kpAddMainTitle(kp, "Coverage", cex = 2)
slowkow commented 4 years ago

I have a feeling that I might need to try modifying this line from kpPlotBAMCoverage():

bam.cov <- bamsignals::bamCoverage(data, karyoplot$plot.region, verbose = FALSE)

Perhaps I should try setting bamCoverage(paired.end = "ignore")?

I'll update this issue if that seems to work.

bernatgel commented 4 years ago

HI @slowkow

When I implemented kpPlotBamCoverage I was working mainly with exome and genome data and that was not an issue. I see your problem and I think it's an important feature to add to the package.

As you have seen, the coverage data comes from bamsignals so the key here will be to get it return the desired values. If you get it to work, I'll be more than happy to accept a pull request or some code :) otherwise I'll add it to the (not so short!) TODO list!

Thanks for reporting!

Bernat

slowkow commented 4 years ago

This code seems to give a coverage profile that is more similar to IGV.

library(Rsamtools)
summaryFunction <- function(gr, bamFile, ...) {
  param <- ScanBamParam(
    what = c("pos", "qwidth"),
    which = gr,
    flag = scanBamFlag(isUnmappedQuery = FALSE)
  )
  x <- scanBam(bamFile, ..., param = param)[[1]]
  coverage(IRanges(x[["pos"]], width = x[["qwidth"]]))
}
cvg <- summaryFunction(regions[[this_gene]], bam_file)
d_cvg <- as.data.frame(cvg)
d_cvg$x <- rownames(d_cvg)
d_cvg <- d_cvg[d_cvg[,1] > 0,]
colnames(d_cvg) <- c("y", "x")
d_cvg$x <- as.integer(d_cvg$x)
kpArea(
  kp,
  chr = as.character(seqnames(regions[[this_gene]])),
  x = d_cvg$x,
  y = d_cvg$y,
  ymin = 0,
  ymax = max(d_cvg$y),
  r0 = 0.25, r1 = 1, col = "grey40"
)
kpAxis(kp, ymin = 0, ymax = max(d_cvg$y), r0 = 0.25, r1 = 1)

Here is the contents of regions[["HLA-A"]]:

> regions[["HLA-A"]]
GRanges object with 1 range and 0 metadata columns:
      seqnames            ranges strand
         <Rle>         <IRanges>  <Rle>
  [1]        6 29941160-29945984      *
  -------
  seqinfo: 1 sequence from an unspecified genome; no seqlengths

image

The figure is still not identical to IGV, because IGV reports a range from 0 to 4361 and Rsamtools reports a range from 0 to 6666. My best guess is that IGV implicitly applies some flags to filter reads. It would be nice if we could find where IGV defines those flags.