ShixiangWang / sigminer

🌲 An easy-to-use and scalable toolkit for genomic alteration signature (a.k.a. mutational signature) analysis and visualization in R https://shixiangwang.github.io/sigminer/reference/index.html
https://shixiangwang.github.io/sigminer/
Other
144 stars 18 forks source link

show_cn_distribution with scale_chr = T #420

Closed jrcodina closed 2 years ago

jrcodina commented 2 years ago

In line 40 of the function:

if (scale_chr) {
      chrlen <- get_genome_annotation(data_type = "chr_size", 
        genome_build = genome_build)
      p <- ggplot(data, aes(x = chromosome, fill = location)) + 
        geom_bar() + xlab("Chromosome")
      q <- ggplot_build(p)$data[[1]][, c("x", "count", 
        "fill")]
      q$x <- factor(q$x, levels = c(1:23), labels = c(1:22, 
        "X"))
      q$fill <- factor(q$fill, levels = c("#F8766D", "#00BA38", 
        "#619CFF"))
      chrlen$chrom <- gsub(pattern = "chr", replacement = "", 
        x = chrlen$chrom)
      q <- merge(q, chrlen, by.x = "x", by.y = "chrom")
      q[["count"]] <- 1e+06 * (q[["count"]]/q[["size"]])
      if (!fill) {
        ggplot(q, aes(x, y = count, fill = fill)) + 
          geom_bar(stat = "identity") + scale_fill_discrete(name = "location", 
          labels = c("p", "pq", "q")) + labs(x = "Chromosome", 
          y = "Normalized count (per Mb)")
      }
      else {
        ggplot(q, aes(x, y = count, fill = fill)) + 
          geom_bar(stat = "identity", position = "fill") + 
          scale_fill_discrete(name = "location", labels = c("p", 
            "pq", "q")) + labs(x = "Chromosome", y = "Percentage")
      }
    }

I believe the chromosome number is lost when you create the object "q" :

q <- ggplot_build(p)$data[[1]][, c("x", "count", 
        "fill")]

After this, the "q$x" is treated as the chromosome number but it isn't. To obtain the chromosome number, you can take it from:

(ggplot_build(p)$plot[[1]][,"chromosome"]

I would propose the following correction, but there might be an easier one.

if (scale_chr) {
      chrlen <- get_genome_annotation(data_type = "chr_size", 
                                      genome_build = genome_build)
      p <- ggplot(data, aes(x = chromosome, fill = location)) + 
        geom_bar() + xlab("Chromosome")
      cr <- data.frame(unique(ggplot_build(p)$plot[[1]][,"chromosome"][order(ggplot_build(p)$plot[[1]][,"chromosome"])]))
      cr$x <- 1:nrow(cr) 
      q <- ggplot_build(p)$data[[1]][, c("x", "count", 
                                         "fill")]
      q$x <- as.integer(q$x)
      q <- dplyr::full_join(q, cr)
      q$chromosome <- factor(q$chromosome, levels = c(1:23), labels = c(1:22, 
                                                      "X"))
      q$fill <- factor(q$fill, levels = c("#F8766D", "#00BA38", 
                                          "#619CFF"))
      chrlen$chrom <- gsub(pattern = "chr", replacement = "", 
                           x = chrlen$chrom)
      q <- merge(q, chrlen, by.x = "chromosome", by.y = "chrom")
      q[["count"]] <- 1e+06 * (q[["count"]]/q[["size"]])
      if (!fill) {
        ggplot(q, aes(x, y = count, fill = fill)) + 
          geom_bar(stat = "identity") + scale_fill_discrete(name = "location", 
                                                            labels = c("p", "pq", "q")) + labs(x = "Chromosome", 
                                                                                               y = "Normalized count (per Mb)")
      }
      else {
        ggplot(q, aes(x, y = count, fill = fill)) + 
          geom_bar(stat = "identity", position = "fill") + 
          scale_fill_discrete(name = "location", labels = c("p", 
                                                            "pq", "q")) + labs(x = "Chromosome", y = "Percentage")
      }
    }

In this same function, I also found that when plotting through

ggplot(q, aes(x, y = count, fill = fill)) + 
          geom_bar(stat = "identity") + scale_fill_discrete(name = "location", 
                                                            labels = c("p", "pq", "q")) + labs(x = "Chromosome", 
                                                                                               y = "Normalized count (per Mb)")

The labels are static. In my data set there is no "pq", however, since the labels are always labels = c("p", "pq", "q") it will plot my barplot with labels "p" and "pq" instead of the real value that should be "p" and "q".

I am not sure how could I change that.

These problems disappear when you run the function without scale_chr = T.

ShixiangWang commented 2 years ago

@jrcodina96 Thanks for reporting, I will take a look

ShixiangWang commented 2 years ago

@jrcodina96 Please install the latest version from GitHub and retry.

The example code shows this bug shall be fixed.

library(sigminer)

load(system.file("extdata", "toy_copynumber.RData",
                 package = "sigminer", mustWork = TRUE
))
cn2 = cn
cn2@annotation = cn2@annotation[!endsWith(cn2@annotation$location, "pq")]
# Plot distribution
show_cn_distribution(cn2, mode = "cd", scale_chr = FALSE)

show_cn_distribution(cn2, mode = "cd")
show_cn_distribution(cn2, mode = "cd", fill = TRUE)

# It shall work for three types
show_cn_distribution(cn, mode = "cd")
show_cn_distribution(cn, mode = "cd", fill = TRUE)

# Also should work for just one type
cn3 = cn
cn3@annotation = cn3@annotation[endsWith(cn3@annotation$location, "p")]

show_cn_distribution(cn3, mode = "cd")
show_cn_distribution(cn3, mode = "cd", fill = TRUE)
jrcodina commented 2 years ago

This fixes the bug. I have tried your example and my dataset and works fine with both. Thank you!

ShixiangWang commented 2 years ago

@jrcodina96 Great, thanks to your report and contribution on this issue.