YuLab-SMU / ChIPseeker

:dart: ChIP peak Annotation, Comparison and Visualization
https://onlinelibrary.wiley.com/share/author/GYJGUBYCTRMYJFN2JFZZ?target=10.1002/cpz1.585
220 stars 74 forks source link

Adding function of plotting bioregion by binning #156

Closed MingLi-929 closed 2 years ago

MingLi-929 commented 3 years ago

This pull request is to add the function of plotting bioregion including genebody, TSS, 5UTR_SS. The basic idea is to scale the bioregions having different lengths to the equal length, the idea was derived from the function of deeptools. (https://deeptools.readthedocs.io/en/develop/content/tools/computeMatrix.html)

The way to reach the goal of scaling bioregion is to devide the region into the same amounts of boxes. Because the amount of boxes is equal, the bioregion can be thought of scaling to equal length.So the parameter box is set to reach this goal.

Four functions are created to reach this goal.

  1. getGeneBody() can get genebody region
    # three kinds of regions are supported:  "genes", "exon", "intron"
    getGeneBody <- function(TxDb=NULL,
                        type="genes")

2.getBioRegionMatrix() can calculate the bioregion matrix

# upstream and downstream parameter should be rel object , interger or default(NULL)
# rel object reflects extend the flank of genebody by relative number
# default(NULL) reflects genebody with no flank extension
# interger reflects extend the flank of genebody by absolute number
# TSS region is not supported to be extended by getGenebodyMatrix(), it can extend by getPromoters()

getBioRegionMatrix <- function(peak, 
                               weightCol = NULL, 
                               windows, 
                               box = 800,
                               min_body_length = 1000,
                               upstream = NULL,
                               downstream = NULL,
                               ...){

3.plotBioRegion() can plot with matrix derived by getGenebodyMatrix()

# upstream and downstream should be rel object, NULL or integer
# rel object reflects extend genebody by relative value
# default(NULL) reflects genebody with no flank extension
# integer reflects extend genebody by absolute value
# integer can also be the flank of start_region

plotBioRegion <- function(bioregionmatrix, 
                          xlab = "Scaled Genomic Region (5'->3')",
                          ylab = "Peak Count Frequency",
                          conf,
                          facet ="none", free_y = TRUE,
                          upstream = NULL,
                          downstream = NULL,
                          ...)
  1. plotBioRegion2() supports plotting region automatically
    
    # upstream and downstream should be rel object, NULL or integer
    # rel object reflects extend genebody by relative value
    # default(NULL) reflects genebody with no flank extension
    # integer reflects extend genebody by absolute value or start_region

plotBioRegion2 <- function(peak, weightCol = NULL, TxDb = NULL, body_type = NULL, start_region_by = NULL, xlab = "Scaled Genomic Region (5'->3')", ylab = "Peak Count Frequency", conf, facet = "none", free_y = TRUE, verbose = TRUE, box = 800, min_body_length = 1000, upstream = NULL, downstream = NULL, ...)

The usage and testing of the function lay below.

devtools::install_github("MingLi-929/ChIPseeker")

library(TxDb.Hsapiens.UCSC.hg19.knownGene) txdb <- TxDb.Hsapiens.UCSC.hg19.knownGene

files <- getSampleFiles() peak <- files[[4]]

get the windows of promoter

promoter <- getPromoters(TxDb=txdb, upstream=3000, downstream=3000)

get the windows of genebody

genebody <- getGeneBody(TxDb = txdb, type = "gene")

get the windows of the start site of exon

exon <- getBioRegion(TxDb = txdb, downstream = 3000, upstream = 3000, by = "exon")

Here is the plotting part

use plotAvgProf() to draw the TSS region

tagmatrix <- getTagMatrix(peak, weightCol = "V5", windows = promoter) plotAvgProf(tagmatrix, xlim = c(-3000,3000),conf = 0.95)

![plotavgprof_TSS](https://user-images.githubusercontent.com/78794151/134764286-a9250618-45d9-4356-a475-0accde454ac1.png)

plotBioRegion can also draw the TSS region

the upstream and downstream parameter should set NULL(default)

promoter_matrix <- getBioRegionMatrix(peak, weightCol = "V5", windows = promoter, box=800, min_body_length = 1000)

plotting TSS region should input upstream and downstream parameter

when plotting TSS region, actual number of flank should be input

plotBioRegion(promoter_matrix,conf = 0.95, upstream = 3000, downstream = 3000)

![plotgenebody_TSS](https://user-images.githubusercontent.com/78794151/134764344-ab035de4-138c-49c7-8b1f-e3c4a4cd0b8b.png)

can also plot the exon flank region

exon_matrix <- getBioRegionMatrix(peak, weightCol = "V5", windows = exon, box=800, min_body_length = 1000)

plotBioRegion(exon_matrix,conf = 0.95, upstream = 3000, downstream = 3000)

![exon](https://user-images.githubusercontent.com/78794151/135034265-3ddbd804-46c0-427b-97c0-c9d7898f61db.png)

use plotAvgProf2 to draw the TSS region

plotAvgProf2(peak, TxDb=txdb, upstream=3000, downstream=3000, conf = 0.95)

![plotavgprof2_TSS](https://user-images.githubusercontent.com/78794151/134764489-b7f71ad0-5f1d-461e-b98f-f04b4c69da10.png)

plotBioRegion2 to draw the TSS region

if plotting the TSS region,

upstream and downstream parameter must be set

plotBioRegion2(peak, start_region_by = "gene", conf = 0.95, TxDb = txdb, upstream = 3000, downstream = 3000)

![plotgenebody2](https://user-images.githubusercontent.com/78794151/134764604-9a8c6a8d-b331-473b-9b9b-d9aaf9f01f7b.png)

can also plot start site of 5UTR

plotBioRegion2(peak, start_region_by = "5UTR", conf = 0.95, TxDb = txdb, upstream = 3000, downstream = 3000)

![5UTR](https://user-images.githubusercontent.com/78794151/135034477-ed053c83-0c89-46b9-a98e-0e878da6063b.png)

use plotAvgProf2 to draw a list of the TSS region

plotAvgProf2(files,TxDb = txdb, upstream = 3000, downstream = 3000, conf = 0.95,facet = "row")

![plotavgprof2_list](https://user-images.githubusercontent.com/78794151/134764746-138b5643-04a5-4da1-b5c6-81b02d720f30.png)

use plotBioRegion2 to draw a list of the TSS region

the type should be chosen as "promoters",

and the upstream and downstream parameter must be set

plotBioRegion2(files, start_region_by = "gene", conf = 0.95,facet = "row", TxDb = txdb, upstream = 3000, downstream = 3000)

![plotgenebody2_list](https://user-images.githubusercontent.com/78794151/134764880-9d9f3354-8883-4bc4-a208-2f80d7f39ae7.png)

use plotBioRegion to draw the genebody

genebodymatrix <- getBioRegionMatrix(peak,weightCol = "V5", windows = genebody)

the upstream and downstream parameters are NULL(default)

a plot with no flank extension will be drawn

plotBioRegion(genebodymatrix, conf=0.95)

![noflankextension](https://user-images.githubusercontent.com/78794151/134764900-0bd37099-3488-44e8-af3e-2b981a115e40.png)

genebody matrix can be extended by upstream and downstream

the parameter should be rel object or acutal length

genebodymatrix_extend <- getBioRegionMatrix(peak,weightCol = "V5", windows = genebody, upstream = rel(0.2), downstream = rel(0.2))

plotBioRegion(genebodymatrix_extend, upstream = rel(0.2), downstream = rel(0.2), conf = 0.95)

![plotgenebody_extend](https://user-images.githubusercontent.com/78794151/134764920-a3e93ef9-dcd6-480c-a909-afb71ca1162e.png)

extend by absolute number

genebodymatrix_actual <- getBioRegionMatrix(peak,weightCol = "V5", windows = genebody, upstream = 1000, downstream = 1000)

plotBioRegion(genebodymatrix_actual, upstream = 1000, downstream = 1000, conf = 0.95)

![actual](https://user-images.githubusercontent.com/78794151/134943831-19854622-752c-484c-ba5d-8d2103e2b38e.png)

use plotBioRegion2 to draw genebody automatically

plotBioRegion2(peak, body_type = "genes", conf = 0.95, TxDb = txdb, upstream = rel(0.2), downstream = rel(0.2))

![plotgenebody2_extend](https://user-images.githubusercontent.com/78794151/134764937-fe2b0cd2-51f0-4b22-8c42-a53da83c4c7b.png)

draw a list of genebody with flank information

plotBioRegion2(files, body_type = "genes", conf = 0.95,facet = "row", TxDb = txdb, upstream = rel(0.2), downstream = rel(0.2))


![extend_list](https://user-images.githubusercontent.com/78794151/134764941-17e48210-2d2e-40e6-9798-2d30555b5b92.png)
GuangchuangYu commented 3 years ago

如果TSS上面还有序列,TES后面还有序列呢?有没有example?你不需要反复关闭PR,又开PR,可以不断commit新的到同一个PR中。

GuangchuangYu commented 3 years ago

像你最后一个图,看着感觉在TSS和TES的附近就是会有一个峰,那么延伸一下,再画个图,是肯定的。

GuangchuangYu commented 2 years ago

为什么我后面没有来看你这个PR,应该你后面commit了之后,没有回复,没有给我反馈。

  1. 等长写实际数字

image

像这个promoter的,是等长的,我希望等长的情况下,那些4分位置,写的是实际的距离,多少bp。

  1. 延长这个参数不需要,由upstream, downstream参数确定是否延长

image

如果upstream和downstream是数字,延长指定的长度,如果是画基因体的话,延长的部分又是等长的,参照第一点,延长部分的label,用实际数字。

upstream和downstream参数可以接收相对值,用ggplot2::rel()参入,你测试一个是rel对象就是相对值了,那么每一个基因,延长基因体相对应比值的长度。

GuangchuangYu commented 2 years ago

如果upstream和downstream是数字,延长指定的长度,如果是画基因体的话,延长的部分又是等长的,参照第一点,延长部分的label,用实际数字。

upstream和downstream参数可以接收相对值,用ggplot2::rel()参入,你测试一个是rel对象就是相对值了,那么每一个基因,延长基因体相对应比值的长度。

我最后的这两句话,你就没看。你的示例中只有upstream=0.2,我说的是upstream = rel(0.2),因为这样可以区分相对值和绝对值。

然后你的示例里就没有绝对值的,比如upstream=1000,那么上游就延伸1000bp,然后上游的x axis label,就写1000,按实际数字写。

GuangchuangYu commented 2 years ago

以及不要每次的commit message都是个update,要写有意义的commit msg.

GuangchuangYu commented 2 years ago
f <- function(n) {
    if(!is.numeric(n)) stop('n should be numeric value')
    if (inherits(n, 'rel')) {
        base <- 100 # ?
        return(as.numeric(n * base))
    }
    return(n)
}

ggplot2::rel() 就是一个数字,加了一个rel的类属性,用于区分是绝对值还是相对值。在ggplot2中各种size的参数,你可以用实际值指定大小,你也可以用rel指定相对大小(缩放倍数,由于默认值不是rel(1),所以你rel(2)不是当前看到的放大两倍)。

> f(rel(3))
[1] 300
> f(3 )
[1] 3
> is.numeric(rel(3))
[1] TRUE
> class(rel(3))
[1] "rel"
GuangchuangYu commented 2 years ago

-1000, 1000这些label,写-1000bp, 1000bp。

我觉得is_TSSregion这个参数是可以去掉的。

你在生成的tag matrix里,给它加个属性,这个tag matrix是什么,这个由windows参数传入的东西确定,所以是getBioRegion和getPromoters的输出加个属性,然后传入给getTagMatrix的时候,把这个属性转加到getTagMatrix的输出中,然后在画图的时候,就可以区分是什么区域画图,不仅仅像你这个is_TSSregion只区分是否TSS了,可以有不同的,比如是否是基因体,是否是可变剪切位点等等。

> x = matrix(rnorm(100), ncol=10)
> attr(x, 'type') = 'TSS'
> attr(x, 'type')
[1] "TSS"
> rowMeans(x)
 [1] -0.35381022 -0.40562894 -0.06152096  0.20289575  0.21946214  1.13116883
 [7] -0.54431412  0.20081212  0.28930633  0.13965037

你尝试一下用attr去加属性。

GuangchuangYu commented 2 years ago

更新一下vignette,把你之前做的功能都写进去。