YuLab-SMU / ChIPseeker

:dart: ChIP peak Annotation, Comparison and Visualization
https://onlinelibrary.wiley.com/share/author/GYJGUBYCTRMYJFN2JFZZ?target=10.1002/cpz1.585
219 stars 74 forks source link

change the method of plotting heatmap #199

Closed MingLi-929 closed 1 year ago

MingLi-929 commented 1 year ago

This pr change the way of plotting heatmap from base system to ggplot system. It adds the function of gradient color by giving palette parameter, which is the palette of gradient color.

There are two ways to plot the heatmap. The first is giving a created matrix to tagHeatmap() fucntion.

library(TxDb.Hsapiens.UCSC.hg19.knownGene)
files <- getSampleFiles()
txdb <- TxDb.Hsapiens.UCSC.hg19.knownGene
mt <- getTagMatrix(peak = files[[5]],TxDb = txdb,
                   upstream = 1000,downstream = 1000,
                   type = "start_site",by = "gene",
                   weightCol = "V5")

tagHeatmap(mt)

test1

The second is to use peakHeatmap() to do all things in one steps.

peakHeatmap(files[[4]],TxDb = txdb)

test2

If users wanna to speed up the process, a binning method can be used by specifing the nbin.

peakHeatmap(files[[4]],TxDb = txdb, nbin = 800)

test3

Since this method use ggplot system, users can use ggplot method to change

peakHeatmap(files[[4]],TxDb = txdb, nbin = 800) +
    scale_fill_distiller(palette = "RdYlGn")

test4 a list of peaks can also be plotted.

peakHeatmap(files,TxDb = txdb)

test5

MingLi-929 commented 1 year ago

after updating, the picture looks like test1

MingLi-929 commented 1 year ago

after the update, the picture for binning method will be like

peakHeatmap(files[[4]],TxDb = txdb,nbin = 700)

test

Treywea commented 1 year ago

Liming,

  1. Update the function to plot the heatmap around the gene body regions (like the scale-regions mode in deeptools).
  2. In line plots, it is easy to visualize the profiles of different gene sets.

image

Can we plot heatmaps with different gene sets? like the plots with genesX and gene19 in deeptools).

Treywea commented 1 year ago

In ArchR, the co-accessibility function based on Cicero is deployed. What advanced function you can provide if you want to add this function in ChIPseeker?

Also, they can do trajectory analysis and provide pseudo-time heatmaps. I think our heatmap functions could also support the plots with their results while they have provided the labels for selected regions/genes. Consider adding the labels in our heatmap functions.

MingLi-929 commented 1 year ago

Update the function to plot the heatmap around the gene body regions


library(TxDb.Hsapiens.UCSC.hg19.knownGene)
txdb <- TxDb.Hsapiens.UCSC.hg19.knownGene

plot body region

peakHeatmap(peak = files[[4]], TxDb = txdb, upstream = NULL, downstream = NULL, by = "gene", type = "body", nbin = 800)

![test2](https://user-images.githubusercontent.com/78794151/194057060-e8342095-aa14-49cf-8975-eb0e0b6c721b.png)

extend body region with rel object

peakHeatmap(peak = files[[4]], TxDb = txdb, upstream = rel(0.2), downstream = rel(0.2), by = "gene", type = "body", nbin = 800)

![test](https://user-images.githubusercontent.com/78794151/194057188-980128c9-b5a8-447c-a55d-0f4eb27e8dde.png)

extend body region with actual number

peakHeatmap(peak = files[[4]], TxDb = txdb, upstream = 1000, downstream = 1000, by = "gene", type = "body", nbin = 800)


![test1](https://user-images.githubusercontent.com/78794151/194057471-9ebd6f90-6c59-4c4e-8714-bfe82bcb057a.png)
MingLi-929 commented 1 year ago

Can we plot heatmaps with different gene sets? like the plots with genesX and gene19 in deeptools).

txdb1 <- transcripts(TxDb.Hsapiens.UCSC.hg19.knownGene)
txdb2 <- unlist(fiveUTRsByTranscript(TxDb.Hsapiens.UCSC.hg19.knownGene))

txdb <- list(gene19 = txdb1, geneX = txdb2)
peakHeatmap_multiple_Sets(peak = files[[4]],
                          upstream = 1000,downstream = 1000,
                          by = c("gene19","geneX"),
                          type = "start_site",
                          TxDb = txdb,nbin = 800)

t2

sample <- list(files[[4]],files[[5]])
peakHeatmap_multiple_Sets(peak = files[4:5],
                          upstream = 1000,
                          downstream = 1000,
                          by = c("gene19","geneX"),type = "start_site",
                          TxDb = txdb,nbin = 800)

t3

the proportion of picture can be adjusted automactically

txdb1 <- transcripts(TxDb.Hsapiens.UCSC.hg19.knownGene)
txdb2 <- unlist(fiveUTRsByTranscript(TxDb.Hsapiens.UCSC.hg19.knownGene))[1:10000,]

txdb <- list(gene19 = txdb1, geneX = txdb2)
peakHeatmap_multiple_Sets(peak = files[[4]],
                          upstream = 1000,downstream = 1000,
                          by = c("gene19","geneX"),
                          type = "start_site",
                          TxDb = txdb,nbin = 800)

t4

MingLi-929 commented 1 year ago

Can we plot heatmaps with different gene sets? like the plots with genesX and gene19 in deeptools).

peak_Profile_Heatmap function is also created to meet the need of ploting profiling plot and heatmap together, just like deeptools output.

peak_Profile_Heatmap(peak = files[[4]],
                     upstream = 1000,
                     downstream = 1000,
                     by = "gene",
                     type = "start_site",
                     TxDb = txdb3,
                     nbin = 800)

t7

MingLi-929 commented 1 year ago

caplot class is created to debug output of aplot. The only thing it do is to change the align method.

Before changing, t5(1) the coordinate of heatmap does not match the profiling picture,and the coordinate of heatmap have been changed

after changing the align method t7

MingLi-929 commented 1 year ago

for the need of patch picture i change from creating a new class to overwriting the aplot method

MingLi-929 commented 1 year ago

after overwriting the aplot method,

peak_Profile_Heatmap(peak = files[[4]],
                     upstream = 1000,
                     downstream = 1000,
                     by = c("gene19","geneX"),
                     type = "start_site",
                     TxDb = txdb,nbin = 800)

t5

peak_Profile_Heatmap(peak = files[4:5],
                     upstream = 1000,
                     downstream = 1000,
                     by = c("gene19","geneX"),
                     type = "start_site",
                     TxDb = txdb,nbin = 800)

t6

MingLi-929 commented 1 year ago

Also fix issue #200 ,export getAnnoStat

MingLi-929 commented 1 year ago

deal with problem in check() 2

MingLi-929 commented 1 year ago

add columns to annotatePeak() to fix #193 The columns can help to get the specific columns from database. From the bug reported from #193, users can not get ENSEMBL even if pass in database

library(TxDb.Hsapiens.UCSC.hg19.knownGene)
library(EnsDb.Hsapiens.v86)

files <- getSampleFiles()
txdb <- TxDb.Hsapiens.UCSC.hg19.knownGene

Anno <-  (files[[4]], tssRegion=c(-3000, 3000), 
                     TxDb=TxDb.Hsapiens.UCSC.hg19.knownGene, 
                     annoDb="EnsDb.Hsapiens.v86")

> Anno@anno
GRanges object with 1331 ranges and 13 metadata columns:
         seqnames              ranges strand |             V4        V5
            <Rle>           <IRanges>  <Rle> |    <character> <numeric>
     [1]     chr1       815093-817883      * |    MACS_peak_1    295.76
     [2]     chr1     1243288-1244338      * |    MACS_peak_2     63.19
     [3]     chr1     2979977-2981228      * |    MACS_peak_3    100.16
     [4]     chr1     3566182-3567876      * |    MACS_peak_4    558.89
     [5]     chr1     3816546-3818111      * |    MACS_peak_5     57.57
     ...      ...                 ...    ... .            ...       ...
  [1327]     chrX 135244783-135245821      * | MACS_peak_1327     55.54
  [1328]     chrX 139171964-139173506      * | MACS_peak_1328    270.19
  [1329]     chrX 139583954-139586126      * | MACS_peak_1329    918.73
  [1330]     chrX 139592002-139593238      * | MACS_peak_1330    210.88
  [1331]     chrY   13845134-13845777      * | MACS_peak_1331     58.39
                     annotation   geneChr geneStart   geneEnd geneLength
                    <character> <integer> <integer> <integer>  <integer>
     [1]       Promoter (2-3kb)         1    803451    812182       8732
     [2]       Promoter (<=1kb)         1   1243994   1247057       3064
     [3]       Promoter (<=1kb)         1   2976181   2980350       4170
     [4]       Promoter (<=1kb)         1   3547331   3566671      19341
     [5]       Promoter (<=1kb)         1   3816968   3832011      15044
     ...                    ...       ...       ...       ...        ...
  [1327] Intron (uc010nrz.2/2..        23 135251455 135293518      42064
  [1328]       Promoter (<=1kb)        23 139173826 139175070       1245
  [1329]       Promoter (1-2kb)        23 139585152 139587225       2074
  [1330]      Distal Intergenic        23 139585152 139587225       2074
  [1331]      Distal Intergenic        24  14517915  14533389      15475
         geneStrand      geneId transcriptId distanceToTSS      SYMBOL    GENENAME
          <integer> <character>  <character>     <numeric> <character> <character>
     [1]          2      284593   uc001abt.4         -2911      FAM41C      FAM41C
     [2]          1      126789   uc001aed.3             0       PUSL1       PUSL1
     [3]          2      440556   uc001aka.3             0   LINC00982   LINC00982
     [4]          2       49856   uc001ako.3             0      WRAP73      WRAP73
     [5]          1   100133612   uc001alg.3             0   LINC01134   LINC01134
     ...        ...         ...          ...           ...         ...         ...
  [1327]          1        2273   uc004ezn.2         -5634        FHL1        FHL1
  [1328]          1      389895   uc031tkm.1          -320        <NA>        <NA>
  [1329]          2        6658   uc004fbd.1          1099        SOX3        SOX3
  [1330]          2        6658   uc004fbd.1         -4777        SOX3        SOX3
  [1331]          2      352887   uc022cji.1        687612        <NA>        <NA>
  -------
  seqinfo: 24 sequences from hg19 genome

after adding columns, users can get the specific columns from database.

Anno1 <- annotatePeak(files[[4]], tssRegion=c(-3000, 3000), 
                      TxDb=TxDb.Hsapiens.UCSC.hg19.knownGene, 
                      annoDb="EnsDb.Hsapiens.v86",
                      columns=c("ENTREZID", "GENEID", "SYMBOL", "GENENAME"))

> Anno1@anno
GRanges object with 1331 ranges and 14 metadata columns:
         seqnames              ranges strand |             V4        V5
            <Rle>           <IRanges>  <Rle> |    <character> <numeric>
     [1]     chr1       815093-817883      * |    MACS_peak_1    295.76
     [2]     chr1     1243288-1244338      * |    MACS_peak_2     63.19
     [3]     chr1     2979977-2981228      * |    MACS_peak_3    100.16
     [4]     chr1     3566182-3567876      * |    MACS_peak_4    558.89
     [5]     chr1     3816546-3818111      * |    MACS_peak_5     57.57
     ...      ...                 ...    ... .            ...       ...
  [1327]     chrX 135244783-135245821      * | MACS_peak_1327     55.54
  [1328]     chrX 139171964-139173506      * | MACS_peak_1328    270.19
  [1329]     chrX 139583954-139586126      * | MACS_peak_1329    918.73
  [1330]     chrX 139592002-139593238      * | MACS_peak_1330    210.88
  [1331]     chrY   13845134-13845777      * | MACS_peak_1331     58.39
                     annotation   geneChr geneStart   geneEnd geneLength
                    <character> <integer> <integer> <integer>  <integer>
     [1]       Promoter (2-3kb)         1    803451    812182       8732
     [2]       Promoter (<=1kb)         1   1243994   1247057       3064
     [3]       Promoter (<=1kb)         1   2976181   2980350       4170
     [4]       Promoter (<=1kb)         1   3547331   3566671      19341
     [5]       Promoter (<=1kb)         1   3816968   3832011      15044
     ...                    ...       ...       ...       ...        ...
  [1327] Intron (uc010nrz.2/2..        23 135251455 135293518      42064
  [1328]       Promoter (<=1kb)        23 139173826 139175070       1245
  [1329]       Promoter (1-2kb)        23 139585152 139587225       2074
  [1330]      Distal Intergenic        23 139585152 139587225       2074
  [1331]      Distal Intergenic        24  14517915  14533389      15475
         geneStrand      geneId transcriptId distanceToTSS          GENEID
          <integer> <character>  <character>     <numeric>     <character>
     [1]          2      284593   uc001abt.4         -2911 ENSG00000230368
     [2]          1      126789   uc001aed.3             0 ENSG00000169972
     [3]          2      440556   uc001aka.3             0 ENSG00000177133
     [4]          2       49856   uc001ako.3             0 ENSG00000116213
     [5]          1   100133612   uc001alg.3             0 ENSG00000236423
     ...        ...         ...          ...           ...             ...
  [1327]          1        2273   uc004ezn.2         -5634 ENSG00000022267
  [1328]          1      389895   uc031tkm.1          -320            <NA>
  [1329]          2        6658   uc004fbd.1          1099 ENSG00000134595
  [1330]          2        6658   uc004fbd.1         -4777 ENSG00000134595
  [1331]          2      352887   uc022cji.1        687612            <NA>
              SYMBOL    GENENAME
         <character> <character>
     [1]      FAM41C      FAM41C
     [2]       PUSL1       PUSL1
     [3]   LINC00982   LINC00982
     [4]      WRAP73      WRAP73
     [5]   LINC01134   LINC01134
     ...         ...         ...
  [1327]        FHL1        FHL1
  [1328]        <NA>        <NA>
  [1329]        SOX3        SOX3
  [1330]        SOX3        SOX3
  [1331]        <NA>        <NA>
  -------
  seqinfo: 24 sequences from hg19 genome

here we can see the ENSEMBL id

GuangchuangYu commented 1 year ago

The figures are pretty weird at first glance. Can you provide a before-and-after comparison? We must confirm that the new implementation produces an identical image as the old one did.

MingLi-929 commented 1 year ago

this is the comparison of the heatmap of old version and new version 捕获

GuangchuangYu commented 1 year ago

https://github.com/YuLab-SMU/ChIPseeker/pull/199#issuecomment-1272295000, quite close, but not identical as expected.

MingLi-929 commented 1 year ago

the aplot.R is removed, through aplot v=0.1.8

and disorder of the peak in heatmap is corrected 1

MingLi-929 commented 1 year ago

this commit fix bug #201, after updating,

library(TxDb.Hsapiens.UCSC.hg19.knownGene)

files <- getSampleFiles()
txdb <- TxDb.Hsapiens.UCSC.hg19.knownGene

peakAnno <- annotatePeak(files[[4]], 
                         tssRegion=c(-3000, 3000),
                         TxDb=txdb, annoDb="org.Hs.eg.db")
plotAnnoPie(peakAnno)

1

nrow <- nrow(peakAnno@annoStat)

col <- c("#8dd3c7", "#ffffb3", "#bebada",
         "#fb8072", "#80b1d3", "#fdb462",
         "#b3de69", "#fccde5", "#d9d9d9",
         "#bc80bd", "#ccebc5", "#ffed6f")

cols <- col[1:nrow]

plotAnnoPie(peakAnno,col = col[1:nrow])

2