YuLab-SMU / ChIPseeker

:dart: ChIP peak Annotation, Comparison and Visualization
https://onlinelibrary.wiley.com/share/author/GYJGUBYCTRMYJFN2JFZZ?target=10.1002/cpz1.585
223 stars 74 forks source link

plotAvgProf weightCol not plotted #15

Closed francycasa closed 8 years ago

francycasa commented 9 years ago

I am trying to plot the values of V5 in my data:

plotAvgProf2(peak, weightCol="V5", TxDb=txdb, upstream=3000, downstream=3000)

where "peak" is the GRange object that looks like this:

GRanges object with 100001 ranges and 4 metadata columns:

       seqnames               ranges strand   |          V4         V5

          <Rle>            <IRanges>  <Rle>   |    <factor>  <numeric>

   [1]     chr1     [ 10010,  10664]      *   |      Peak_1  4.4973750

   [2]     chr1     [713785, 714569]      *   |      Peak_2  5.4287500

   [3]     chr1     [762394, 763144]      *   |      Peak_3  3.3098750

   [4]     chr1     [815103, 817360]      *   |      Peak_4  7.0947500

But it defaults to always plotting the read frequencies count instead of the values on my V5 column. It seems to work with one peak only provided by the vignette but I can't see what the difference is between my peak and the peak in the Vignette... Is there maybe an additional specification I need to provide, such as ylim values? I think the problem is in the coordinates:

Error in plotAvgProf.internal(tagMatrix, xlim = xlim, xlab = xlab, ylab = ylab, :

please specify appropreate xcoordinations...

Calls: plotAvgProf -> plotAvgProf.internal

Also when I try the bootstrap I get this error:

"Error in boot(data = tagMatrix, statistic = getSgn, R = RESAMPLE_TIME, :

number of items to replace is not a multiple of replacement length"

Thank you for your help and great work!

Puriney commented 9 years ago

Thank you for using ChIPseeker.

Suggestions for Q1

Source

The function plotAvgProf2 is provide a one step from bed file to average profile plot.

As in our manual, it means that there is no need to feed plotAvgProf2 with GRange Object, but the BED file alone will be enough to generate profile plot. Pretty convenient :beers:

Suggestions for Q2

I assumed the failure of running bootstrap should be due to Q1.

francycasa commented 9 years ago

Hi and thank you for getting back to me. Being able to use bed files is great but it actually is worse in my particular case because I am adding the metacolumns V5 into the GRange object after importing into R... Anyway I have tried re-saving the bed files after with those additional columns and running plotAvgProf2 directly on the bed files but I still have the same problem that I get a plot of the read frequencies instead of the V5 values... I think there is a limit on the y values maybe in the function?

I am trying to understand whether it is the tagMatrix that does not recognise the weightCol "V5" in my data, but it looks like it has the correct form, so maybe it could be a limit that is set in the plotAvgProf.internal function, which gives the default of plotting the read frequency if some limits are not met? The "dd" data frame in the getTagCount function basically gives me the values of the read frequencies instead of the values of the weightCol, but I am not sure where this goes wrong... the tagMatrix from the getTagMatrix looks fine as well as the peak.cov, but then when it is passed to the plotAvgProf.internal it doesn't recognise the "value"...

Thank you very much for advice on how to solve this.

Puriney commented 9 years ago

The "dd" data frame in the getTagCount function basically gives me the values of the read frequencies instead of the values of the weightCol, but I am not sure where this goes wrong.

Thank you digging into our codes, and you are correct about dd variable. Attached is is the general pipeline of creating figures.

img_0009

tagMatrix that does not recognise the weightCol "V5" in my data

I don't think so, as @GuangchuangYu took it into account when using coverage.

Anyway I have tried re-saving the bed files.

Well. Wait a sec.


In sum, It seems to me that your V5 column had already represented the the tagMatrix of ChIPseeker.

francycasa commented 9 years ago

Thank you again for going through this with me. The V5 in my data is just a value that ranges from -20 to +20, I am not sure what you mean that it is already represented by a tagMatrix, can you please clarify? When I go through the functions it looks like the step of creating the TagCount outputs the frequency count instead of the value in my V5...

francycasa commented 9 years ago

Just to give a small example of the data and what I am doing. This is only 30 rows out of my data so the picture is expected to look odd, but I don't understand why the function wants to plot the Read Count Frequency and not the values on the V5 column, and same thing happens if I try with the V6 column...

df = structure(list(seqnames = structure(c(3L, 14L, 2L, 10L, 12L, 10L, 11L, 3L, 8L, 7L, 13L, 4L, 10L, 10L, 12L, 14L, 9L, 10L, 14L, 1L, 5L, 11L, 10L, 4L, 12L, 8L, 11L, 3L, 6L, 13L), .Label = c("chr1", "chr11", "chr12", "chr13", "chr14", "chr16", "chr18", "chr2", "chr21", "chr3", "chr4", "chr5", "chr6", "chr9"), class = "factor"), start = c(70455506L, 102221582L, 106400085L, 101414147L, 167474340L, 107595424L, 161734677L, 44432773L, 8782698L, 46041046L, 21168658L, 57022842L, 155593697L, 34921651L, 57794114L, 27246038L, 23021006L, 167966332L, 87983299L, 200978983L, 64319370L, 111431760L, 158430559L, 103686672L, 118240678L, 57391070L, 5859706L, 72454618L, 81623520L, 88661076L), end = c(70457187L, 102222835L, 106401476L, 101415742L, 167475181L, 107596088L, 161735528L, 44434836L, 8783184L, 46041591L, 21169245L, 57024326L, 155594174L, 34922565L, 57794591L, 27246685L, 23022067L, 167968318L, 87984464L, 200979989L, 64320622L, 111432684L, 158431595L, 103687429L, 118241512L, 57391747L, 5860370L, 72455468L, 81624282L, 88662230L), width = c(1682L, 1254L, 1392L, 1596L, 842L, 665L, 852L, 2064L, 487L, 546L, 588L, 1485L, 478L, 915L, 478L, 648L, 1062L, 1987L, 1166L, 1007L, 1253L, 925L, 1037L, 758L, 835L, 678L, 665L, 851L, 763L, 1155L), strand = structure(c(3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L), .Label = c("+", "-", "*"), class = "factor"), V4 = structure(c(6L, 3L, 4L, 17L, 28L, 18L, 25L, 5L, 13L, 12L, 29L, 8L, 19L, 16L, 26L, 1L, 15L, 21L, 2L, 23L, 10L, 24L, 20L, 9L, 27L, 14L, 22L, 7L, 11L, 30L), .Label = c("Peak_107511", "Peak_109152", "Peak_109573", "Peak_18281", "Peak_21155", "Peak_22306", "Peak_22404", "Peak_26475", "Peak_28815", "Peak_31979", "Peak_37945", "Peak_41479", "Peak_44216", "Peak_45858", "Peak_55810", "Peak_58971", "Peak_61362", "Peak_61714", "Peak_63686", "Peak_63857", "Peak_64426", "Peak_65741", "Peak_6791", "Peak_70880", "Peak_73699", "Peak_78677", "Peak_81700", "Peak_83750", "Peak_84787", "Peak_87836"), class = "factor"), V5 = c(4.263625, 1.022538125, 1.66370625, 2.4605, -0.054248125, 2.3216125, 4.5945, 4.185125, -2.9880625, -0.45283, -0.6316325, 5.7576875, 3.490875, 3.8881875, 1.44175625, 0.1322116875, 5.416875, -2.0110875, 2.0646875, 1.250445, -0.393173125, 4.3170625, 0.3485, 4.19975, 2.498125, 1.85575, 2.8766875, 3.54625, 1.63593125, 0.59190625), V6 = c(0.295133162035941, 0.657132424444457, 0.469487004390608, 0.25406324671887, 0.512338967601447, 0.574789772438585, 0.510191597996935, 0.556542046929071, 0.728823295342111, 1.32006373482495, 0.627160264169641, 0.496977091188987, 0.717202609681067, 0.438736856403319, 0.426296249054966, 0.332421581784681, 0.595902662633197, 0.539094687879597, 0.604903046639156, 0.706977938458243, 1.00395145416976, 0.280011301855241, 0.773947784629773, 0.379638687877478, 0.404439674117167, 0.604074553897668, 0.850288791627096, 0.386061394081304, 0.851925566168587, 0.711088568997093)), .Names = c("seqnames", "start", "end", "width", "strand", "V4", "V5", "V6"), row.names = c(22306L, 109573L, 18281L, 61362L, 83750L, 61714L, 73699L, 21155L, 44216L, 41479L, 84787L, 26475L, 63686L, 58971L, 78677L, 107511L, 55810L, 64426L, 109152L, 6791L, 31979L, 70880L, 63857L, 28815L, 81700L, 45858L, 65741L, 22404L, 37945L, 87836L), class = "data.frame")

library(GenomicRanges) peak=makeGRangesFromDataFrame(df, keep.extra.columns=T) plotAvgProf2(peak, weightCol="V5", upstream=3000, downstream=3000) plotAvgProf2(peak, weightCol="V6", upstream=3000, downstream=3000)

Error in if ((xlim[2] - xlim[1] + 1) != ncol(tagMatrix)) { : argument is of length zero

Is it possible to adjust the argument to this function to be able to specify these values?

Even if I save this peak as a bed file I still get the same error... df <- data.frame(seqnames=seqnames(peak), starts=start(peak)-1, ends=end(peak), V4=peak$V4, V5=peak$V5, V6=peak$V6)

write.table(df, file="test.bed", row.names = FALSE, quote = FALSE, col.names = FALSE, sep="\t") plotAvgProf2("test.bed", weightCol="V5", upstream=3000, downstream=3000)

Thank you for the help!

Puriney commented 9 years ago

Q: plotAvgProf2 fails

Even if I save this peak as a bed file I still get the same error... df <- data.frame(seqnames=seqnames(peak), starts=start(peak)-1, ends=end(peak), V4=peak$V4, V5=peak$V5, V6=peak$V6) plotAvgProf2("test.bed", weightCol="V5", upstream=3000, downstream=3000)

Notice the format of BED file, see here. So the correct way (let's take your Column-5 as example) is:

df <- data.frame(seqnames=seqnames(peak), starts=start(peak)-1, ends=end(peak), names=peak$V4, 
score=peak$V5, strand=strand(peak))

Feed our plotAvgProf2 with the correct BED file path (here let's say upstream and downstream as 100 for example).

write.table(df, file="./test.bed", row.names = FALSE, quote = FALSE, col.names = FALSE, sep="\t")
plotAvgProf2(peak = "./test.bed", TxDb = txdb, upstream = 100, downstream = 100, xlab = "position", 
ylab = "read count freq", weightCol = "V5"))

test1

The reason why you did not run it through is that txdb is not provided to our function.


Q: CI estimation did not work

Figure looks nothing on it. This is because your test BED files barely overlap with promoter region of interest.

require(TxDb.Hsapiens.UCSC.hg19.knownGene)
testpromoter <- getPromoters(txdb, upstream = 100, downstream = 100)
txdb <- TxDb.Hsapiens.UCSC.hg19.knownGene
files <- getSampleFiles()    # demo files provided in the package

peak <- readPeakFile('./test.bed')    # your file
# Notice `weightCol` is considered
testTagMatrix <- getTagMatrix(peak = peak, weightCol = "V5", windows = testpromoter)
dim(testTagMatrix) # outputs [2, 21] which means there are only 2 promoters overlap with your BED input

Therefore if you run plotAvgProf2 or (plotAvgProf) function with conf = 0.95 for example, you will see errors:

[1] "All values of t are equal to  0.00497512437810945 \n Cannot calculate confidence intervals"

Our codes technically works. But in some extreme case for example here, CI is supposed to fail as CI estimation is based on data itself. So I might add feature to pop up reminders to users :beers: .


To get familiar with how functions work, I suggest you run the following codes. It would be beyond ChIPseeker if math behind CI and biological purpose are discussed here.

# Number behind plots
defaultTagMatrix <- getTagMatrix(peak = files[[4]], weightCol = "V5", windows = testpromoter)
dim(defaultTagMatrix) # outputs [392, 21] which means 392 promoters
# Plots. without weightCol. 
plotAvgProf2(peak = files[[4]], TxDb = txdb, upstream = 100, downstream = 100, xlab = "position", ylab = 
"read count freq", conf = 0.9)

rplot

# Plots. with weightCol
plotAvgProf2(peak = files[[4]], TxDb = txdb, upstream = 100, downstream = 100, xlab = "position", 
ylab = "read count freq", conf = 0.9, weightCol = "V5")

rplot02


francycasa commented 9 years ago

Thank you Puriney for going through this with me!

Re. Q1: I think I am confused about what the function does then... It seems to plot the frequencies of the peaks overlapping the promoters, not the V5 column values... Is there any way to simply plot the actual levels in the "score" column in relation to TSS? I will show what I mean using a bigger sample size and the bed format you have kindly gone through with me:

df=structure(list(seqnames = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L), .Label = "chr1", class = "factor"), starts = c(4741055, 8949981, 1092831, 11040681, 2457306, 9648444, 7728467, 11538619, 825707, 8877303, 6684946, 6052024, 5716392, 3540719, 817676, 1837696, 9970130, 6453251, 1209080, 7843675, 11072211, 6672848, 6662675, 2136093, 1655652, 1623792, 3630819, 1475851, 1310476, 2343778, 822487, 2391146, 11518711, 2517668, 2383961, 10458295, 8086041, 10003272, 954568, 8908792, 713784, 3021960, 4716226, 7830900, 3773648, 1234129, 11322351, 5714966, 2064201, 1406897, 8585642, 3418443, 9599125, 894423, 9488752, 8013975, 10092549, 3135686, 6647274, 8938016, 2019250, 11332787, 6639693, 10848541, 10009, 3816711, 4760920, 2120810, 1840235, 2480031, 11040115, 1369705, 7764457, 11722550, 1167078, 976022, 1550495, 11582465, 762393, 8021255, 2885123, 1051260, 1362592, 7727590, 815102, 6844421, 2322765, 1370745, 9448239, 1447101, 10269901, 11039458, 1243207, 11707727, 1850548, 6761419, 1334563, 8763040, 6259186, 2158938), ends = c(4741865L, 8950435L, 1094326L, 11041264L, 2458333L, 9649189L, 7729045L, 11539333L, 826313L, 8878358L, 6685481L, 6053416L, 5717413L, 3541797L, 818112L, 1838033L, 9971231L, 6454274L, 1209848L, 7844787L, 11072962L, 6674064L, 6663792L, 2136987L, 1656073L, 1624840L, 3631372L, 1476173L, 1311158L, 2344343L, 823789L, 2391606L, 11519522L, 2518430L, 2384403L, 10459330L, 8086894L, 10003716L, 955721L, 8909325L, 714569L, 3022405L, 4716718L, 7831480L, 3774268L, 1234717L, 11323190L, 5715905L, 2065314L, 1407261L, 8586375L, 3418956L, 9599717L, 894938L, 9489495L, 8014591L, 10093416L, 3136262L, 6647681L, 8939484L, 2019678L, 11333665L, 6640260L, 10849062L, 10664L, 3818189L, 4761354L, 2121777L, 1840976L, 2480555L, 11040661L, 1370153L, 7765022L, 11722958L, 1167856L, 976547L, 1551555L, 11582995L, 763144L, 8022085L, 2885754L, 1052036L, 1362889L, 7728062L, 817360L, 6845642L, 2323414L, 1371235L, 9449217L, 1447669L, 10270868L, 11040044L, 1244083L, 11708769L, 1851272L, 6762290L, 1335162L, 8763851L, 6260108L, 2159562L), names = structure(c(49L, 78L, 5L, 93L, 36L, 83L, 65L, 98L, 68L, 75L, 61L, 53L, 52L, 43L, 46L, 22L, 84L, 55L, 7L, 69L, 94L, 60L, 59L, 29L, 21L, 20L, 44L, 18L, 10L, 32L, 57L, 34L, 97L, 38L, 33L, 88L, 72L, 85L, 90L, 76L, 13L, 40L, 48L, 67L, 45L, 8L, 95L, 51L, 27L, 16L, 73L, 42L, 82L, 79L, 81L, 70L, 86L, 41L, 58L, 77L, 26L, 96L, 56L, 89L, 1L, 47L, 50L, 28L, 23L, 37L, 92L, 14L, 66L, 3L, 6L, 2L, 19L, 99L, 24L, 71L, 39L, 4L, 12L, 64L, 35L, 63L, 31L, 15L, 80L, 17L, 87L, 91L, 9L, 100L, 25L, 62L, 11L, 74L, 54L, 30L), .Label = c("Peak_1", "Peak_10", "Peak_100", "Peak_11", "Peak_12", "Peak_13", "Peak_14", "Peak_15", "Peak_16", "Peak_17", "Peak_18", "Peak_19", "Peak_2", "Peak_20", "Peak_21", "Peak_22", "Peak_23", "Peak_24", "Peak_25", "Peak_26", "Peak_27", "Peak_28", "Peak_29", "Peak_3", "Peak_30", "Peak_31", "Peak_32", "Peak_33", "Peak_34", "Peak_35", "Peak_36", "Peak_37", "Peak_38", "Peak_39", "Peak_4", "Peak_40", "Peak_41", "Peak_42", "Peak_43", "Peak_44", "Peak_45", "Peak_46", "Peak_47", "Peak_48", "Peak_49", "Peak_5", "Peak_50", "Peak_51", "Peak_52", "Peak_53", "Peak_54", "Peak_55", "Peak_56", "Peak_57", "Peak_58", "Peak_59", "Peak_6", "Peak_60", "Peak_61", "Peak_62", "Peak_63", "Peak_64", "Peak_65", "Peak_66", "Peak_67", "Peak_68", "Peak_69", "Peak_7", "Peak_70", "Peak_71", "Peak_72", "Peak_73", "Peak_74", "Peak_75", "Peak_76", "Peak_77", "Peak_78", "Peak_79", "Peak_8", "Peak_80", "Peak_81", "Peak_82", "Peak_83", "Peak_84", "Peak_85", "Peak_86", "Peak_87", "Peak_88", "Peak_89", "Peak_9", "Peak_90", "Peak_91", "Peak_92", "Peak_93", "Peak_94", "Peak_95", "Peak_96", "Peak_97", "Peak_98", "Peak_99" ), class = "factor"), score = c(3.6533125, -3.45825, 2.16212625, 4.2144375, 5.0596875, 0.479530625, 2.00475, -4.8058125, -2.051, 5.2161875, 1.95978125, 5.3056875, 0.8137, 5.2, -0.1116425, 0.0250875, 4.7108125, 3.017125, 5.2853125, 2.78585, 3.5574375, 5.4186875, 3.4026875, 4.87, 2.83375, 4.631125, 2.3715375, -0.747491875, 7.2113125, 2.1438875, 3.361375, 6.47175, 3.415, 3.0204375, 2.4506, 3.2089375, 3.985, 5.2888125, 3.8953125, 1.28440625, 5.42875, -5.3949375, 3.835875, 3.1268125, 2.30361875, 5.2608125, 9.0909375, 2.9104375, 6.49075, 9.922375, 7.4885625, -1.006183125, 4.2020625, 2.5343125, 4.706625, 3.35775, 2.26715, -1.385648125, 5.730125, 3.8890625, 6.412125, 5.9983125, 6.2275, 2.25168125, 4.497375, 6.4285625, -0.7559575, 6.9259375, 4.232, 4.18984375, -1.2487155, -1.88895625, 6.276375, 3.036125, 7.174125, 2.952625, 4.9633125, -9.839125, 3.309875, 4.371375, 4.0488125, 2.2908125, 3.2808125, 1.4763625, 7.09475, 9.06825, 8.945125, 0.9220375, 6.505, 5.859625, 2.26174375, 7.26375, 4.1506875, 0.71187875, 4.8343125, 0.70271125, 5.749125, 1.9285625, 4.7614375, 8.9425625), strand = structure(c(3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L), .Label = c("+", "-", "*"), class = "factor")), .Names = c("seqnames", "starts", "ends", "names", "score", "strand"), row.names = c(52L, 79L, 12L, 92L, 40L, 83L, 67L, 97L, 7L, 76L, 63L, 56L, 55L, 47L, 5L, 28L, 84L, 58L, 14L, 70L, 93L, 62L, 61L, 34L, 27L, 26L, 48L, 24L, 17L, 37L, 6L, 39L, 96L, 42L, 38L, 88L, 73L, 85L, 9L, 77L, 2L, 44L, 51L, 69L, 49L, 15L, 94L, 54L, 32L, 22L, 74L, 46L, 82L, 8L, 81L, 71L, 86L, 45L, 60L, 78L, 31L, 95L, 59L, 89L, 1L, 50L, 53L, 33L, 29L, 41L, 91L, 20L, 68L, 100L, 13L, 10L, 25L, 98L, 3L, 72L, 43L, 11L, 19L, 66L, 4L, 65L, 36L, 21L, 80L, 23L, 87L, 90L, 16L, 99L, 30L, 64L, 18L, 75L, 57L, 35L), class = "data.frame")

Save this into a bed file

write.table(df, file="./test.bed", row.names = FALSE, quote = FALSE, col.names = FALSE, sep="\t")

Import txtb and define promoters

require(TxDb.Hsapiens.UCSC.hg19.knownGene) testpromoter <- getPromoters(txdb, upstream = 100, downstream = 100) txdb <- TxDb.Hsapiens.UCSC.hg19.knownGene

I am just simply trying to get a plot of the actual values in column 5, that range from -9 to 9 on the y-axis against distance from TSS on the x-axis. However both these arguments, with and without the weight column, seem to default back to giving the read count frequency...

plotAvgProf2(peak = "./test.bed", TxDb = txdb, upstream = 100, downstream = 100, xlab = "position", ylab = "values", weightCol = "V5") 

plotAvgProf2(peak = "./test.bed", TxDb = txdb, upstream = 100, downstream = 100, xlab = "position", ylab = "read count frquency") 

summary(df$V5) Min. 1st Qu. Median Mean 3rd Qu. Max. -9.839 2.229 3.745 3.446 5.293 9.922

defaultTagMatrix <- getTagMatrix(peak = "./test.bed", weightCol = "V5", windows = testpromoter)
dim(defaultTagMatrix) # outputs [59, 201].

Maybe I should just be using R to plot the numbers in the defaultTagMatrix against distance from TSS (could you please hint where I could find the distance from TSS for each peak so that I can plot simply the distribution of V5 against the TSS?) Thank you immensely!!

GuangchuangYu commented 9 years ago

Of course we use the 'V5' column if you specify weightCol="V5".

With @Puriney 's code, you can check the output of testTagMatrix:

> testTagMatrix[, 1:6]
           [,1]       [,2]       [,3]       [,4]       [,5]       [,6]
2589 -0.3931731 -0.3931731 -0.3931731 -0.3931731 -0.3931731 -0.3931731
8090 -2.0110875 -2.0110875 -2.0110875 -2.0110875 -2.0110875 -2.0110875

The row values -0.3931731, -2.0110875 are exactly the V5 values of the two peaks that overlap with the specific window (testpromoter)

peak$V5 [1] 4.26362500 1.02253813 1.66370625 2.46050000 -0.05424813 2.32161250 [7] 4.59450000 4.18512500 -2.98806250 -0.45283000 -0.63163250 5.75768750 [13] 3.49087500 3.88818750 1.44175625 0.13221169 5.41687500 -2.01108750 [19] 2.06468750 1.25044500 -0.39317313 4.31706250 0.34850000 4.19975000 [25] 2.49812500 1.85575000 2.87668750 3.54625000 1.63593125 0.59190625

francycasa commented 9 years ago

Thank you... shouldn't the figure then just be a plot of c(-0.3931731, -2.0110875) by the distance where the overlaps map? This was simply what I was looking for... Just a plot of those two values with regards to the position from the TSS, can I maybe get that distance from these objects?

I think my misunderstanding of the function comes down to the fact that when I increase the region for upstream and downstream I would think I should see more overlaps, but again I think I am misinterpreting the function and need something different. So for example:

##starting from the small df
df=structure(list(seqnames = structure(c(3L, 14L, 2L, 10L, 12L, 
10L, 11L, 3L, 8L, 7L, 13L, 4L, 10L, 10L, 12L, 14L, 9L, 10L, 14L, 
1L, 5L, 11L, 10L, 4L, 12L, 8L, 11L, 3L, 6L, 13L), .Label = c("chr1", 
"chr11", "chr12", "chr13", "chr14", "chr16", "chr18", "chr2", 
"chr21", "chr3", "chr4", "chr5", "chr6", "chr9"), class = "factor"), 
    start = c(70455506L, 102221582L, 106400085L, 101414147L, 
    167474340L, 107595424L, 161734677L, 44432773L, 8782698L, 
    46041046L, 21168658L, 57022842L, 155593697L, 34921651L, 57794114L, 
    27246038L, 23021006L, 167966332L, 87983299L, 200978983L, 
    64319370L, 111431760L, 158430559L, 103686672L, 118240678L, 
    57391070L, 5859706L, 72454618L, 81623520L, 88661076L), end = c(70457187L, 
    102222835L, 106401476L, 101415742L, 167475181L, 107596088L, 
    161735528L, 44434836L, 8783184L, 46041591L, 21169245L, 57024326L, 
    155594174L, 34922565L, 57794591L, 27246685L, 23022067L, 167968318L, 
    87984464L, 200979989L, 64320622L, 111432684L, 158431595L, 
    103687429L, 118241512L, 57391747L, 5860370L, 72455468L, 81624282L, 
    88662230L), names = structure(c(6L, 3L, 4L, 17L, 28L, 18L, 
    25L, 5L, 13L, 12L, 29L, 8L, 19L, 16L, 26L, 1L, 15L, 21L, 
    2L, 23L, 10L, 24L, 20L, 9L, 27L, 14L, 22L, 7L, 11L, 30L), .Label = c("Peak_107511", 
    "Peak_109152", "Peak_109573", "Peak_18281", "Peak_21155", 
    "Peak_22306", "Peak_22404", "Peak_26475", "Peak_28815", "Peak_31979", 
    "Peak_37945", "Peak_41479", "Peak_44216", "Peak_45858", "Peak_55810", 
    "Peak_58971", "Peak_61362", "Peak_61714", "Peak_63686", "Peak_63857", 
    "Peak_64426", "Peak_65741", "Peak_6791", "Peak_70880", "Peak_73699", 
    "Peak_78677", "Peak_81700", "Peak_83750", "Peak_84787", "Peak_87836"
    ), class = "factor"), scores = c(4.263625, 1.022538125, 1.66370625, 
    2.4605, -0.054248125, 2.3216125, 4.5945, 4.185125, -2.9880625, 
    -0.45283, -0.6316325, 5.7576875, 3.490875, 3.8881875, 1.44175625, 
    0.1322116875, 5.416875, -2.0110875, 2.0646875, 1.250445, 
    -0.393173125, 4.3170625, 0.3485, 4.19975, 2.498125, 1.85575, 
    2.8766875, 3.54625, 1.63593125, 0.59190625)), .Names = c("seqnames", 
"start", "end", "names", "scores"), class = "data.frame", row.names = c(22306L, 
109573L, 18281L, 61362L, 83750L, 61714L, 73699L, 21155L, 44216L, 
41479L, 84787L, 26475L, 63686L, 58971L, 78677L, 107511L, 55810L, 
64426L, 109152L, 6791L, 31979L, 70880L, 63857L, 28815L, 81700L, 
45858L, 65741L, 22404L, 37945L, 87836L))

write.table(df, file="./test.bed", row.names = FALSE, quote = FALSE, col.names = FALSE, sep="\t")

##This is what you are getting within +/-100 bases from the TSS
plotAvgProf2(peak ="./test.bed", TxDb=txdb, xlab = "Genomic Region (5'->3')", ylab = "Score", upstream=100, downstream=100, weightCol="V5")

But then increasing the upstream/downstream I would still like to see the V5 values plotted on the y-axis, but why do I get these different ranges now on the y-axis? This is what I get when I increase the window (which is why I thought the function was plotting the read frequencies):

plotAvgProf2(peak ="./test.bed", TxDb=txdb, xlab = "Genomic Region (5'->3')", ylab = "Score", upstream=3000, downstream=3000, weightCol="V5")

I think the function is getting a frequency here: ss <- colSums(testTagMatrix) ss <- ss/sum(ss)

While what I would like is just the average value of V5 when there is an overlap...

I could get just simply the distances from TSS for those 2 overlaps from the stored objects (something similar to "distanceToTSS" with the value of V5)? It would really help me understand, so I can then plot the distance from TSS against the V5, which I can get -- as you kindly explained -- with this: testTagMatrix[,1]

I think it is possible to get these from (example for the first peak file):

peakAnno <- as.data.frame(peakAnnoList[[1]])

And then plotting these values just to show the different distributions:

plot(peakAnno$distanceToTSS, peakAnno$score)

But if there is a simpler and prettier way to plot all the distributions together as you have your beautiful plots in the package, and using List functions that I am not seeing please let me know... Thank you again so much!!

GuangchuangYu commented 8 years ago

I think you don't understand plotAvgProf function. It's designed to visualize how peaks binding to promoter regions (now with getBioRegion, it can be intron/exon start region).

It consider peak intensity if weightCol is specify, and consider the width of peaks.

I think the function is getting a frequency here: ss <- colSums(testTagMatrix) ss <- ss/sum(ss)

We have very clearly explain to you that testTagMatrix do use weightCol, see also in the source code.

This directly answer the issue your reported.

You expect we output a vector/column of raw weightCol values and visualize those values directly, but this is not the way we implemented in plotAvgProf(). Please do read the vignette and don't get us wrong.

If you want to request a new feature, start a new issue and please explain your idea clearly. At least let us understand what you trying to express (not in the way in this post which is obscured). If @Puriney and I think it is a good idea, we will be glad to help.

xpan1 commented 5 years ago

Hi, What are the weightCol values? What does that mean? What the values are, are they read count in the peak region, or fold enrichment, or P-value? I am a little bit confused about that.

More specifically, my question would be what is the input for the peakHeatmap and plotAvgProf2? I know the inputs are the bed files from macs, but the problem is which column from the bed file? The chr, peak start, peak end, peak name, and what else, I don't know what values I should use for the fifth column?

Puriney commented 5 years ago

What the value means depends on the users' input and analysis goal. It could be any value exactly as you mentioned here. A common choice is fold enrichment given by MACS2.

If weightCol is not used in plotAvgProf, it simply counts how many ChIP-seq peak regions (reported by MACS2) around the TSS. If it is a TF, it is not surprising to observe the TSS has a peak. In this case, it assumes all ChIP-seq peak regions are the same but they are probably not; this is what fold-enrichment can help and is handled by the weightCol.

xpan1 commented 5 years ago

So you mean for plotAvgProf2 results, the y-axis indicate read count frequency, so the read count frequency means how many ChIP-seq peak regions (does that mean how many peak numbers rather than the read counts in the peak region) around TSS ? Thank you very much..

hyjforesight commented 2 years ago

hello @Puriney @GuangchuangYu @MingLi-929 Appreciate it if you could tell me what exactly weightcol='V5' is doing here? Is it using a weighting algorithm to balance the values of column 'V5'?不好意思,不是很明白weightcol=V5的作用,请问是不是对'V5'列的值做了加权?谢谢!

MingLi-929 commented 2 years ago

Strictly, the weightcol weight the each base in the peak area instead of weighing the 'V5' column. In a word, you can interpret the weightcol='V5' as assigning value in the V5 column to each base in the peak area.

library(ChIPseeker)
library(GenomicRanges)

files <- ChIPseeker::getSampleFiles()

peak <- files[[4]]
txdb <- TxDb.Hsapiens.UCSC.hg19.knownGene::TxDb.Hsapiens.UCSC.hg19.knownGene
peak_gr <- readPeakFile(peak)

matrix <- ChIPseeker::getTagMatrix(peak = peak_gr[1], 
                                   TxDb = txdb,
                                   upstream = 3000,
                                   downstream = 3000, 
                                   type = "start_site",
                                   by = "gene",
                                   weightCol = "V5")

matrix_without_WC <- ChIPseeker::getTagMatrix(peak = peak_gr[1], 
                                              TxDb = txdb,
                                              upstream = 3000,
                                              downstream = 3000, 
                                              type = "start_site",
                                              by = "gene")

> peak_gr[1]
GRanges object with 1 range and 2 metadata columns:
      seqnames        ranges strand |          V4        V5
         <Rle>     <IRanges>  <Rle> | <character> <numeric>
  [1]     chr1 815093-817883      * | MACS_peak_1    295.76
  -------
  seqinfo: 24 sequences from an unspecified genome; no seqlengths

with weight columns we can see this

> matrix
   [1] 295.76 295.76 295.76 295.76 295.76 295.76 295.76
   [8] 295.76 295.76 295.76 295.76 295.76 295.76 295.76
..........

without weightcol

> matrix_without_WC
   [1] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
  [28] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
.......
hyjforesight commented 2 years ago

hello @MingLi-929 Thanks for the response.不知道我这样理解对不对,weight指的是峰的”重量weight“,而不是”加权算法weight“, weightcol将peak的值赋给了每一个peak。 Because different settings of arguments weightcol, nbin, and 'conf' make different plots, which setting is the best practice for plotting the TSS heatmap, TSS peak map, and gene body peak map, respectively? For TSS heatmap, which is better, weightcol=NULL or weightcol='V5'? image For TSS peak map, which is better, weightcol=NULL or weightcol='V5'? image For gene body peak heatmap, which is better, weightcol=NULL or weightcol='V5'? image

For MACS2, we can get the qValue, pValue and signalValue of peaks, shall we do `weightcol=signalValue'? Thanks! Best, YJ

MingLi-929 commented 2 years ago

weight指的是峰的”重量weight“,而不是”加权算法weight“, weightcol将peak的值赋给了每一个peak。

well, if i do not mistake your interpretation, i agree with u personally. weightcol assigns values to every peak.

Because different settings of arguments weightcol, nbin, and 'conf' make different plots, which setting is the best practice for plotting the TSS heatmap, TSS peak map, and gene body peak map, respectively? For TSS heatmap, which is better, weightcol=NULL or weightcol='V5'? For TSS peak map, which is better, weightcol=NULL or weightcol='V5'? For gene body peak heatmap, which is better, weightcol=NULL or weightcol='V5'?

I am sorry that i can not give u a fixed answer to decide which one is better. Using the parameter or not depended on your personal need and the need of research. But from my personal view, if you can get some meaningful value, like the signalValue, it will be great to use it.

hyjforesight commented 2 years ago

Thanks @MingLi-929 for the response!