Closed mgperry closed 7 years ago
Hi @mgperry
Thanks for submitting your package. We are taking a quick look at it and you will hear back from us soon.
The DESCRIPTION file for this package is:
Package: heatmaps
Title: Flexible Heatmaps for Functional Genomics and Sequence Features
Version: 1.2.0
Date: 18-01-2016
Author: Malcolm Perry <mgperry32@gmail.com>
Maintainer: Malcolm Perry <mgperry32@gmail.com>
Depends:
R (>= 3.4)
Imports:
Biostrings,
GenomicRanges,
IRanges,
KernSmooth,
plotrix,
methods,
Matrix,
genomation,
EBImage,
spatstat,
RColorBrewer,
BiocGenerics,
GenomeInfoDb
Suggests:
BSgenome.Drerio.UCSC.danRer7,
knitr
biocViews: Visualization, SequenceMatching
License: Artistic-2.0
Description: This package provides functions for plotting heatmaps of genome-wide
data across genomic intervals, such as ChIP-seq signals at peaks or across promoters.
Many functions are also provided for investigating sequence features.
biocViews: Visualization, SequenceMatching
Collate:
Heatmap-class.R
PlotHeatmap.R
PlotHeatmapList.R
PlotPatternDensityMap.R
PWMScanHeatmap.R
PatternHeatmap.R
CoverageHeatmap.R
smooth.R
plotHeatmapMeta.R
Data.R
NeedsCompilation: no
RoxygenNote: 5.0.1
@mgperry if this is a new package then its version should be 0.99.0. I guess you have an existing version of your package somewhere. Can you arrange for your version number x.y.z to have an odd 'y', because it will be added to the 'devel' branch and in the devel branch y is odd. z should be larger than any existing z for the odd numbered y. The version number will be bumped to x.y+1.0 in the next release.
Checked out the first file.
windowViews()
could be simplified to cog[gr]
although I don't think that reverses elements with negative strand. Maybe it should, @hpages? A workaround for now would be using revElements()
instead of the current loop. To convert a List to a matrix, just use as.matrix()
instead of the current loop/rbind construct.{()
in generic definitions. Just do function(x) standardGeneric("foo")
. Using {()
, the generic becomes a non-standard generic that just delegates to the standard one.CoverageHeatmap()
methods emit error messages referring to DNAStringSet, which does not seem relevant.track
argument to come first, before the regions of interest, since it is the coverage that we are plotting.ScoreMatrixBin()
and following matrix coercion could be replaced with simple code like:tiles <- tile(windows, nbin)
matrix(mean(track[unlist(tiles)]), ncol=nbin, byrow=TRUE)
Thanks Michael, that's all very helpful. I'll look into the generics, I have to say I'm not familiar enough with S4 to know the difference. You're definitely right about the error messages. I'm not sure I share your intuition about argument order, and I think what there is currently aligns better with the PatternHeatmap constructor.
I'm not terribly sure how cov[gr] works - I think last time the subject came up on the devel list, a similar idiom to what I'm using was proposed but it's obviously been superseded (I've always thought it odd that such a convoluted step was necessary). I'll test to see if it does reverse ranges, but either way it will be quicker to invert the rows of the resulting matrix afterwards. I also didn't know about tile(), I would like to eliminate the dependency if possible.
Your package has been approved for building. Your package is now submitted to our queue.
IMPORTANT: Please read the instructions for setting up a push hook on your repository, or further changes to your repository will NOT trigger a new build.
The vignette needs some work. It should be demonstrating the software with a real-world use case. Instead, it directly dives into the low-level details of the Heatmap class, and then more or less lists the types of plots it can make, only occasionally using real data, outside of a cohesive workflow. It's fine to summarize the capabilities in the introduction, and dive into details at the end, but the vignette needs to be better motivated. Also, please do not use str(x)
to describe the structure of a class. The show()
method should already make the user-level details obvious.
@lawremi I've implemented those changes in mgperry@heatmaps/4f60e670 and mgperry@heatmaps/d68d6c.
cog[gr]
does in fact ignore strand, but revElements
works fine. I don't think as.matrix
works directly, it produces a matrix of Rle
s (which must surely be automatic - I doubt it's intended), but the code in the tiles
example works fine, swapping mean
for unlist(use.name=F)
.
@mtmorgan The package hasn't been uploaded elsewhere, I've set the version back to 0.99.0
@lawremi Entirely agree with your comments on the vignette, the submission deadline took me by suprise a bit and I focused on documentation and the code itself, I should have a much better version soon.
We only start builds when the Version
field in the DESCRIPTION
file is incremented. For example, by changing
Version: 0.99.0
to
Version 0.99.1
If you did not intend to start a build, you don't need to
do anything. If you did want to start a build, increment
the Version:
field and try again.
I fixed the matrix coercion so that it will decode the Rle. IRanges 2.7.16.
Dear Package contributor,
This is the automated single package builder at bioconductor.org.
Your package has been built on Linux, Mac, and Windows.
On one or more platforms, the build results were: "ERROR". This may mean there is a problem with the package that you need to fix. Or it may mean that there is a problem with the build system itself.
Please see the following build report for more details:
http://bioconductor.org/spb_reports/heatmaps_buildreport_20160930123334.html
Received a valid push; starting a build. Commits are:
54354e9 add data for examples (previously excluded by giti...
Dear Package contributor,
This is the automated single package builder at bioconductor.org.
Your package has been built on Linux, Mac, and Windows.
On one or more platforms, the build results were: "ERROR". This may mean there is a problem with the package that you need to fix. Or it may mean that there is a problem with the build system itself.
Please see the following build report for more details:
http://bioconductor.org/spb_reports/heatmaps_buildreport_20160930124611.html
Received a valid push; starting a build. Commits are:
5f4901b R CMD BUILD can now see vignettes
Dear Package contributor,
This is the automated single package builder at bioconductor.org.
Your package has been built on Linux, Mac, and Windows.
On one or more platforms, the build results were: "skipped, ERROR". This may mean there is a problem with the package that you need to fix. Or it may mean that there is a problem with the build system itself.
Please see the following build report for more details:
http://bioconductor.org/spb_reports/heatmaps_buildreport_20160930133034.html
Received a valid push; starting a build. Commits are:
c9e9d1b missed comma
Dear Package contributor,
This is the automated single package builder at bioconductor.org.
Your package has been built on Linux, Mac, and Windows.
Congratulations! The package built without errors or warnings on all platforms.
Please see the following build report for more details:
http://bioconductor.org/spb_reports/heatmaps_buildreport_20160930134927.html
Hi Malcolm,
Thanks for this new contribution and for working on the vignette. Do you have an ETA for when the better version of the vignette will be available? Please note that all the calls to plotHeatmap()
and plotHeatmapList()
in the vignette generate the following somewhat obscure warning:
Warning message in heatmapOptions(...) : Arguments are not heatmap options
Thanks, H.
We only start builds when the Version
field in the DESCRIPTION
file is incremented. For example, by changing
Version: 0.99.0
to
Version 0.99.1
If you did not intend to start a build, you don't need to
do anything. If you did want to start a build, increment
the Version:
field and try again.
Received a valid push; starting a build. Commits are:
555531d version bump
Hi Herve, the vignette should be complete by Monday, and I should have most of it up by the end of work today (UK time). I've fixed the warning in plotHeatmap.
Dear Package contributor,
This is the automated single package builder at bioconductor.org.
Your package has been built on Linux, Mac, and Windows.
Congratulations! The package built without errors or warnings on all platforms.
Please see the following build report for more details:
http://bioconductor.org/spb_reports/heatmaps_buildreport_20161007054506.html
Received a valid push; starting a build. Commits are:
6e3d6ed created user guide e1b821c remove silly file 90e26ef Added validity checking to Heatmaps 2ca989d fix legend text size 6d20fb2 fix incorrect messages during plotting b57b7b1 version bump e2305f1 fix bug in image size reduction ee5d7e6 add news file d776197 added vignette to complement user guide
Dear Package contributor,
This is the automated single package builder at bioconductor.org.
Your package has been built on Linux, Mac, and Windows.
On one or more platforms, the build results were: "skipped, ERROR". This may mean there is a problem with the package that you need to fix. Or it may mean that there is a problem with the build system itself.
Please see the following build report for more details:
http://bioconductor.org/spb_reports/heatmaps_buildreport_20161011081616.html
We only start builds when the Version
field in the DESCRIPTION
file is incremented. For example, by changing
Version: 0.99.0
to
Version 0.99.1
If you did not intend to start a build, you don't need to
do anything. If you did want to start a build, increment
the Version:
field and try again.
Received a valid push; starting a build. Commits are:
d97d4e8 plus version bump
@hpages, I have completed the vignette, hopefully it builds fine on the server (for some reason it could pass building on my machine despite missing files).
What is the limit on the size of data included with the package? I would like to include around 7MB in order to make pretty figures. These also take up a fair amount of room pushing the vignette to ~4MB, although I could drop the dpi it uses. Are either of these things an issue for Bioconductor?
Dear Package contributor,
This is the automated single package builder at bioconductor.org.
Your package has been built on Linux, Mac, and Windows.
Congratulations! The package built without errors or warnings on all platforms.
Please see the following build report for more details:
http://bioconductor.org/spb_reports/heatmaps_buildreport_20161011084555.html
We only start builds when the Version
field in the DESCRIPTION
file is incremented. For example, by changing
Version: 0.99.0
to
Version 0.99.1
If you did not intend to start a build, you don't need to
do anything. If you did want to start a build, increment
the Version:
field and try again.
Hi Malcolm,
Thanks for improving the vignette. The source tarball is already a little bit too big (9.3M, should occupy less than 4MB on disk per our guidelines) but I'll pass on this (please don't make it bigger). A more serious problem is that the vignette uses 4.8GB of memory, which is too much. R cannot allocate more than 2GB of memory on 32-bit Windows so people using that platform won't be able to run the code in the vignette. Please try to reduce the memory usage of the vignette.
Thanks, H.
Ah ok, I didn't realise it was that bad. The code I'm using is fairly memory-intensive, since it basically loads the whole image matrix into memory (one value per base pair) before operating on it.
It would be easy in theory load bin the data row-by-row but I'm not sure it would be easy to implement this in R without it getting very slow, ie how to do something like this:
tiles <- tile(windows, nbin)
matrix(mean(track[unlist(tiles)]), ncol=nbin, byrow=TRUE)
but summarising each row as it is read in.
For now I'll try and reduce the amount of data in the plots which should help. What's the easiest way to see the max memory usage involved in running an R file?
Also, presumably deleting large objects with rm()
and gc()
after they are needed could reduce peak memory usage?
Hi Malcolm,
To see the max memory usage involved in running an R file, I monitor memory usage by running the Unix top command in one terminal while I run the code in the file.in an another terminal.
Yes deleting large objects from the global environment after they are not needed anymore could help reduce peak memory usage. Note that doing this for temporary objects created inside a function wouldn't help because these objects are automatically deleted after the function returns.
H.
Hi Herve,
I think I've tracked down the source of the problem. Calling lapply
on a 12MB DNAStringSet results in an implicit list conversion, which according to lineprof results in allocation and deallocation of around 4GB of memory. Otherwise none of the objects/data are anything near that size, I think the single biggest matrix is around 50MB.
Dear Package contributor,
This is the automated single package builder at bioconductor.org.
Your package has been built on Linux, Mac, and Windows.
Congratulations! The package built without errors or warnings on all platforms.
Please see the following build report for more details:
http://bioconductor.org/spb_reports/heatmaps_buildreport_20161014082447.html
Hi Perry,
2 things:
1) Calling lapply()
on a DNAStringSet object calls the "lapply"
method for List objects which is defined as follow:
> selectMethod("lapply", "DNAStringSet")
Method Definition:
function (X, FUN, ...)
{
FUN <- match.fun(FUN)
ii <- seq_len(length(X))
names(ii) <- names(X)
lapply(ii, function(i) FUN(X[[i]], ...))
}
<environment: namespace:S4Vectors>
Signatures:
X
target "DNAStringSet"
defined "List"
As you can see it carefully avoids coercing X
to an ordinary list.
2) It's hard to believe that coercing a 12MB DNAStringSet object to list would trigger allocation of 4GB of memory. Trying this on a 7MB DNAStringSet object:
> library(Biostrings)
> library(hgu95av2probe)
> probes <- DNAStringSet(hgu95av2probe)
> probes
A DNAStringSet instance of length 201800
width seq
[1] 25 TGGCTCCTGCTGAGGTCCCCTTTCC
[2] 25 GGCTGTGAATTCCTGTACATATTTC
[3] 25 GCTTCAATTCCATTATGTTTTAATG
[4] 25 GCCGTTTGACAGAGCATGCTCTGCG
[5] 25 TGACAGAGCATGCTCTGCGTTGTTG
... ... ...
[201796] 25 AGATGGATAGCCTTCTGTCAAAGCA
[201797] 25 ATAGCCTTCTGTCAAAGCATCATCT
[201798] 25 TTCTGTCAAAGCATCATCTCAACAA
[201799] 25 CAAAGCATCATCTCAACAAGCCCTC
[201800] 25 GTGCTCCTTGTCAACAGCGCACCCA
> object.size(probes)
7470440 bytes
> gc()
used (Mb) gc trigger (Mb) max used (Mb)
Ncells 1778416 95.0 3205452 171.2 2023626 108.1
Vcells 3962403 30.3 6536286 49.9 4353747 33.3
The top
Unix command reports:
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
9466 biocbui+ 20 0 332080 166444 11224 S 0.0 0.5 0:02.84 R
Then:
> probes2 <- as.list(probes)
> gc()
used (Mb) gc trigger (Mb) max used (Mb)
Ncells 4603657 245.9 6861544 366.5 6861544 366.5
Vcells 4567830 34.9 7923543 60.5 5116576 39.1
Now the top
Unix command reports:
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
9466 biocbui+ 20 0 591860 426348 11340 S 0.0 1.3 1:27.10 R
So it looks like memory usage increased in a reasonable way.
Note that object.size()
is not reliable on probes2
:
> object.size(probes2)
1018507201640 bytes
That's because the list elements in probes2
are DNAString objects that all point to the same chunk of memory where the sequence data is stored:
> probes2[[1]]
25-letter "DNAString" instance
seq: TGGCTCCTGCTGAGGTCCCCTTTCC
> probes2[[1]]@shared
SharedRaw of length 5045000 (data starting at address 0x7fe475039038)
> object.size(probes2[[1]])
5047104 bytes
> probes2[[2]]
25-letter "DNAString" instance
seq: GGCTGTGAATTCCTGTACATATTTC
> probes2[[2]]@shared
SharedRaw of length 5045000 (data starting at address 0x7fe475039038)
> object.size(probes2[[2]])
5047104 bytes
This is actually the same chunk of memory where the sequence data in probes
is stored:
> probes@pool
SharedRaw_Pool of length 1
1: SharedRaw of length 5045000 (data starting at address 0x7fe475039038)
So as.list()
didn't copy the sequence data. The memory where the sequence data lives is shared between probes
and all the list elements in probes2
.
Hope this helps, H.
Hi Herve,
Thanks for the detailed response. I think I've misinterpreted the output of lineprof
, since it gives the total memory allocated and deallocated rather than the total in use at once. The functions I changed don't cause a huge rise in memory usage overall, although they do run much faster.
I think the memory usage is caused by the EBImage::blur
function. I'm not sure why it's using around 3GB of memory to blur a 60MB matrix, and it certainly isn't producing a large object as a result (the return value is ~1MB).
At this point I think the only option is to delete that example from the vignette, if memory usage is an issue. Including the result pre-calculated would increase the package size by a further megabyte. This would be unfortunate, since I think it's a nice example of what the package can do.
I'm not that familiar with gc
in R, but are you sure that this command would fail on a low memory system? My understanding is that gc()
is called when there is memory pressure, so it wouldn't be called on my system (with 8GB of RAM), or yours (I presume). Calling gc()
manually after the blur function returns frees around 3GB of RAM (which is the problematic part I think), so there are obviously unused object which are not reclaimed during function execution.
Unfortunately there is no OSX equivalent of memory.limit
so it's hard to tell if this is the case or not. I can install heatmaps and test this on a windows machine if this would help.
Hi Malcolm,
I'm not sure I understand what you're asking about gc()
. I called gc()
in my previous post only to show you memory usage. I was not suggesting that you use it in your code. It wouldn't help reduce memory usage and is actually considered bad practice.
Note that your smooth()
function calls the blur()
function from the spatstat package, not EBImage::blur()
, which BTW is defunct (was replaced by gblur
, see ?EBImage::blur
). If spatstat::blur()
is using around 3GB of memory to blur a 60MB matrix, then you should use something else.
Please don't call your smoothing function smooth
. That will conflict with well-known stats::smooth()
. Note that your package defines a smooth,ANY method that doesn't make sense:
> showMethods("smooth")
Function: smooth (package heatmaps)
heatmap="ANY"
heatmap="Heatmap"
> selectMethod("smooth", "ANY")
Method Definition (Class "derivedDefaultMethod"):
function (heatmap, ...)
StandardGeneric("smooth")
<environment: namespace:heatmaps>
Signatures:
heatmap
target "ANY"
defined "ANY"
> str(tata_pwm_scan_30p)
Formal class 'Heatmap' [package "heatmaps"] with 6 slots
..@ image : num [1:8365, 1:1000] 49.1 74 68.1 74.6 69.3 ...
..@ scale : num [1:2] 0 100
..@ coords : int [1:2] -500 500
..@ nseq : int 8365
..@ label : chr "TATA"
..@ metadata: list()
> smooth(tata_pwm_scan_30p@image)
Error in smooth(tata_pwm_scan_30p@image) :
could not find function "StandardGeneric"
Probably because of the misspelling of StandardGeneric
in your setGeneric
statement (it should be standardGeneric
with lower-case s).
Please use <-
instead of =
for assignment (in your vignettes, examples, etc...)
At this point I'm afraid that the package won't be ready for this release (we can't add new packages to the release after 11 am today, Seattle time). It's not a big deal, we'll add it to BioC 3.5 when it's ready. Sorry for that.
Thanks for your comprehension, H.
We only start builds when the Version
field in the DESCRIPTION
file is incremented. For example, by changing
Version: 0.99.0
to
Version 0.99.1
If you did not intend to start a build, you don't need to
do anything. If you did want to start a build, increment
the Version:
field and try again.
Hi Malcolm,
Just wanted to check with you what the status of the package is. Are you planning to follow-up with your submission?
Thanks, H.
Hi Herve! I was definitely planning to resubmit, and I've fixed the major issue with memory use by removing the dependency on spatstat, and reimplementing the functionality with the excellent EBImage. I wasn't sure when you were accepting submissions, but from the activity in the issues feed then I guess now is the time! I will make some changes and then resubmit the issue next week, I think it would be easier to have a clean start. If you can close this issue that would be great.
Hi Malcolm, I'd prefer if we continue the submission in this issue. Then all the submission/review history is in one place. Thanks, H.
Hi Herve, I think I've fixed many of the major performance issues in the previous versions. Specifically:
I've also added support for visualising clusters, and I will add this to vignette shortly. I also plan to improve the examples in the vignette so that they look better (but all the basic functions are there).
Thanks, Malcolm
Received a valid push; starting a build. Commits are:
22a960e clean up vignette and bump version
Dear Package contributor,
This is the automated single package builder at bioconductor.org.
Your package has been built on Linux, Mac, and Windows.
On one or more platforms, the build results were: "skipped, ERROR". This may mean there is a problem with the package that you need to fix. Or it may mean that there is a problem with the build system itself.
Please see the following build report for more details:
http://bioconductor.org/spb_reports/heatmaps_buildreport_20170327115315.html
Received a valid push; starting a build. Commits are:
957bf37 fix documentation and examples
Dear Package contributor,
This is the automated single package builder at bioconductor.org.
Your package has been built on Linux, Mac, and Windows.
On one or more platforms, the build results were: "skipped, ERROR". This may mean there is a problem with the package that you need to fix. Or it may mean that there is a problem with the build system itself.
Please see the following build report for more details:
http://bioconductor.org/spb_reports/heatmaps_buildreport_20170328102656.html
Dear Package contributor,
This is the automated single package builder at bioconductor.org.
Your package has been built on Linux, Mac, and Windows.
On one or more platforms, the build results were: "WARNINGS". This may mean there is a problem with the package that you need to fix. Or it may mean that there is a problem with the build system itself.
Please see the following build report for more details:
http://bioconductor.org/spb_reports/heatmaps_buildreport_20170330073624.html
Update the following URL to point to the GitHub repository of the package you wish to submit to Bioconductor
Confirm the following by editing each check box to '[x]'
I am familiar with the essential aspects of Bioconductor software management, including:
For help with submitting your package, please subscribe and post questions to the bioc-devel mailing list.