Bioconductor / Contributions

Contribute Packages to Bioconductor
134 stars 33 forks source link

heatmaps #151

Closed mgperry closed 7 years ago

mgperry commented 7 years ago

Update the following URL to point to the GitHub repository of the package you wish to submit to Bioconductor

Confirm the following by editing each check box to '[x]'

I am familiar with the essential aspects of Bioconductor software management, including:

For help with submitting your package, please subscribe and post questions to the bioc-devel mailing list.

bioc-issue-bot commented 7 years ago

Hi @mgperry

Thanks for submitting your package. We are taking a quick look at it and you will hear back from us soon.

The DESCRIPTION file for this package is:

Package: heatmaps
Title: Flexible Heatmaps for Functional Genomics and Sequence Features
Version: 1.2.0
Date: 18-01-2016
Author: Malcolm Perry <mgperry32@gmail.com>
Maintainer: Malcolm Perry <mgperry32@gmail.com>
Depends:
    R (>= 3.4)
Imports:
    Biostrings,
    GenomicRanges,
    IRanges,
    KernSmooth,
    plotrix,
    methods,
    Matrix,
    genomation,
    EBImage,
    spatstat,
    RColorBrewer,
    BiocGenerics,
    GenomeInfoDb
Suggests:
    BSgenome.Drerio.UCSC.danRer7,
    knitr
biocViews: Visualization, SequenceMatching
License: Artistic-2.0
Description: This package provides functions for plotting heatmaps of genome-wide
   data across genomic intervals, such as ChIP-seq signals at peaks or across promoters.
   Many functions are also provided for investigating sequence features.
   biocViews: Visualization, SequenceMatching
Collate:
    Heatmap-class.R
    PlotHeatmap.R
    PlotHeatmapList.R
    PlotPatternDensityMap.R
    PWMScanHeatmap.R
    PatternHeatmap.R
    CoverageHeatmap.R
    smooth.R
    plotHeatmapMeta.R
    Data.R
NeedsCompilation: no
RoxygenNote: 5.0.1
mtmorgan commented 7 years ago

@mgperry if this is a new package then its version should be 0.99.0. I guess you have an existing version of your package somewhere. Can you arrange for your version number x.y.z to have an odd 'y', because it will be added to the 'devel' branch and in the devel branch y is odd. z should be larger than any existing z for the odd numbered y. The version number will be bumped to x.y+1.0 in the next release.

lawremi commented 7 years ago

Checked out the first file.

  1. windowViews() could be simplified to cog[gr] although I don't think that reverses elements with negative strand. Maybe it should, @hpages? A workaround for now would be using revElements() instead of the current loop. To convert a List to a matrix, just use as.matrix() instead of the current loop/rbind construct.
  2. Be careful about using {() in generic definitions. Just do function(x) standardGeneric("foo"). Using {(), the generic becomes a non-standard generic that just delegates to the standard one.
  3. CoverageHeatmap() methods emit error messages referring to DNAStringSet, which does not seem relevant.
  4. My intuition would be for the track argument to come first, before the regions of interest, since it is the coverage that we are plotting.
  5. The call to ScoreMatrixBin() and following matrix coercion could be replaced with simple code like:
tiles <- tile(windows, nbin)
matrix(mean(track[unlist(tiles)]), ncol=nbin, byrow=TRUE)
mgperry commented 7 years ago

Thanks Michael, that's all very helpful. I'll look into the generics, I have to say I'm not familiar enough with S4 to know the difference. You're definitely right about the error messages. I'm not sure I share your intuition about argument order, and I think what there is currently aligns better with the PatternHeatmap constructor.

I'm not terribly sure how cov[gr] works - I think last time the subject came up on the devel list, a similar idiom to what I'm using was proposed but it's obviously been superseded (I've always thought it odd that such a convoluted step was necessary). I'll test to see if it does reverse ranges, but either way it will be quicker to invert the rows of the resulting matrix afterwards. I also didn't know about tile(), I would like to eliminate the dependency if possible.

bioc-issue-bot commented 7 years ago

Your package has been approved for building. Your package is now submitted to our queue.

IMPORTANT: Please read the instructions for setting up a push hook on your repository, or further changes to your repository will NOT trigger a new build.

lawremi commented 7 years ago

The vignette needs some work. It should be demonstrating the software with a real-world use case. Instead, it directly dives into the low-level details of the Heatmap class, and then more or less lists the types of plots it can make, only occasionally using real data, outside of a cohesive workflow. It's fine to summarize the capabilities in the introduction, and dive into details at the end, but the vignette needs to be better motivated. Also, please do not use str(x) to describe the structure of a class. The show() method should already make the user-level details obvious.

mgperry commented 7 years ago

@lawremi I've implemented those changes in mgperry@heatmaps/4f60e670 and mgperry@heatmaps/d68d6c.

cog[gr] does in fact ignore strand, but revElements works fine. I don't think as.matrix works directly, it produces a matrix of Rles (which must surely be automatic - I doubt it's intended), but the code in the tiles example works fine, swapping mean for unlist(use.name=F).

@mtmorgan The package hasn't been uploaded elsewhere, I've set the version back to 0.99.0

mgperry commented 7 years ago

@lawremi Entirely agree with your comments on the vignette, the submission deadline took me by suprise a bit and I focused on documentation and the code itself, I should have a much better version soon.

bioc-issue-bot commented 7 years ago

We only start builds when the Version field in the DESCRIPTION file is incremented. For example, by changing

Version: 0.99.0

to

Version 0.99.1

If you did not intend to start a build, you don't need to do anything. If you did want to start a build, increment the Version: field and try again.

lawremi commented 7 years ago

I fixed the matrix coercion so that it will decode the Rle. IRanges 2.7.16.

bioc-issue-bot commented 7 years ago

Received a valid push; starting a build. Commits are:

9dd7003 new as.matrix coercion from IRanges 230feef methods import 5a051b5 add @value to functions (bioccheck) b2194e7 version bump for build bot

bioc-issue-bot commented 7 years ago

Dear Package contributor,

This is the automated single package builder at bioconductor.org.

Your package has been built on Linux, Mac, and Windows.

On one or more platforms, the build results were: "ERROR". This may mean there is a problem with the package that you need to fix. Or it may mean that there is a problem with the build system itself.

Please see the following build report for more details:

http://bioconductor.org/spb_reports/heatmaps_buildreport_20160930123334.html

bioc-issue-bot commented 7 years ago

Received a valid push; starting a build. Commits are:

54354e9 add data for examples (previously excluded by giti...

bioc-issue-bot commented 7 years ago

Dear Package contributor,

This is the automated single package builder at bioconductor.org.

Your package has been built on Linux, Mac, and Windows.

On one or more platforms, the build results were: "ERROR". This may mean there is a problem with the package that you need to fix. Or it may mean that there is a problem with the build system itself.

Please see the following build report for more details:

http://bioconductor.org/spb_reports/heatmaps_buildreport_20160930124611.html

bioc-issue-bot commented 7 years ago

Received a valid push; starting a build. Commits are:

5f4901b R CMD BUILD can now see vignettes

bioc-issue-bot commented 7 years ago

Dear Package contributor,

This is the automated single package builder at bioconductor.org.

Your package has been built on Linux, Mac, and Windows.

On one or more platforms, the build results were: "skipped, ERROR". This may mean there is a problem with the package that you need to fix. Or it may mean that there is a problem with the build system itself.

Please see the following build report for more details:

http://bioconductor.org/spb_reports/heatmaps_buildreport_20160930133034.html

bioc-issue-bot commented 7 years ago

Received a valid push; starting a build. Commits are:

c9e9d1b missed comma

bioc-issue-bot commented 7 years ago

Dear Package contributor,

This is the automated single package builder at bioconductor.org.

Your package has been built on Linux, Mac, and Windows.

Congratulations! The package built without errors or warnings on all platforms.

Please see the following build report for more details:

http://bioconductor.org/spb_reports/heatmaps_buildreport_20160930134927.html

hpages commented 7 years ago

Hi Malcolm,

Thanks for this new contribution and for working on the vignette. Do you have an ETA for when the better version of the vignette will be available? Please note that all the calls to plotHeatmap() and plotHeatmapList() in the vignette generate the following somewhat obscure warning:

    Warning message in heatmapOptions(...) : Arguments are not heatmap options

Thanks, H.

bioc-issue-bot commented 7 years ago

We only start builds when the Version field in the DESCRIPTION file is incremented. For example, by changing

Version: 0.99.0

to

Version 0.99.1

If you did not intend to start a build, you don't need to do anything. If you did want to start a build, increment the Version: field and try again.

bioc-issue-bot commented 7 years ago

Received a valid push; starting a build. Commits are:

555531d version bump

mgperry commented 7 years ago

Hi Herve, the vignette should be complete by Monday, and I should have most of it up by the end of work today (UK time). I've fixed the warning in plotHeatmap.

bioc-issue-bot commented 7 years ago

Dear Package contributor,

This is the automated single package builder at bioconductor.org.

Your package has been built on Linux, Mac, and Windows.

Congratulations! The package built without errors or warnings on all platforms.

Please see the following build report for more details:

http://bioconductor.org/spb_reports/heatmaps_buildreport_20161007054506.html

bioc-issue-bot commented 7 years ago

Received a valid push; starting a build. Commits are:

6e3d6ed created user guide e1b821c remove silly file 90e26ef Added validity checking to Heatmaps 2ca989d fix legend text size 6d20fb2 fix incorrect messages during plotting b57b7b1 version bump e2305f1 fix bug in image size reduction ee5d7e6 add news file d776197 added vignette to complement user guide

bioc-issue-bot commented 7 years ago

Dear Package contributor,

This is the automated single package builder at bioconductor.org.

Your package has been built on Linux, Mac, and Windows.

On one or more platforms, the build results were: "skipped, ERROR". This may mean there is a problem with the package that you need to fix. Or it may mean that there is a problem with the build system itself.

Please see the following build report for more details:

http://bioconductor.org/spb_reports/heatmaps_buildreport_20161011081616.html

bioc-issue-bot commented 7 years ago

We only start builds when the Version field in the DESCRIPTION file is incremented. For example, by changing

Version: 0.99.0

to

Version 0.99.1

If you did not intend to start a build, you don't need to do anything. If you did want to start a build, increment the Version: field and try again.

bioc-issue-bot commented 7 years ago

Received a valid push; starting a build. Commits are:

d97d4e8 plus version bump

mgperry commented 7 years ago

@hpages, I have completed the vignette, hopefully it builds fine on the server (for some reason it could pass building on my machine despite missing files).

What is the limit on the size of data included with the package? I would like to include around 7MB in order to make pretty figures. These also take up a fair amount of room pushing the vignette to ~4MB, although I could drop the dpi it uses. Are either of these things an issue for Bioconductor?

bioc-issue-bot commented 7 years ago

Dear Package contributor,

This is the automated single package builder at bioconductor.org.

Your package has been built on Linux, Mac, and Windows.

Congratulations! The package built without errors or warnings on all platforms.

Please see the following build report for more details:

http://bioconductor.org/spb_reports/heatmaps_buildreport_20161011084555.html

bioc-issue-bot commented 7 years ago

We only start builds when the Version field in the DESCRIPTION file is incremented. For example, by changing

Version: 0.99.0

to

Version 0.99.1

If you did not intend to start a build, you don't need to do anything. If you did want to start a build, increment the Version: field and try again.

hpages commented 7 years ago

Hi Malcolm,

Thanks for improving the vignette. The source tarball is already a little bit too big (9.3M, should occupy less than 4MB on disk per our guidelines) but I'll pass on this (please don't make it bigger). A more serious problem is that the vignette uses 4.8GB of memory, which is too much. R cannot allocate more than 2GB of memory on 32-bit Windows so people using that platform won't be able to run the code in the vignette. Please try to reduce the memory usage of the vignette.

Thanks, H.

mgperry commented 7 years ago

Ah ok, I didn't realise it was that bad. The code I'm using is fairly memory-intensive, since it basically loads the whole image matrix into memory (one value per base pair) before operating on it.

It would be easy in theory load bin the data row-by-row but I'm not sure it would be easy to implement this in R without it getting very slow, ie how to do something like this:

tiles <- tile(windows, nbin)
matrix(mean(track[unlist(tiles)]), ncol=nbin, byrow=TRUE)

but summarising each row as it is read in.

For now I'll try and reduce the amount of data in the plots which should help. What's the easiest way to see the max memory usage involved in running an R file?

mgperry commented 7 years ago

Also, presumably deleting large objects with rm() and gc() after they are needed could reduce peak memory usage?

hpages commented 7 years ago

Hi Malcolm,

To see the max memory usage involved in running an R file, I monitor memory usage by running the Unix top command in one terminal while I run the code in the file.in an another terminal.

Yes deleting large objects from the global environment after they are not needed anymore could help reduce peak memory usage. Note that doing this for temporary objects created inside a function wouldn't help because these objects are automatically deleted after the function returns.

H.

bioc-issue-bot commented 7 years ago

Received a valid push; starting a build. Commits are:

0665e54 import IRanges::mean 0105c9e better scales for vignette bbb9fe3 fixed odd behaviour with binned heatmaps 0624c25 regen documentation + version bump

mgperry commented 7 years ago

Hi Herve,

I think I've tracked down the source of the problem. Calling lapply on a 12MB DNAStringSet results in an implicit list conversion, which according to lineprof results in allocation and deallocation of around 4GB of memory. Otherwise none of the objects/data are anything near that size, I think the single biggest matrix is around 50MB.

bioc-issue-bot commented 7 years ago

Dear Package contributor,

This is the automated single package builder at bioconductor.org.

Your package has been built on Linux, Mac, and Windows.

Congratulations! The package built without errors or warnings on all platforms.

Please see the following build report for more details:

http://bioconductor.org/spb_reports/heatmaps_buildreport_20161014082447.html

hpages commented 7 years ago

Hi Perry,

2 things:

1) Calling lapply() on a DNAStringSet object calls the "lapply" method for List objects which is defined as follow:

> selectMethod("lapply", "DNAStringSet")
Method Definition:

function (X, FUN, ...) 
{
    FUN <- match.fun(FUN)
    ii <- seq_len(length(X))
    names(ii) <- names(X)
    lapply(ii, function(i) FUN(X[[i]], ...))
}
<environment: namespace:S4Vectors>

Signatures:
        X             
target  "DNAStringSet"
defined "List"        

As you can see it carefully avoids coercing X to an ordinary list.

2) It's hard to believe that coercing a 12MB DNAStringSet object to list would trigger allocation of 4GB of memory. Trying this on a 7MB DNAStringSet object:

> library(Biostrings)
> library(hgu95av2probe)
> probes <- DNAStringSet(hgu95av2probe)
> probes
  A DNAStringSet instance of length 201800
         width seq
     [1]    25 TGGCTCCTGCTGAGGTCCCCTTTCC
     [2]    25 GGCTGTGAATTCCTGTACATATTTC
     [3]    25 GCTTCAATTCCATTATGTTTTAATG
     [4]    25 GCCGTTTGACAGAGCATGCTCTGCG
     [5]    25 TGACAGAGCATGCTCTGCGTTGTTG
     ...   ... ...
[201796]    25 AGATGGATAGCCTTCTGTCAAAGCA
[201797]    25 ATAGCCTTCTGTCAAAGCATCATCT
[201798]    25 TTCTGTCAAAGCATCATCTCAACAA
[201799]    25 CAAAGCATCATCTCAACAAGCCCTC
[201800]    25 GTGCTCCTTGTCAACAGCGCACCCA
> object.size(probes)
7470440 bytes
> gc()
          used (Mb) gc trigger  (Mb) max used  (Mb)
Ncells 1778416 95.0    3205452 171.2  2023626 108.1
Vcells 3962403 30.3    6536286  49.9  4353747  33.3

The top Unix command reports:

  PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND    
 9466 biocbui+  20   0  332080 166444  11224 S   0.0  0.5   0:02.84 R       

Then:

> probes2 <- as.list(probes)
> gc()
          used  (Mb) gc trigger  (Mb) max used  (Mb)
Ncells 4603657 245.9    6861544 366.5  6861544 366.5
Vcells 4567830  34.9    7923543  60.5  5116576  39.1

Now the top Unix command reports:

  PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
 9466 biocbui+  20   0  591860 426348  11340 S   0.0  1.3   1:27.10 R       

So it looks like memory usage increased in a reasonable way. Note that object.size() is not reliable on probes2:

> object.size(probes2)
1018507201640 bytes

That's because the list elements in probes2 are DNAString objects that all point to the same chunk of memory where the sequence data is stored:

> probes2[[1]]
  25-letter "DNAString" instance
seq: TGGCTCCTGCTGAGGTCCCCTTTCC
> probes2[[1]]@shared
SharedRaw of length 5045000 (data starting at address 0x7fe475039038)
> object.size(probes2[[1]])
5047104 bytes
> probes2[[2]]
  25-letter "DNAString" instance
seq: GGCTGTGAATTCCTGTACATATTTC
> probes2[[2]]@shared
SharedRaw of length 5045000 (data starting at address 0x7fe475039038)
> object.size(probes2[[2]])
5047104 bytes

This is actually the same chunk of memory where the sequence data in probes is stored:

> probes@pool
SharedRaw_Pool of length 1
1: SharedRaw of length 5045000 (data starting at address 0x7fe475039038)

So as.list() didn't copy the sequence data. The memory where the sequence data lives is shared between probes and all the list elements in probes2.

Hope this helps, H.

mgperry commented 7 years ago

Hi Herve,

Thanks for the detailed response. I think I've misinterpreted the output of lineprof, since it gives the total memory allocated and deallocated rather than the total in use at once. The functions I changed don't cause a huge rise in memory usage overall, although they do run much faster.

I think the memory usage is caused by the EBImage::blur function. I'm not sure why it's using around 3GB of memory to blur a 60MB matrix, and it certainly isn't producing a large object as a result (the return value is ~1MB).

At this point I think the only option is to delete that example from the vignette, if memory usage is an issue. Including the result pre-calculated would increase the package size by a further megabyte. This would be unfortunate, since I think it's a nice example of what the package can do.

I'm not that familiar with gc in R, but are you sure that this command would fail on a low memory system? My understanding is that gc()is called when there is memory pressure, so it wouldn't be called on my system (with 8GB of RAM), or yours (I presume). Calling gc() manually after the blur function returns frees around 3GB of RAM (which is the problematic part I think), so there are obviously unused object which are not reclaimed during function execution.

Unfortunately there is no OSX equivalent of memory.limit so it's hard to tell if this is the case or not. I can install heatmaps and test this on a windows machine if this would help.

hpages commented 7 years ago

Hi Malcolm,

I'm not sure I understand what you're asking about gc(). I called gc() in my previous post only to show you memory usage. I was not suggesting that you use it in your code. It wouldn't help reduce memory usage and is actually considered bad practice.

Note that your smooth() function calls the blur() function from the spatstat package, not EBImage::blur(), which BTW is defunct (was replaced by gblur, see ?EBImage::blur). If spatstat::blur() is using around 3GB of memory to blur a 60MB matrix, then you should use something else.

Please don't call your smoothing function smooth. That will conflict with well-known stats::smooth(). Note that your package defines a smooth,ANY method that doesn't make sense:

> showMethods("smooth")
Function: smooth (package heatmaps)
heatmap="ANY"
heatmap="Heatmap"

> selectMethod("smooth", "ANY")
Method Definition (Class "derivedDefaultMethod"):

function (heatmap, ...) 
StandardGeneric("smooth")
<environment: namespace:heatmaps>

Signatures:
        heatmap
target  "ANY"  
defined "ANY"  

> str(tata_pwm_scan_30p)
Formal class 'Heatmap' [package "heatmaps"] with 6 slots
  ..@ image   : num [1:8365, 1:1000] 49.1 74 68.1 74.6 69.3 ...
  ..@ scale   : num [1:2] 0 100
  ..@ coords  : int [1:2] -500 500
  ..@ nseq    : int 8365
  ..@ label   : chr "TATA"
  ..@ metadata: list()

> smooth(tata_pwm_scan_30p@image)
Error in smooth(tata_pwm_scan_30p@image) : 
  could not find function "StandardGeneric"

Probably because of the misspelling of StandardGeneric in your setGeneric statement (it should be standardGeneric with lower-case s).

Please use <- instead of = for assignment (in your vignettes, examples, etc...)

At this point I'm afraid that the package won't be ready for this release (we can't add new packages to the release after 11 am today, Seattle time). It's not a big deal, we'll add it to BioC 3.5 when it's ready. Sorry for that.

Thanks for your comprehension, H.

bioc-issue-bot commented 7 years ago

We only start builds when the Version field in the DESCRIPTION file is incremented. For example, by changing

Version: 0.99.0

to

Version 0.99.1

If you did not intend to start a build, you don't need to do anything. If you did want to start a build, increment the Version: field and try again.

hpages commented 7 years ago

Hi Malcolm,

Just wanted to check with you what the status of the package is. Are you planning to follow-up with your submission?

Thanks, H.

mgperry commented 7 years ago

Hi Herve! I was definitely planning to resubmit, and I've fixed the major issue with memory use by removing the dependency on spatstat, and reimplementing the functionality with the excellent EBImage. I wasn't sure when you were accepting submissions, but from the activity in the issues feed then I guess now is the time! I will make some changes and then resubmit the issue next week, I think it would be easier to have a clean start. If you can close this issue that would be great.

hpages commented 7 years ago

Hi Malcolm, I'd prefer if we continue the submission in this issue. Then all the submission/review history is in one place. Thanks, H.

mgperry commented 7 years ago

Hi Herve, I think I've fixed many of the major performance issues in the previous versions. Specifically:

I've also added support for visualising clusters, and I will add this to vignette shortly. I also plan to improve the examples in the vignette so that they look better (but all the basic functions are there).

Thanks, Malcolm

bioc-issue-bot commented 7 years ago

Received a valid push; starting a build. Commits are:

22a960e clean up vignette and bump version

bioc-issue-bot commented 7 years ago

Dear Package contributor,

This is the automated single package builder at bioconductor.org.

Your package has been built on Linux, Mac, and Windows.

On one or more platforms, the build results were: "skipped, ERROR". This may mean there is a problem with the package that you need to fix. Or it may mean that there is a problem with the build system itself.

Please see the following build report for more details:

http://bioconductor.org/spb_reports/heatmaps_buildreport_20170327115315.html

bioc-issue-bot commented 7 years ago

Received a valid push; starting a build. Commits are:

957bf37 fix documentation and examples

bioc-issue-bot commented 7 years ago

Dear Package contributor,

This is the automated single package builder at bioconductor.org.

Your package has been built on Linux, Mac, and Windows.

On one or more platforms, the build results were: "skipped, ERROR". This may mean there is a problem with the package that you need to fix. Or it may mean that there is a problem with the build system itself.

Please see the following build report for more details:

http://bioconductor.org/spb_reports/heatmaps_buildreport_20170328102656.html

bioc-issue-bot commented 7 years ago

Dear Package contributor,

This is the automated single package builder at bioconductor.org.

Your package has been built on Linux, Mac, and Windows.

On one or more platforms, the build results were: "WARNINGS". This may mean there is a problem with the package that you need to fix. Or it may mean that there is a problem with the build system itself.

Please see the following build report for more details:

http://bioconductor.org/spb_reports/heatmaps_buildreport_20170330073624.html