Bioconductor / Contributions

Contribute Packages to Bioconductor
135 stars 33 forks source link

GenomicScores #324

Closed rcastelo closed 7 years ago

rcastelo commented 7 years ago

Update the following URL to point to the GitHub repository of the package you wish to submit to Bioconductor

Confirm the following by editing each check box to '[x]'

I am familiar with the essential aspects of Bioconductor software management, including:

For help with submitting your package, please subscribe and post questions to the bioc-devel mailing list.

bioc-issue-bot commented 7 years ago

Hi @rcastelo

Thanks for submitting your package. We are taking a quick look at it and you will hear back from us soon.

The DESCRIPTION file for this package is:

Package: GenomicScores
Type: Package
Title: Infrastructure to work with genomewide position-specific scores
Description: Provide infrastructure to store and access genomewide position-speicfic scores within R and Bioconductor.
Version: 0.99.0
Authors@R: c(person("Robert", "Castelo", role=c("aut", "cre"),
    email="robert.castelo@upf.edu"), person("Pau", "Puigdevall",
    role="ctb", email="pau.puigdevall@upf.edu"))
License: Artistic-2.0
Depends: R (>= 3.4), S4Vectors (>= 0.7.21), GenomicRanges, methods,
    BiocGenerics (>= 0.13.8)
Imports: utils, IRanges (>= 2.3.23), BSgenome, GenomeInfoDb,
    AnnotationHub, AnnotationHubData
Suggests: BiocStyle, knitr, rmarkdown, BSgenome.Hsapiens.UCSC.hg19,
    phastCons100way.UCSC.hg19
VignetteBuilder: knitr
URL: https://github.com/rcastelo/GenomicScores
BugReports: https://github.com/rcastelo/GenomicScores/issues
Encoding: UTF-8
LazyData: yes
biocViews: Genetics, Annotation, Sequencing, Coverage
bioc-issue-bot commented 7 years ago

Your package has been approved for building. Your package is now submitted to our queue.

IMPORTANT: Please read the instructions for setting up a push hook on your repository, or further changes to your repository will NOT trigger a new build.

bioc-issue-bot commented 7 years ago

Dear Package contributor,

This is the automated single package builder at bioconductor.org.

Your package has been built on Linux, Mac, and Windows.

On one or more platforms, the build results were: "skipped, ERROR". This may mean there is a problem with the package that you need to fix. Or it may mean that there is a problem with the build system itself.

Please see the following build report for more details:

http://bioconductor.org/spb_reports/GenomicScores_buildreport_20170318042321.html

lawremi commented 7 years ago

Interesting idea. The vignette should demonstrate how these data might be used in a few typical cases.

I'm not sure the user really wants an opaque SNPlocs-like object here. The tabular GPos seems more intuitive and directly useful. A more efficient and useful backend would be BigWig, instead of per-sequence RDS files. BigWig gives you fast range-based queries, tile-based summaries, and zooming in visualizations. What about a lazy GPos(-like) object, backed by a BigWig? It would yield an in-memory GPos after the first restriction.

I guess the primary use case is a read-and-summarize over a set of ranges, i.e., the scores() method. GPos might need more summarizing capability, like RleViews. It would be nice to yield a GRanges when summarizing over ranges, instead of just a vector, as with scores(). Just some initial impressions.

rcastelo commented 7 years ago

@lawremi thanks for taking a look. In fact the idea to make this package started from a suggestion you made to me a couple of years ago to factorize out from VariantFiltering the code that provides support for phastCons annotation packages, and avoid loading unnecessary stuff when using phastCons packages. In the end, we have attempted to give a more general support (beyond phastCons scores), taking also the suggestion of Martin he made to me also time ago about putting genomic scores in AnnotationHub (Valerie and Martin have helped us with that). Regarding your suggestions, I tried in the past to use GPos objects but at least in my hands they required more disk space and gave poorer performance than Rle vectors. Regarding BigWig, sounds as a good alternative to load Rle objects but I'm not familiar with them beyond using rtrackalyer::import.bw(), is there an R i/f to write BigWig files?

Since in principle one is querying with a GRanges, I thought it is unnecessary to return the same object again, which may increase memory requirements, specially if querying tens of thousands of genomic positions, as in, e.g., exome analysis.

@hpages hi Hervé, if i understand the messages above correctly, you're handling this submission. the build is breaking at the vignette because it is expecting to use a newer version (3.5) of the annotation package phastCons100way.UCSC.hg19. since this is not an experiment data package i did not formally add it here but it can be downloaded from:

http://functionalgenomics.upf.edu/ongoing/phastCons100way.UCSC.hg19_3.5.0.tar.gz

this package depends on GenomicScores, so i guess we have some circularity here. in my computer, i built GenomicScores w/o vignettes, install it, then install phastCons100way.UCSC.hg19 v3.5 and then GenomicScores builds, checks and BiocChecks cleanly. how do you think we could deal with this here?

lawremi commented 7 years ago

Yes, rtracklayer writes BigWig files. I think it's a good idea to yield a GRanges from the summary calculation. Think of how aggregate() returns a data.frame that includes grouping factor(s). It's just one element per query range, right? So it's on the same order as the input.

bioc-issue-bot commented 7 years ago

Received a valid push; starting a build. Commits are:

e1fc442 Fixed vignette and examples until the newer phastC...

rcastelo commented 7 years ago

@lawremi thanks for the clarification, it sounds good to me, i'll have to wait however to implement that change until the newer phast100way.UCSC.hg19 version 3.5 becomes available. By now i've just pushed a change to make the vignette work with the version 3.4 of this annotation package, so that the whole build, check and BiocCheck can go forward.

bioc-issue-bot commented 7 years ago

Dear Package contributor,

This is the automated single package builder at bioconductor.org.

Your package has been built on Linux, Mac, and Windows.

On one or more platforms, the build results were: "WARNINGS". This may mean there is a problem with the package that you need to fix. Or it may mean that there is a problem with the build system itself.

Please see the following build report for more details:

http://bioconductor.org/spb_reports/GenomicScores_buildreport_20170322073205.html

bioc-issue-bot commented 7 years ago

Received a valid push; starting a build. Commits are:

674a539 Fix to avoid the WARNING on using AnnotationForge:...

bioc-issue-bot commented 7 years ago

Dear Package contributor,

This is the automated single package builder at bioconductor.org.

Your package has been built on Linux, Mac, and Windows.

Congratulations! The package built without errors or warnings on all platforms.

Please see the following build report for more details:

http://bioconductor.org/spb_reports/GenomicScores_buildreport_20170327113127.html

hpages commented 7 years ago

Hi Robert,

Thanks for your submission.

First a comment on the design of the GScores container. I guess it's ok and you can always change this at some point in the future but I wanted to point out that it would have been better to follow the design of the new SNPlocs container instead of the old one. The new one (introduced in BioC devel) uses an OnDiskLongTable with spatial index. It's a cleaner design and it's much more efficient than the old design, in particular for querying data by genomic ranges. As you know (because you have some responsibility for this recent move ;-) ) this new design is very recent. Also these things are BSgenome internal business so not documented so not necessarily easy to re-use in the context of GenomicScores at the moment. All this to say that I perfectly understand that you didn't rush into this.

Making OnDiskLongTable easier to re-use by other developpers is actually something I want to remedy in the near future by exposing and documenting it. My long-term goal with OnDiskLongTable, GPos, and long-Rle objects is to be able to represent nucleotide-level annotations along a full genome in a single object that is compact on disk and allows efficient querying. It would be similar to the kind of lazy GPos(-like) object that Michael mentioned but backed by an OnDiskLongTable object (where the columns can be anything) instead of a BigWig (where the data can only be numeric).

Otherwise, I found a few issues while playing around with the package. See below for the details.

Cheers, H.

  1. Typo in Description field: speicfic

  2. The package has no data/ folder so LazyData: yes has no effect. (Generally speaking the use of LazyData: yes is discouraged.)

  3. Are you sure you need to depend/import AnnotationHubData. You don't seem to use it anywhere in the package.

  4. With the latest GenomicScores (0.99.2) I get:

    > availableGScores()
    Error in h1$links() : attempt to apply non-function
  5. Now that you have your own .getSubDirs() function, it would make sense to use it in the inst/scripts/make-metadata_* scripts instead of using AnnotationForge:::.getSubDirs().

  6. The scores() generic defined in VariantFiltering masks the scores() generic defined in GenomicScores if for example I load annotation package phastCons100way.UCSC.hg19 (which depends on VariantFiltering) like in the man page for availableGSscores() where you have:

    if (require(phastCons100way.UCSC.hg19)) 

    and then, later on:

    gsco <- getGScores("phastCons100way.UCSC.hg19")
    scores(gsco, GRanges(seqnames="chr7", IRanges(start=117232380, width=5)))

    This last line of code doesn't work because the scores() generic defined in GenomicScores and its method for signature GScores,GRanges are now masked. But maybe this is just a temporary situation? Is the plan to have VariantFiltering depend on GenomicScores once the latter is added to svn and remove the scores() generic from the former?

  7. The GScores class has exactly the same slots as your PhastConsDb class defined in VariantFiltering. Is the former intended to replace the latter once GenomicScores is added to svn?

  8. The running examples are using a PhastConsDb object. There are also examples using a GScores object but all of them are in a \dontrun statement. Since GScores objects are at the center of the package, you would need to provide running examples using these objects. But again, maybe that's just a temporary situation?

  9. Please document and provide examples for the qfun() and dqfun() functions. I have no idea what they are

  10. There are 2 issues with the current gpos argument of scores():

    a) I would argue that it should accept any kind of GenomicRanges object i.e. not only GRanges, but also GPos, DelegatingGenomicRanges, GNCList, etc... and more generally any subclass of virtual class GenomicRanges. Concretely that means that the "scores" method with signature GScores,GRanges would just need to have its signature changed to GScores,GenomicRanges. That's all. The implementation of the method doesn't need to be modified because AFAICT the operations you do on gpos (e.g. length(), seqlevelsStyle(), seqnames(), etc) work on any GenomicRanges object.

    b) Also the name of the gpos argument is confusing because it suggests that it needs to be a GPos object which is not the case. A more conventional name for such argument is ranges (e.g. like all the *ByOverlaps() functions in GenomicFeatures and snpsByOverlaps() in BSgenome).

  11. According to its man page the "scores" method with signature GScores,GRanges accepts extra arguments summaryFunFunction and coercionFunFunction. However the real names seem to be summaryFun and coercionFun.

  12. Trying to pass a function via the summaryFun or coercionFun argument doesn't work:

    > scores(gsco, GRanges("chr7", IRanges(start=117232380, width=5)), summaryFun=max)
    Error in summaryFun != "mean" : 
      comparison (2) is possible only for atomic and list types

    It works only if the function name is passed as a string (e.g. "max"). However one should be able to specify a callback function by passing the function symbol (e.g. summaryFun=max) or an anonymous function (e.g. summaryFun=function(x) sum(is.na(x))). The standard mechanism to achieve this is simply to call match.fun() on the summaryFun or coercionFun argument internally e.g.

    summaryFun <- match.fun(summaryFun)

    See for example the 1st line in base::lapply's body.

  13. I'm not exactly sure why I get these results when summarizing with sum() or prod():

    > scores(gsco, GRanges("chr7", IRanges(start=117232380:117232384, width=1)))
    [1] 0.8 0.8 1.0 1.0 1.0
    > scores(gsco, GRanges("chr7", IRanges(start=117232380, width=5)), summaryFun="sum")
    [1] 5
    > scores(gsco, GRanges("chr7", IRanges(start=117232380, width=5)), summaryFun="prod")
    [1] 10781
  14. Also handling of NAs seems problematic:

    > scores(gsco, GRanges("chr1", IRanges(10915:10922, width=1)))
    [1]  NA  NA  NA 0.3 0.3 0.2 0.2 0.2
    > scores(gsco, GRanges("chr1:10915-10922"), summaryFun="mean")
    [1] 0.1125   # should be NA
    > scores(gsco, GRanges("chr1:10915-10922"), summaryFun="max")
    [1] 0.3      # should be NA
    > scores(gsco, GRanges("chr1:10915-10922"), summaryFun="min")
    [1] NA       # OK
  15. Section 6 of the vignette is about low-level technical details. Although potentially of interest to some developpers or tech-oriented users, this is not really useful to the average user. So I'm not sure the vignette is the right place for this kind of information.

  16. Overall, I feel that the vignette focuses too much on the technical aspects of the GScores container. I agree with Michael that it should instead demonstrate how these data might be used in a few typical cases.

hpages commented 7 years ago

Hi Robert, You might also want to add the Infrastructure term to the biocViews field of the package. Thanks, H.

bioc-issue-bot commented 7 years ago

Received a valid push; starting a build. Commits are:

20d3aa4 Various changes and fixes to address the BioC revi...

bioc-issue-bot commented 7 years ago

Dear Package contributor,

This is the automated single package builder at bioconductor.org.

Your package has been built on Linux, Mac, and Windows.

On one or more platforms, the build results were: "skipped, ERROR". This may mean there is a problem with the package that you need to fix. Or it may mean that there is a problem with the build system itself.

Please see the following build report for more details:

http://bioconductor.org/spb_reports/GenomicScores_buildreport_20170330151731.html

rcastelo commented 7 years ago

Hi Hervé,

Thank you very much for your thorough review of the package and all the useful remarks. I have pushed changes to the repo that, in principle, address your comments and I'll talk about them below but I'd like to start clarifying that the windows build seems to fail because it does not find 'knitr' as vignette builder. I hope that despite this aparently windows-specific error we can move on discussing the changes I've made.

When I learned from you about the new SNPlocs container I also thought immediately that this could be an interesting option for GenomicScores, as long as it allows one to store data using not only double-precision values, but also integers and raw bytes. The latter is used by GenomicScores to attempt reducing the disk and memory requirements and improve access retrieval by quantizing real-valued scores into a smaller subset of values. I also thought that exploring how to use the new SNPlocs container will be something for the next devel cycle.

Going now through the issues you raised.

  1. Done.
  2. Done.
  3. Removed. This must have been a leftover of earlier attempts with the AnnotationHub.
  4. Fixed. This was an error on my side when adapting the .getSubDirs() function from AnnotationForge.
  5. Done.
  6. One of the reasons to create GenomicScores was to move outside VariantFiltering the code that supports annotation packages such as phastCons100way.UCSC.hg19. Once GenomicScores is accepted, I will remove all the related code including the generic of the scores() function and VariantFiltering will be using GenomicScores to access those annotations.
  7. Exactly, the GenomicScores class is a replacement for the PhastConsDb class, which will disappear once GenomicScores is accepted and updated versions of the corresponding annotation packages are in place.
  8. Indeed, this is a temporary situation, the example uses a PhastConsDb object because is loading a BioC 3.4 version of phastCons100way.UCSC.hg19, once the 3.5 version of this annotation package enters into the build, then there will not be PhastConsDb objects anymore. The 'dontrun' examples correspond to those accessing the AnnotationHub through the internet. My understanding is that examples and vignettes should build without accessing the internet, right?
  9. I have temporarily removed the qfun() and dqfun() accessors until the newer annotation packages become available. These newer annotation packages store metadata including the corresponding quantizer and dequantizer functions used to "code" and "decode" the stored values. The qfun() and dqfun() accessors will allow the user to see how values were quantized and are dequantized, so that the way in which information is stored is completely open to the interested user.
  10. Excellent idea! I've changed from GRanges to GenomicRanges and the argument name from gpos to ranges.
  11. Fixed. I have also removed the coercionFun argument since this is a left over from the older PhastConsDb class and not necessary anymore now that the data objects stored the dequantizer function in their own metadata.
  12. Done. The problem was that when the summary function is mean() then the Rle-Views and ViewMeans() API is used because it provides a enormous speedup (thanks to @lawremi for pointing this out to me time ago!), so the function tried to do something different depending on whether the summary function was the mean() or not, comparing it with the character string "mean". Now it uses identical(summaryfun, mean).
  13. Fixed, this was again a glitch originated within the move from PhastConsDb to GenomicScores. Now it works fine:
    scores(gsco, GRanges("chr7", IRanges(start=117232380:117232384, width=1)))
    [1] 0.8 0.8 1.0 1.0 1.0
    scores(gsco, GRanges("chr7", IRanges(start=117232380, width=5)), summaryFun=sum)
    [1] 4.6
    scores(gsco, GRanges("chr7", IRanges(start=117232380, width=5)), summaryFun=prod)
    [1] 0.64

    Let me add that this currently works only using the GScores object downloaded through the AnnotationHub with the function getGScores(), since the corresponding annotation packages, such as phastCons100way.UCSC.hg19 are not yet updated.

  14. Fixed.
  15. I want that the user is aware that the data is quantized and that this means that for certain questions, such as filtering genetic variants, this is fine but depending on what you want to answer with these data, the stored precision may not be sufficient. For this reason, I felt it was necessary to illustrate the differences between raw original scores and quantized ones at least for the use case of phastCons conservation scores, which I believe are widely used from this package.
  16. Right before the former Section 6, I've added a use case on annotating variants with phastCons conservation scores. I think it illustrates the advantage of having at your fingertips in R a set of data that originally hangs at UCSC as 4.5 Gb BigWig files and how easily the queried scores are added into the same container you used to search for them.

I've added the Infrastructure term to the biocViews field of the package, it's an excellent suggestion.

One question, @lawremi suggested above that the scores() method returns the scores as a metadata column within the input GRanges object. I initially thought that this would unnecessarily increase the memory requirements (this is relevant of course with long queries only, ~10s of thousands of positions), but I see that this adds coherence to the output. What do you think? Should this be a default behavior? Should there be an option to return only the numeric vector?

Thanks again for your help.

bioc-issue-bot commented 7 years ago

Received a valid push; starting a build. Commits are:

692370d Bump version, maybe windows builds fine on Fridays...

bioc-issue-bot commented 7 years ago

Dear Package contributor,

This is the automated single package builder at bioconductor.org.

Your package has been built on Linux, Mac, and Windows.

Congratulations! The package built without errors or warnings on all platforms.

Please see the following build report for more details:

http://bioconductor.org/spb_reports/GenomicScores_buildreport_20170331114136.html

hpages commented 7 years ago

Hi Robert, Thanks for all these changes. Yes returning the scores as a metadata column of the input GRanges object is a good idea. It makes the returned object self-descriptive. Most users would probably do this with the current scores() anyway i.e. they would do something like mcols(gr)$scores <- scores(gsco, gr). You could add an extra argument (e.g. naked.scores or scores.only) to let the user control this but I think it should be FALSE by default. Only the advanced users/developers who are concerned about memory usage would use it. Cheers, H.

bioc-issue-bot commented 7 years ago

Received a valid push; starting a build. Commits are:

8ae4553 scores() return now a GenomicRanges object with th...

bioc-issue-bot commented 7 years ago

Dear Package contributor,

This is the automated single package builder at bioconductor.org.

Your package has been built on Linux, Mac, and Windows.

Congratulations! The package built without errors or warnings on all platforms.

Please see the following build report for more details:

http://bioconductor.org/spb_reports/GenomicScores_buildreport_20170405103010.html

bioc-issue-bot commented 7 years ago

Received a valid push; starting a build. Commits are:

d160b54 Fixed the missing scores.only=FALSE argument in th...

bioc-issue-bot commented 7 years ago

Dear Package contributor,

This is the automated single package builder at bioconductor.org.

Your package has been built on Linux, Mac, and Windows.

Congratulations! The package built without errors or warnings on all platforms.

Please see the following build report for more details:

http://bioconductor.org/spb_reports/GenomicScores_buildreport_20170405112305.html

rcastelo commented 7 years ago

Hi Hervé,

I have pushed an update to make the scores() method returning by default a GRanges object with a scores metadata column, and add an option scores.only, which is set to FALSE by default but when set to TRUE, then it returns a numeric vector with the scores. Because the newer annotation phastCons100way.UCSC.hg19. package is not yet available, you will not see this in the vignette, but you can try the following to see it working in a R session:

library(GenomicScores)
gsco <- getGScores("phastCons100way.UCSC.hg19")
scores(gsco, GRanges(seqnames="chr7", IRanges(start=117232380, width=5)))
GRanges object with 1 range and 1 metadata column:
      seqnames                 ranges strand |    scores
         <Rle>              <IRanges>  <Rle> | <numeric>
  [1]     chr7 [117232380, 117232384]      * |      0.92
  -------
  seqinfo: 1 sequence from an unspecified genome; no seqlengths
scores(gsco, GRanges(seqnames="chr7", IRanges(start=117232380, width=5)), scores.only=TRUE)
[1] 0.92

cheers,

robert.

hpages commented 7 years ago

Excellent! Thanks. I'm marking the package as accepted. I realize there are still important things that need to happen (modify VariantFiltering, rebuild and propagate the phastCons annotations, etc...) and we are short in time before the release. My feeling is that these changes have the potential to disrupt the builds so please be extra careful when you work on them, especially since BioC 3.5 enters feature freeze next Friday (at least in theory). Thanks again for this new contribution! H.

rcastelo commented 7 years ago

hi, great! thanks again for your help. I've submitted a few changes in VariantFiltering and GenomicScores to ensure that this transition do not disrupt the builds. I'll try to be extra careful.