annotatr.data - Githubissues

rcavalcante commented 8 years ago

Update the following URL to point to the GitHub repository of the package you wish to submit to Bioconductor

Repository: https://github.com/rcavalcante/annotatr

REPOSITORY UPDATED to reflect software-only contribution.

Confirm the following by editing each check box to '[x]'

[x] I understand that by submitting my package to Bioconductor, the package source and all review commentary are visible to the general public.
[x] I have read the Bioconductor Package Submission instructions. My package is consistent with the Bioconductor Package Guidelines.
[x] My package addresses statistical or bioinformatic issues related to the analysis and comprehension of high throughput genomic data.
[x] I am committed to the long-term maintenance of my package. This includes monitoring the support site for issues that users may have, subscribing to the bioc-devel mailing list to stay aware of developments in the Bioconductor community, responding promptly to requests for updates from the Core team in response to changes in R or underlying software.

I am familiar with the essential aspects of Bioconductor software management, including:

[x] The 'devel' branch for new packages and features.
[x] The stable 'release' branch, made available every six months, for bug fixes.
[x] Bioconductor version control using Subversion (optionally via GitHub).

For help with submitting your package, please subscribe and post questions to the bioc-devel mailing list.

bioc-issue-bot commented 8 years ago

Hi @rcavalcante

Thanks for submitting your package. We are taking a quick look at it and you will hear back from us soon.

The DESCRIPTION file for this package is:

Package: annotatr.data
Title: Annotation data for use with the annotatr package
Version: 0.99.0
Date: 2016-08-17
Authors@R: c(
    person("Raymond G.", "Cavalcante", email = "rcavalca@umich.edu", role = c("aut", "cre")),
    person(c("Maureen A."), "Sartor", email = "sartorma@med.umich.edu", role = c("ths")))
Description: Given a set of genomic sites/regions (e.g. ChIP-seq peaks, CpGs, differentially methylated CpGs or regions, SNPs, etc.) it is often of interest to investigate the intersecting functional annotations. Such annotations include those relating to gene models (promoters, 5'UTRs, exons, introns, and 3'UTRs), CpGs (CpG islands, CpG shores, CpG shelves), the non-coding genome, and enhancers. The annotatr.data package provides the annotation data used in the annotatr package.
Depends:
    R (>= 3.3.0),
    GenomicRanges
Imports:
    GenomeInfoDb,
    IRanges,
    methods,
    optparse,
    S4Vectors
Suggests:
    BiocStyle,
    devtools,
    knitr,
    org.Hs.eg.db,
    org.Mm.eg.db,
    rmarkdown,
    roxygen2,
    testthat
VignetteBuilder: knitr
BugReports: https://www.github.com/rcavalcante/annotatr.data/issues
License: GPL-3
Encoding: UTF-8
LazyData: true
RoxygenNote: 5.0.1
biocViews: Software, Annotation, GenomeAnnotation, FunctionalGenomics, Visualization

bioc-issue-bot commented 8 years ago

Your package has been approved for building. Your package is now submitted to our queue.

IMPORTANT: Please read the instructions for setting up a push hook on your repository, or further changes to your repository will NOT trigger a new build.

bioc-issue-bot commented 8 years ago

Dear Package contributor,

This is the automated single package builder at bioconductor.org.

Your package has been built on Linux, Mac, and Windows.

On one or more platforms, the build results were: "WARNINGS, ERROR". This may mean there is a problem with the package that you need to fix. Or it may mean that there is a problem with the build system itself.

Please see the following build report for more details:

http://bioconductor.org/spb_reports/annotatr.data_buildreport_20160818105333.html

rcavalcante commented 8 years ago

Hello,

It looks like the ERROR on zin1 is the result of an error when checking for my membership on the bioc-devel mailing list. The error didn't occur on the other build servers.

...
* Checking for bioc-devel mailing list subscription...
Error in curl::curl_fetch_memory(url, handle = handle) : 
  SSL connect error
Calls: <Anonymous> ... request_fetch -> request_fetch.write_memory -> <Anonymous> -> .Call
Execution halted

Should I simply push a trivial change to see if the SSL connect error goes away?

Thanks.

bioc-issue-bot commented 8 years ago

Received a valid push; starting a build. Commits are:

bbb9aa5 Add supported_genomes() and supported_annotations(...

mtmorgan commented 8 years ago

On 08/18/2016 10:57 AM, Raymond Cavalcante wrote:

Hello,

It looks like the ERROR on zin1 is the result of an error when checking for my membership on the bioc-devel mailing list. The error didn't occur on the other build servers.

|... * Checking for bioc-devel mailing list subscription... Error in curl::curl_fetch_memory(url, handle = handle) : SSL connect error Calls:
... request_fetch -> request_fetch.write_memory -> -> .Call Execution halted | Should I simply push a trivial change to see if the SSL connect error goes away?

yes, I agree with the diagnosis. No need to re-push. Martin

Thanks.

— You are receiving this because you were assigned. Reply to this email directly, view it on GitHub https://github.com/Bioconductor/Contributions/issues/87#issuecomment-240749872, or mute the thread https://github.com/notifications/unsubscribe-auth/AAHPGHyUKYRiz4RmVJMFewnHx3zT8UR4ks5qhHLygaJpZM4Jnhh3.

This email message may contain legally privileged and/or confidential information. If you are not the intended recipient(s), or the employee or agent responsible for the delivery of this message to the intended recipient(s), you are hereby notified that any disclosure, copying, distribution, or use of this email message is prohibited. If you have received this message in error, please notify the sender immediately by e-mail and delete this email message from your computer. Thank you.

bioc-issue-bot commented 8 years ago

Dear Package contributor,

This is the automated single package builder at bioconductor.org.

Your package has been built on Linux, Mac, and Windows.

On one or more platforms, the build results were: "WARNINGS". This may mean there is a problem with the package that you need to fix. Or it may mean that there is a problem with the build system itself.

Please see the following build report for more details:

http://bioconductor.org/spb_reports/annotatr.data_buildreport_20160818130234.html

bioc-issue-bot commented 8 years ago

Received a valid push; starting a build. Commits are:

4d1ab29 Remove Encoding attribute in DESCRIPTION * Causin...

bioc-issue-bot commented 8 years ago

Dear Package contributor,

This is the automated single package builder at bioconductor.org.

Your package has been built on Linux, Mac, and Windows.

Congratulations! The package built without errors or warnings on all platforms.

Please see the following build report for more details:

http://bioconductor.org/spb_reports/annotatr.data_buildreport_20160818134857.html

bioc-issue-bot commented 8 years ago

Hi @rcavalcante,

Starting build on additional package https://github.com/rcavalcante/annotatr.

IMPORTANT: Please read the instructions for setting up a push hook on your repository, or further changes to your additional package repository will NOT trigger a new build.

The DESCRIPTION file of this additional package is:

Package: annotatr
Title: Annotation of Genomic Regions to Genomic Annotations
Version: 0.8.0
Date: 2016-06-28
Authors@R: c(
    person("Raymond G.", "Cavalcante", email = "rcavalca@umich.edu", role = c("aut", "cre")),
    person(c("Maureen A."), "Sartor", email = "sartorma@med.umich.edu", role = c("ths")))
Description: Given a set of genomic sites/regions (e.g. ChIP-seq peaks, CpGs, differentially methylated CpGs or regions, SNPs, etc.) it is often of interest to investigate the intersecting genomic annotations. Such annotations include those relating to gene models (promoters, 5'UTRs, exons, introns, and 3'UTRs), CpGs (CpG islands, CpG shores, CpG shelves), or regulatory sequences such as enhancers. The annotatr package provides an easy way to summarize and visualize the intersection of genomic sites/regions with genomic annotations.
Depends:
R (>= 3.2.0),
    GenomicRanges
Imports:
dplyr,
ggplot2,
    GenomeInfoDb,
    IRanges,
methods,
readr,
    regioneR,
reshape2,
    S4Vectors
Suggests:
BiocStyle,
devtools,
knitr,
rmarkdown,
roxygen2,
testthat
VignetteBuilder: knitr
BugReports: https://www.github.com/rcavalcante/annotatr/issues
License: GPL-3
LazyData: true
RoxygenNote: 5.0.1
biocViews: Software, Annotation, GenomeAnnotation, FunctionalGenomics, Visualization

bioc-issue-bot commented 8 years ago

Dear Package contributor,

This is the automated single package builder at bioconductor.org.

Your package has been built on Linux, Mac, and Windows.

On one or more platforms, the build results were: "skipped, WARNINGS, ERROR". This may mean there is a problem with the package that you need to fix. Or it may mean that there is a problem with the build system itself.

Please see the following build report for more details:

http://bioconductor.org/spb_reports/annotatr_buildreport_20160818142410.html

bioc-issue-bot commented 8 years ago

We only start builds when the Version field in the DESCRIPTION file is incremented. For example, by changing

Version: 0.99.0

to

Version 0.99.1

If you did not intend to start a build, you don't need to do anything. If you did want to start a build, increment the Version: field and try again.

bioc-issue-bot commented 8 years ago

Received a valid push; starting a build. Commits are:

5dfcad9 Increment version to trigger Bioc build * Despite...

bioc-issue-bot commented 8 years ago

Dear Package contributor,

This is the automated single package builder at bioconductor.org.

Your package has been built on Linux, Mac, and Windows.

On one or more platforms, the build results were: "WARNINGS". This may mean there is a problem with the package that you need to fix. Or it may mean that there is a problem with the build system itself.

Please see the following build report for more details:

http://bioconductor.org/spb_reports/annotatr_buildreport_20160818150457.html

bioc-issue-bot commented 8 years ago

Received a valid push; starting a build. Commits are:

9db543d Remove data/ and data-raw/ * The contents were mo...

bioc-issue-bot commented 8 years ago

Dear Package contributor,

This is the automated single package builder at bioconductor.org.

Your package has been built on Linux, Mac, and Windows.

On one or more platforms, the build results were: "WARNINGS". This may mean there is a problem with the package that you need to fix. Or it may mean that there is a problem with the build system itself.

Please see the following build report for more details:

http://bioconductor.org/spb_reports/annotatr_buildreport_20160818153852.html

bioc-issue-bot commented 8 years ago

Received a valid push; starting a build. Commits are:

8292e21 NAMESPACE alterations * Finally settled on which ...

bioc-issue-bot commented 8 years ago

Dear Package contributor,

This is the automated single package builder at bioconductor.org.

Your package has been built on Linux, Mac, and Windows.

Congratulations! The package built without errors or warnings on all platforms.

Please see the following build report for more details:

http://bioconductor.org/spb_reports/annotatr_buildreport_20160818170048.html

mtmorgan commented 8 years ago

Thanks for your package and timely work on address the automatically generated issues. I want to ask for some clarification / make some preliminary recommendations before embarking on the formal technical review.

I'm glad that you appreciate the value of GenomicRanges for structuring data and allowing interoperability between packages. I also appreciate from your naming convention and return types of data objects that you're a fan of the tibble. These approaches aren't incompatible, and I'd like for you to provide a convenient way for your users to incorporate your output into their own Bioconductor work flows. I'm thinking either returning a GenomicRanges object (note that the mcols() can contain a column which is itself a GRanges) or that can be easily coerced (via as(x, "GRanges'), which indirectly invokes makeGRangesFromDataFrame(), or via an appropriately named and prominently available bespoke function in your package) to the GenomicRanges infrastructure.

I took a brief look at some of your code and was disappointed to see another implementation of a bed file parser. It would be much better to re-used rtracklayer::import.bed(), and to address any limitations in that package, rather than creating yet another parser with it's own set of idiosyncratic strengths / limitations.

Finally, I didn't really understand what the annotatr.data package provides, or rather how the data overlaps with / is different from the data available in the TxDb.* or ensembldb.* packages or AnnotationHub ensembl resources. Again, it would be great to consolidate these data resources as much as possible, to avoid duplication, offer users access to as rich a collection of resources as possible, and minimize the collective maintenance burden.

rcavalcante commented 8 years ago

Hello,

Thanks for helping me move this process along with your initial comments. I will answer inline:

I'm glad that you appreciate the value of GenomicRanges for structuring data and allowing interoperability between packages. I also appreciate from your naming convention and return types of data objects that you're a fan of the tibble. These approaches aren't incompatible, and I'd like for you to provide a convenient way for your users to incorporate your output into their own Bioconductor work flows. I'm thinking either returning a GenomicRanges object (note that the mcols() can contain a column which is itself a GRanges) or that can be easily coerced (via as(x, "GRanges'), which indirectly invokes makeGRangesFromDataFrame(), or via an appropriately named and prominently available bespoke function in your package) to the GenomicRanges infrastructure.

I prioritized tibble output for use with the plot_*() family of functions that call ggplot2 functions. Part of the way we're marketing the package is the summarization plots that are relatively easy to get.

As I understand your comment, in addition to the wide tibble output from annotate_regions() that includes annotation information/locations, data locations, and gene information, you're suggesting a GenomicRanges object with the same information be output? Perhaps the user would decide when running annotate_regions() which format to output so as not to consume unneeded memory?

I took a brief look at some of your code and was disappointed to see another implementation of a bed file parser. It would be much better to re-used rtracklayer::import.bed(), and to address any limitations in that package, rather than creating yet another parser with it's own set of idiosyncratic strengths / limitations.

I am essentially reusing the readr::read_delim() function (with parameters in place) with an added layer of creating the GenomicRanges object. I looked into rtracklayer::import.bed() when I noticed it mentioned in another package submission thread. I decided not to pursue using it for the following reasons:

The user's data may not be in BED format (or its variants), depending on what previous programs are used to generate the genomic regions. For example, some DNA methylation analysis programs output files with only one location column because they are CpGs. It is not clear from the rtracklayer documentation how I could handle that case.
The data might be very large (25 million CpGs), and read.table is quite slow in such cases. In developing the package I settled on readr::read_delim() as a basis for the read function because it benchmarked to be much faster.
Related to the data being very large, I wanted users to be able to filter out unnecessary columns at read in using the col_types parameter. While rtracklayer::import.bed() has an extraCols parameter, the documentation implies that it is only to say what extra columns are, rather than avoid reading in particular columns.

Finally, I didn't really understand what the annotatr.data package provides, or rather how the data overlaps with / is different from the data available in the TxDb.* or ensembldb.* packages or AnnotationHub ensembl resources. Again, it would be great to consolidate these data resources as much as possible, to avoid duplication, offer users access to as rich a collection of resources as possible, and minimize the collective maintenance burden.

In brief, the basic annotations are the same, but annotatr.data provides additional annotations not in TxDb.* and more detailed annotations that are pre-computed to save runtime.

I had a look at the TxDb.* packages and noticed that they consist of an SQLite database and functions that I assume do some queries to give the results. Some of these extractions require computation for things such as UTRs. Is that correct?

The basic genic annotations (promoters, introns, exons, UTRs) are from the knownGene table from UCSC, so they are the same as TxDb.*. However, we pre-compute the following annotations that are not immediately in TxDb.* to save the user runtime: first introns, first exons, intron/exon boundaries, exons and introns occurring only in CDSs, and exons and introns occurring only in 5'UTRs and 3'UTRs.

In addition to genic annotations, annotatr.data includes CpG related annotations (island, shore, shelf, and interCGI), enhancer annotations, and intron/exon boundary annotations, which are not found in the TxDb.* family of packages.

mtmorgan commented 8 years ago

On 08/23/2016 09:46 PM, Raymond Cavalcante wrote:

Hello,

Thanks for helping me move this process along with your initial comments. I will answer inline:
I'm glad that you appreciate the value of GenomicRanges for
structuring data and allowing interoperability between packages. I
also appreciate from your naming convention and return types of data
objects that you're a fan of the tibble. These approaches aren't
incompatible, and I'd like for you to provide a convenient way for
your users to incorporate your output into their own Bioconductor
work flows. I'm thinking either returning a GenomicRanges object
(note that the mcols() can contain a column which is itself a
GRanges) or that can be easily coerced (via as(x, "GRanges'), which
indirectly invokes makeGRangesFromDataFrame(), or via an
appropriately named and prominently available bespoke function in
your package) to the GenomicRanges infrastructure.
I prioritized tibble output for use with the |plot_*()| family of functions that call |ggplot2| functions. Part of the way we're marketing the package is the summarization plots that are relatively easy to get.

As I understand your comment, in addition to the wide tibble output from |annotate_regions()| that includes annotation information/locations, data locations, and gene information, you're suggesting a GenomicRanges object with the same information be output? Perhaps the user would decide when running |annotate_regions()| which format to output so as not to consume unneeded memory?

yes, and I guess following the 'endomorphism' lesson of dplyr (return the same data structure as went in) the idea would be to return the original GRanges with additional mcols(), with 1:many or 1:0 mappings leading to *List columns, e.g., GRangesList().

I took a brief look at some of your code and was disappointed to see
another implementation of a bed file parser. It would be much better
to re-used rtracklayer::import.bed(), and to address any limitations
in that package, rather than creating yet another parser with it's
own set of idiosyncratic strengths / limitations.
I am essentially reusing the |readr::read_delim()| function (with parameters in place) with an added layer of creating the |GenomicRanges| object. I looked into |rtracklayer::import.bed()| when I noticed it mentioned in another package submission thread. I decided not to pursue using it for the following reasons:

The user's data may not be in BED format (or its variants), depending on what previous programs are used to generate the genomic regions. For example, some DNA methylation analysis programs output files with only one location column because they are CpGs. It is not clear from the |rtracklayer| documentation how I could handle that case.

I was thinking primarily of BED input

The data might be very large (25 million CpGs), and |read.table| is quite slow in such cases. In developing the package I settled on |readr::read_delim()| as a basis for the read function because it benchmarked to be much faster.

modern versions of rtracklayer::import.bed are implemented in C; I haven't done a benchmark with readr, but I would be surprised if it wasn't competitive. In some ways this is an example of the benefit of a single import function -- users complain about something (speed in this case) and it gets fixed for everyone's benefit and not just for those using package X.

Related to the data being very large, I wanted users to be able to filter out unnecessary columns at read in using the |col_types| parameter. While |rtracklayer::import.bed()| has an |extraCols| parameter, the documentation implies that it is only to say what extra columns are, rather than avoid reading in particular columns.

with readr I guess you're reading the data then making a copy to GRanges, so I'm not sure whether this is really an issue?

Finally, I didn't really understand what the annotatr.data package
provides, or rather how the data overlaps with / is different from
the data available in the TxDb.* or ensembldb.* packages or
AnnotationHub ensembl resources. Again, it would be great to
consolidate these data resources as much as possible, to avoid
duplication, offer users access to as rich a collection of resources
as possible, and minimize the collective maintenance burden.
In brief, the basic annotations are the same, but |annotatr.data| provides additional annotations not in |TxDb.*| and more detailed annotations that are pre-computed to save runtime.

I had a look at the |TxDb.*| packages and noticed that they consist of an |SQLite| database and functions that I assume do some queries to give the results. Some of these extractions require computation for things such as UTRs. Is that correct?

The basic genic annotations (promoters, introns, exons, UTRs) are from the |knownGene| table from UCSC, so they are the same as |TxDb.|. However, we pre-compute the following annotations that are not immediately in |TxDb.| to save the user runtime: first introns, first exons, intron/exon boundaries, exons and introns occurring only in CDSs, and exons and introns occurring only in 5'UTRs and 3'UTRs.

I'm really concerned that these speed optimizations come at a big price in terms of flexibility and duplication of effort. Here are the first exons

first_by_tx <- function(txdb) { exbytx = exonsBy(txdb, "tx") unlist(exbytx)[unlist(exbytx)$exon_rank == 1] }

system.time(first_by_tx(TxDb.Hsapiens.UCSC.hg19.knownGene)) user system elapsed 1.464 0.008 1.469

much of the time is in the data base query

system.time(exbytx <- exonsBy(TxDb.Hsapiens.UCSC.hg19.knownGene, "tx")) user system elapsed 1.324 0.004 1.329 system.time(ans <- unlist(exbytx)[unlist(exbytx)$exon_rank == 1]) user system elapsed 0.072 0.012 0.083

which can be shared across calculations, e.g., first and last.

In addition to genic annotations, |annotatr.data| includes CpG related annotations (island, shore, shelf, and interCGI), enhancer annotations, and intron/exon boundary annotations, which are not found in the |TxDb.*| family of packages.

yep, the new annotations are potentially interesting (are intron / exon boundaries easily derived from the exon coordinates in TxDb), and I'm all for making these available.

My preference, actually, would be to add these as resources in AnnotationHub, and actually

library(AnnotationHub) hub = AnnotationHub() query(hub, c("CpG Island", "hg19")) AnnotationHub with 1 record

snapshotDate(): 2016-08-15

names(): AH5086

$dataprovider: UCSC

$species: Homo sapiens

$rdataclass: GRanges

$title: CpG Islands

$description: GRanges object from UCSC track 'CpG Islands'

$taxonomyid: 9606

$genome: hg19

$sourcetype: UCSC track

$sourceurl:

rtracklayer://hgdownload.cse.ucsc.edu/goldenpath/hg19/database...

$sourcelastmodifieddate: NA

$sourcesize: NA

$tags: c("cpgIslandExt", "UCSC", "track", "Gene", "Transcript",

"Annotation")

retrieve record with 'object[["AH5086"]]'

hub[["AH5086"]] # first time downloads to local, persistent, cache

system.time(cpg <- hub[["AH5086"]]) loading from cache '/home/mtmorgan/.AnnotationHub/5086' user system elapsed 0.044 0.000 0.262 cpg GRanges object with 28691 ranges and 1 metadata column: seqnames ranges strand | name
| [1] chr1 [ 28736, 29810] \* | CpG:_116 [2] chr1 [135125, 135563] \* | CpG:_30 [3] chr1 [327791, 328229] \* | CpG:_29 [4] chr1 [437152, 438164] \* | CpG:_84 [5] chr1 [449274, 450544] \* | CpG:_99 ... ... ... ... . ... [28687] chr9_gl000201_random [ 15651, 15909] \* | CpG:_30 [28688] chr9_gl000201_random [ 26397, 26873] \* | CpG:_43 [28689] chr11_gl000202_random [ 16284, 16540] \* | CpG:_23 [28690] chr17_gl000204_random [ 54686, 57368] \* | CpG:_228 [28691] chr17_gl000205_random [117501, 117801] \* | CpG:_23 --- seqinfo: 93 sequences (1 circular) from hg19 genome

but in full disclosure

query(hub, c("CpG Island", "hg38")) AnnotationHub with 0 records

snapshotDate(): 2016-08-15

which just means that the hub should be updated, so all can benefit. The process for adding new resources to AnnotationHub is at http://bioconductor.org/packages/devel/bioc/vignettes/AnnotationHubData/inst/doc/IntroductionToAnnotationHubData.html though in general we are more than happy to help this along.

Martin

— You are receiving this because you were assigned. Reply to this email directly, view it on GitHub https://github.com/Bioconductor/Contributions/issues/87#issuecomment-241934869, or mute the thread https://github.com/notifications/unsubscribe-auth/AAHPGHvaeO2uTiIwh7x1RrnCNCGI8KIPks5qi6J7gaJpZM4Jnhh3.

This email message may contain legally privileged and/or confidential information. If you are not the intended recipient(s), or the employee or agent responsible for the delivery of this message to the intended recipient(s), you are hereby notified that any disclosure, copying, distribution, or use of this email message is prohibited. If you have received this message in error, please notify the sender immediately by e-mail and delete this email message from your computer. Thank you.

mtmorgan commented 8 years ago

I guess I was mis-remembering import.bed and implementation in C, sorry about that.

rcavalcante commented 8 years ago

Hi Martin,

I wanted to clarify our thread into an actionable list. Let me know what you think of the following:

Since rtracklayer::import.bed() doesn't cover all the file formats annotatr needs to be able to handle (this already occurs for us in a pipeline we're developing), is it acceptable to continue using annotatr::read_regions() and not use rtracklayer::import.bed()?
With respect to annotatr::annotate_regions(), I can play around with implementing GenomicRanges output for that function. However, tibble output is still needed for the plot_*() and summarize_*() families of functions. It is a bit awkward to have the user apply a function to cast the GenomicRanges object to tbl_df in order to use the plot_*() and summarize_*(), but that is one solution. Is this an acceptable compromise?
With respect to TxDb.*, something I didn't mention before is I collapsed the annotations by genomic location and type, so that, for example, an exon that belongs to 20 isoforms (with identical start and end) appears only once in the exon annotation, but has 20 comma-delimited tx_ids in the correct field. Representing the annotations in this way saves time at the intersection step by reducing the search space, and reduces the size of the output reporting the annotations. Perhaps that ends up being negligible in reality. I'd have to do some tests to see how that worked out.
Also with respect to TxDb.*, what I'm hearing is that I should calculate the annotations that we want on the fly, with TxDb.*, when a user wants to do annotation. Do I understand correctly? As far as new annotations like intron/exon boundaries, are you thinking I would write a function to add to GenomicFeatures similar to fiveUTRsByTranscript() that queried the underlying SQLite database of TxDb.* and constructed the appropriate annotation? We had wanted to differentiate between intron-into-exon and exon-into-intron, just as an aside. Would doing something similar work for the CpG shores, shelves, and interCGI space? And where would I put other annotations that perhaps don't work in TxDb.*, like the enhancers? I'm concerned that without an annotatr.data package, we would go over the Bioconductor package size limit.

Again, thanks for your thoughts, and helping me through my first Bioconductor submission.

-Raymond

mtmorgan commented 8 years ago

yep, ok
I'd add a argument as=c("tbl", "GRanges") and in the body of the code use match.arg(); it'll default to tbl. The final line of annotate_region would coerce to GRanges (via a helper function of some kind, I guess) if appropriate.

Something like?

gr = GRanges(rep(c("chr1", "chr2"), 5), IRanges(1, 5), id=c(1:5, 11:15))
ugr = unique(granges(gr))
mcols(ugr)$id = splitAsList(gr$id, match(gr, ugr))

Yes, calculate on the fly. Write the functions in your pkg, and but if generally useful then yes, they could be ported (and welcomed) to GenomicFeatures. The non-TxDb annotations could stay in annotatr.data, or made available via AnnotationHub.

bioc-issue-bot commented 8 years ago

We only start builds when the Version field in the DESCRIPTION file is incremented. For example, by changing

Version: 0.99.0

to

Version 0.99.1

If you did not intend to start a build, you don't need to do anything. If you did want to start a build, increment the Version: field and try again.

bioc-issue-bot commented 8 years ago

We only start builds when the Version field in the DESCRIPTION file is incremented. For example, by changing

Version: 0.99.0

to

Version 0.99.1

If you did not intend to start a build, you don't need to do anything. If you did want to start a build, increment the Version: field and try again.

mtmorgan commented 8 years ago

I was waiting for a version bump and successful build of your package before reviewing...

bioc-issue-bot commented 8 years ago

Received a valid push; starting a build. Commits are:

82f6f41 Version bump + reduce R dependency

bioc-issue-bot commented 8 years ago

Received a valid push; starting a build. Commits are:

bd0feee Version bump + reduce R dependency

rcavalcante commented 8 years ago

Hi Martin,

I bumped the version numbers and changed the R dependency to at least 3.2.0. Since the start of semester I haven't had the time to complete the changes you've suggested. I've especially been having difficulty with the TxDb.* databases and understanding how those changes will work.

As for adding hg38 CpG islands and gaps (and the same for mm10), which are going to be required for us, it's not clear to me from the documentation whether these objects need a recipe? Or if I need to create a database object for them?

Thanks, Raymond

bioc-issue-bot commented 8 years ago

Dear Package contributor,

This is the automated single package builder at bioconductor.org.

Your package has been built on Linux, Mac, and Windows.

On one or more platforms, the build results were: "WARNINGS". This may mean there is a problem with the package that you need to fix. Or it may mean that there is a problem with the build system itself.

Please see the following build report for more details:

http://bioconductor.org/spb_reports/annotatr_buildreport_20160916095503.html

bioc-issue-bot commented 8 years ago

Dear Package contributor,

This is the automated single package builder at bioconductor.org.

Your package has been built on Linux, Mac, and Windows.

On one or more platforms, the build results were: "WARNINGS". This may mean there is a problem with the package that you need to fix. Or it may mean that there is a problem with the build system itself.

Please see the following build report for more details:

http://bioconductor.org/spb_reports/annotatr.data_buildreport_20160916095745.html

rcavalcante commented 8 years ago

It appears that these warnings relate to my allowing R 3.2.0. Is there a particular reason for this, or is that a general recommendation?

mtmorgan commented 8 years ago

From the Bioconductor perspective, your package will be introduced and tested in the 'devel' branch, and then migrated to the release branch in October. The current devel / release branches are tested against R-3.3.1, so it is 'misleading advertising' and more-or-less irrelevant to say that your package works with R-3.2.0.

For the annotation resources, I think that you should find the corresponding hg19 resources in AnnotationHub, and verify that your code would work with comparable hg38 data. Then let us know precisely what the hg19 AnnotationHub resources are, so that we (bioconductor personell) can add the hg38 analogs. I think that if you specify this, we can review your package and work to add the appropriate annotation resources in parallel.

bioc-issue-bot commented 8 years ago

We only start builds when the Version field in the DESCRIPTION file is incremented. For example, by changing

Version: 0.99.0

to

Version 0.99.1

If you did not intend to start a build, you don't need to do anything. If you did want to start a build, increment the Version: field and try again.

bioc-issue-bot commented 8 years ago

We only start builds when the Version field in the DESCRIPTION file is incremented. For example, by changing

Version: 0.99.0

to

Version 0.99.1

If you did not intend to start a build, you don't need to do anything. If you did want to start a build, increment the Version: field and try again.

bioc-issue-bot commented 8 years ago

Received a valid push; starting a build. Commits are:

8bad13e Forgot to version bump

bioc-issue-bot commented 8 years ago

Received a valid push; starting a build. Commits are:

ae03d6c Forgot to version bump

rcavalcante commented 8 years ago

I will go through the hg19 resources and indicate their analogues for hg38, thank you for helping me with that in parallel.

I have changed the dependency back to 3.3.0. I understand that from Bioconductor's perspective the build system is for the current devel version, and so they cannot verify backwards compatibility.

bioc-issue-bot commented 8 years ago

Dear Package contributor,

This is the automated single package builder at bioconductor.org.

Your package has been built on Linux, Mac, and Windows.

On one or more platforms, the build results were: "WARNINGS, ERROR". This may mean there is a problem with the package that you need to fix. Or it may mean that there is a problem with the build system itself.

Please see the following build report for more details:

http://bioconductor.org/spb_reports/annotatr_buildreport_20160916114912.html

bioc-issue-bot commented 8 years ago

Dear Package contributor,

This is the automated single package builder at bioconductor.org.

Your package has been built on Linux, Mac, and Windows.

Congratulations! The package built without errors or warnings on all platforms.

Please see the following build report for more details:

http://bioconductor.org/spb_reports/annotatr.data_buildreport_20160916115129.html

rcavalcante commented 8 years ago

It looks like annotatr.data is fine, but for annotatr, it looks like the ERROR on zin1 is the mailing list again:

* Checking for bioc-devel mailing list subscription...
Error in curl::curl_fetch_memory(url, handle = handle) : 
  Server returned nothing (no headers, no data)
Calls: <Anonymous> ... request_fetch -> request_fetch.write_memory -> <Anonymous> -> .Call

As for the moscato WARNING, there was this problem, which never occurred before, and I've only changed dependency versions, which makes it a rather odd WARNING.

Error in loadNamespace(i, c(lib.loc, .libPaths()), versionCheck = vI[[i]]) : 
  there is no package called 'optparse'
Error: package 'annotatr.data' could not be loaded
Execution halted

lawremi commented 8 years ago

A little late to the party, but referring to the discussion two weeks ago,

On the topic of BED parsing. I'm concerned that you are supporting arbitrary tabular data. The lack of a formal definition leads to problems like not knowing whether the ranges are 0 or 1-based. Are we talking about the BEDX+Y formats, or something more arbitrary? If scalability is a concern, we can address that, but I'd like to see some profiling first. In the long run, it might pay off to stick with the standard Bioconductor BED parser, because you never know when someone will decide to rewrite it in C.
I'm not sure why you would require a tbl_df for passing to ggplot2 functions. All you need is a fortify() method to support more complex data structures with ggplot2.
Calling exons() on TxDb already yields the output you describe, i.e., unique regions that have a collapsed tx_id column. I'm concerned that you mention CSV storage of the transcript IDs. Hopefully that only pertains to the on-disk representation.

rcavalcante commented 8 years ago

Hello,

With respect to BED parsing, as I've noted above, and Martin seemed to acquiesce, not all files giving genomic regions are BED or BED-like. In particular, bismark outputs non-BED files. This package was partly developed to annotate that data without requiring a user to reformat output. As for 0- or 1-based regions, the annotations are all 0-based as they come from the UCSC genome browser, and I consider it the user's responsibility to know what they were doing as far as that goes.

With respect to your second item, I guess I thought of it as a way to check that the user comes to the plot_*() functions with the result of annotate_regions(). I don't doubt that perhaps that's not the best thing to do. Since I used dplyr::bind_rows() instead of base::Reduce(rbind, ...) (which appeared to be much faster in the tests I did), the tbl_df was sort of coerced. I'm not familiar with the concept of fortify() as it relates to ggplot2, so that would explain why I didn't go that route.

The annotations are stored in data/ and are compressed RData files. I have raw data in data-raw/ because I was under the impression that including it was good practice... Also, data-raw/ is in .Rbuildignore, so it isn't included in the prepared package. As for exons(), Martin illustrated what you described above.

I welcome your suggestions and comments, but I do ask that you consider the fact that our backgrounds are likely different, and that I will not always be aware of practices that you consider to be common knowledge.

Thanks.

lawremi commented 8 years ago

Ok, I took the time to look at the code, and it seems that it would be inconvenient for the user to parse a standard BED file, because the function requires the format to be specified via col_names. Maybe I missed something, but is there a convenience function for a standard format like BED or GFF? If not, you could add one, and maybe it could call rtracklayer for the standard formats, at least eventually.
It's really important to use the standard Bioconductor data structures for the sake of interoperability. That means not just returning data structures like GRanges but also accepting them as input. Btw, do you mean do.call(rbind, list) instead of Reduce? Using Reduce would be ridiculously slow for long inputs. I do eventually want to fix the slowness of R connections. I encourage you to look into fortify() as that's critical for interoperating between formal data structures and ggplot2.
What I meant was that exons() already does what you want, without using the code provided by Martin.

rcavalcante commented 8 years ago

[1]. Since the function in rtracklayer doesn't support the ENCODE flavor of BED files (narrowPeak, broadPeak, etc), it seems a bit cumbersome to read those files with rtracklayer (colClasses needs to be employed, similar to what I did with read_regions()). Is there any plan to support the ENCODE flavor of BEDs? This would be really convenient and I bet a lot of users would be happy about it. When you say convenience function for BED, GFF, etc. do you mean revert to the rtracklayer function, or do something like extension detection and then autofill the col_names and col_types in read_regions()? I suppose either is fine for me, I just want to clarify that point.

[2]. I didn't realize Reduce was always slow for long inputs, so my mistake. From your comment, it seems that you would prefer GRanges be accepted as input to the plot_*() functions? In essence, the fortify() method would need to flatten out the GRanges for use in ggplot2 functions that are in the plot_*() functions? That would seem to imply that annotate_regions() would by default output a GRanges object, and if a user wanted to write it, I'd need to spend the time flattening it into a table for them there too. As Martin and I discussed above, it would seem that giving the user the option of GRanges or tbl_df as outputs for annotate_regions() would be the easiest route. As a side note, when looking into ?fortify I get the following, which makes me wonder about the correct way forward:

Rather than using this function, I now recomend using the ‘broom’
     package, which implements a much wider range of methods. ‘fortify’
     may be deprecated in the future.

[3]. I'm working on the TxDb.* transition, and may ask questions about things I don't understand about GRangesLists later.

lawremi commented 8 years ago

rtracklayer does support all of those ENCODE formats (what I call BEDX+Y) via the extraCols argument. I don't think it supports specifying the data type, so there is some inefficiency in the type inference, but that might not matter.
Nice to see that there is some convergence on the broom framework. There is already a biobroom package that might do what you want already. Also, you should consider supporting alternative workflows, where the GRanges or whatever is created some way other than annotate_regions().

rcavalcante commented 8 years ago

Could you give an example of reading a narrowPeak file? I just downloaded a random narrowPeak from ENCODE and got:

# head ENCFF949LLJENCFF036YYM.raw.srt.filt.srt.nodup.PE2SE.regionPeak
chr19   49091824    49092345    .   0   .   889.660167307221    -1  3.78440330172555    289
chr19   3184015 3184611 .   0   .   776.503399491842    -1  3.78440330172555    258
chr22   19710615    19711165    .   0   .   747.097041388992    -1  3.78440330172555    331
chr19   54605508    54606001    .   0   .   719.503478674196    -1  3.78440330172555    288

class_names = c('name','score','strand','signalValue','pValue','qValue','peak')
class_type = c('character','integer','character','numerical','numerical','numerical','integer')
names(class_type) = class_names
#> class_type
#      name       score      strand signalValue      pValue      qValue 
#"character"   "integer" "character" "numerical" "numerical" "numerical" 
#       peak 
#  "integer" 
test = import.bed('ENCFF949LLJENCFF036YYM.raw.srt.filt.srt.nodup.PE2SE.regionPeak', extraCols = class_type)
#Error in methods::as(data[[i]], colClasses[i]) : 
#  no method or default for coercing “character” to “numerical”

Did I construct the extraCols parameter correctly? I assumed extra refers to after column 3. Perhaps in a future release an example could be given in the documentation for the use of extraCols.

rcavalcante commented 8 years ago

I thought perhaps extra refers to after column 6, but I got the same error.

rcavalcante commented 8 years ago

Continuing with this point, I wanted to put the two pieces of code side-by-side for comparison.

Here is rtracklayer:

class_type = c('numerical','numerical','numerical','numerical')
class_names = c('signalValue','pValue','qValue','peak')
names(class_type) = class_names
test = import.bed('ENCFF949LLJENCFF036YYM.raw.srt.filt.srt.nodup.PE2SE.regionPeak', 
    genome='hg19',
    extraCols = class_type)

And here is annotatr:

col_names = c('chr','start','end','name','score','strand','signalValue','pValue','qValue','peak')
col_types = 'ciicicdddi'
test2 = read_regions('ENCFF949LLJENCFF036YYM.raw.srt.filt.srt.nodup.PE2SE.regionPeak',
    genome = 'hg19',
    col_names = col_names,
    col_types = col_types)

To my eye, this is essentially the same approach, but there are two points that I would like to make.

I get the aforementioned error in import.bed().
For a regular BED3 file, import.bed() with genome specified is around 10x slower than read_regions(). When I do not include genome in import.bed(), it is about the same and slightly faster than read_regions().

library(microbenchmark)
library(rtracklayer)
library(annotatr)

read_rtrack_genome = function(){
    test = import.bed('test.bed', genome='hg19')
}

read_rtrack_nogenome = function(){
    test = import.bed('test.bed')
}

read_annotatr = function(){
    test = read_regions(file = 'test.bed', genome='hg19', stranded = FALSE, col_names = c('chr','start','end'))
}

test_read = microbenchmark(read_rtrack_genome(), read_rtrack_nogenome(), read_annotatr(), times = 20)

# Unit: milliseconds
#                    expr        min         lq      mean    median        uq
#    read_rtrack_genome() 11162.2130 11957.1516 12209.441 12067.947 12372.063
#  read_rtrack_nogenome()   870.9019   959.1449  1078.028  1124.781  1194.777
#         read_annotatr()  1107.8230  1151.1748  1354.751  1206.989  1467.281
#        max neval cld
#  14257.982    20   b
#   1233.394    20  a 
#   2259.304    20  a

While I appreciate the notion of reusing code, in this particular case, I do not see the advantage of using rtracklayer::import.bed() over what I have implemented based on readr.

In the thread above, @mtmorgan had agreed that this part of the package would not need to be changed, so I ask for your agreement in order to move the submission forward.

I am actively working on changing the basis for annotations to the TxDb.* group of packages, and expect to push those changes within the next couple days.

Bioconductor / Contributions

annotatr.data #87

snapshotDate(): 2016-08-15

names(): AH5086

$dataprovider: UCSC

$species: Homo sapiens

$rdataclass: GRanges

$title: CpG Islands

$description: GRanges object from UCSC track 'CpG Islands'

$taxonomyid: 9606

$genome: hg19

$sourcetype: UCSC track

$sourceurl:

$sourcelastmodifieddate: NA

$sourcesize: NA

$tags: c("cpgIslandExt", "UCSC", "track", "Gene", "Transcript",

"Annotation")

retrieve record with 'object[["AH5086"]]'

hub[["AH5086"]] # first time downloads to local, persistent, cache

snapshotDate(): 2016-08-15