Closed rcavalcante closed 8 years ago
Hi @rcavalcante
Thanks for submitting your package. We are taking a quick look at it and you will hear back from us soon.
The DESCRIPTION file for this package is:
Package: annotatr.data
Title: Annotation data for use with the annotatr package
Version: 0.99.0
Date: 2016-08-17
Authors@R: c(
person("Raymond G.", "Cavalcante", email = "rcavalca@umich.edu", role = c("aut", "cre")),
person(c("Maureen A."), "Sartor", email = "sartorma@med.umich.edu", role = c("ths")))
Description: Given a set of genomic sites/regions (e.g. ChIP-seq peaks, CpGs, differentially methylated CpGs or regions, SNPs, etc.) it is often of interest to investigate the intersecting functional annotations. Such annotations include those relating to gene models (promoters, 5'UTRs, exons, introns, and 3'UTRs), CpGs (CpG islands, CpG shores, CpG shelves), the non-coding genome, and enhancers. The annotatr.data package provides the annotation data used in the annotatr package.
Depends:
R (>= 3.3.0),
GenomicRanges
Imports:
GenomeInfoDb,
IRanges,
methods,
optparse,
S4Vectors
Suggests:
BiocStyle,
devtools,
knitr,
org.Hs.eg.db,
org.Mm.eg.db,
rmarkdown,
roxygen2,
testthat
VignetteBuilder: knitr
BugReports: https://www.github.com/rcavalcante/annotatr.data/issues
License: GPL-3
Encoding: UTF-8
LazyData: true
RoxygenNote: 5.0.1
biocViews: Software, Annotation, GenomeAnnotation, FunctionalGenomics, Visualization
Your package has been approved for building. Your package is now submitted to our queue.
IMPORTANT: Please read the instructions for setting up a push hook on your repository, or further changes to your repository will NOT trigger a new build.
Dear Package contributor,
This is the automated single package builder at bioconductor.org.
Your package has been built on Linux, Mac, and Windows.
On one or more platforms, the build results were: "WARNINGS, ERROR". This may mean there is a problem with the package that you need to fix. Or it may mean that there is a problem with the build system itself.
Please see the following build report for more details:
http://bioconductor.org/spb_reports/annotatr.data_buildreport_20160818105333.html
Hello,
It looks like the ERROR on zin1 is the result of an error when checking for my membership on the bioc-devel mailing list. The error didn't occur on the other build servers.
...
* Checking for bioc-devel mailing list subscription...
Error in curl::curl_fetch_memory(url, handle = handle) :
SSL connect error
Calls: <Anonymous> ... request_fetch -> request_fetch.write_memory -> <Anonymous> -> .Call
Execution halted
Should I simply push a trivial change to see if the SSL connect error goes away?
Thanks.
Received a valid push; starting a build. Commits are:
bbb9aa5 Add supported_genomes() and supported_annotations(...
On 08/18/2016 10:57 AM, Raymond Cavalcante wrote:
Hello,
It looks like the ERROR on zin1 is the result of an error when checking for my membership on the bioc-devel mailing list. The error didn't occur on the other build servers.
|... * Checking for bioc-devel mailing list subscription... Error in curl::curl_fetch_memory(url, handle = handle) : SSL connect error Calls:
... request_fetch -> request_fetch.write_memory -> -> .Call Execution halted | Should I simply push a trivial change to see if the SSL connect error goes away?
yes, I agree with the diagnosis. No need to re-push. Martin
Thanks.
— You are receiving this because you were assigned. Reply to this email directly, view it on GitHub https://github.com/Bioconductor/Contributions/issues/87#issuecomment-240749872, or mute the thread https://github.com/notifications/unsubscribe-auth/AAHPGHyUKYRiz4RmVJMFewnHx3zT8UR4ks5qhHLygaJpZM4Jnhh3.
This email message may contain legally privileged and/or confidential information. If you are not the intended recipient(s), or the employee or agent responsible for the delivery of this message to the intended recipient(s), you are hereby notified that any disclosure, copying, distribution, or use of this email message is prohibited. If you have received this message in error, please notify the sender immediately by e-mail and delete this email message from your computer. Thank you.
Dear Package contributor,
This is the automated single package builder at bioconductor.org.
Your package has been built on Linux, Mac, and Windows.
On one or more platforms, the build results were: "WARNINGS". This may mean there is a problem with the package that you need to fix. Or it may mean that there is a problem with the build system itself.
Please see the following build report for more details:
http://bioconductor.org/spb_reports/annotatr.data_buildreport_20160818130234.html
Received a valid push; starting a build. Commits are:
4d1ab29 Remove Encoding attribute in DESCRIPTION * Causin...
Dear Package contributor,
This is the automated single package builder at bioconductor.org.
Your package has been built on Linux, Mac, and Windows.
Congratulations! The package built without errors or warnings on all platforms.
Please see the following build report for more details:
http://bioconductor.org/spb_reports/annotatr.data_buildreport_20160818134857.html
Hi @rcavalcante,
Starting build on additional package https://github.com/rcavalcante/annotatr.
IMPORTANT: Please read the instructions for setting up a push hook on your repository, or further changes to your additional package repository will NOT trigger a new build.
The DESCRIPTION file of this additional package is:
Package: annotatr
Title: Annotation of Genomic Regions to Genomic Annotations
Version: 0.8.0
Date: 2016-06-28
Authors@R: c(
person("Raymond G.", "Cavalcante", email = "rcavalca@umich.edu", role = c("aut", "cre")),
person(c("Maureen A."), "Sartor", email = "sartorma@med.umich.edu", role = c("ths")))
Description: Given a set of genomic sites/regions (e.g. ChIP-seq peaks, CpGs, differentially methylated CpGs or regions, SNPs, etc.) it is often of interest to investigate the intersecting genomic annotations. Such annotations include those relating to gene models (promoters, 5'UTRs, exons, introns, and 3'UTRs), CpGs (CpG islands, CpG shores, CpG shelves), or regulatory sequences such as enhancers. The annotatr package provides an easy way to summarize and visualize the intersection of genomic sites/regions with genomic annotations.
Depends:
R (>= 3.2.0),
GenomicRanges
Imports:
dplyr,
ggplot2,
GenomeInfoDb,
IRanges,
methods,
readr,
regioneR,
reshape2,
S4Vectors
Suggests:
BiocStyle,
devtools,
knitr,
rmarkdown,
roxygen2,
testthat
VignetteBuilder: knitr
BugReports: https://www.github.com/rcavalcante/annotatr/issues
License: GPL-3
LazyData: true
RoxygenNote: 5.0.1
biocViews: Software, Annotation, GenomeAnnotation, FunctionalGenomics, Visualization
Dear Package contributor,
This is the automated single package builder at bioconductor.org.
Your package has been built on Linux, Mac, and Windows.
On one or more platforms, the build results were: "skipped, WARNINGS, ERROR". This may mean there is a problem with the package that you need to fix. Or it may mean that there is a problem with the build system itself.
Please see the following build report for more details:
http://bioconductor.org/spb_reports/annotatr_buildreport_20160818142410.html
We only start builds when the Version
field in the DESCRIPTION
file is incremented. For example, by changing
Version: 0.99.0
to
Version 0.99.1
If you did not intend to start a build, you don't need to
do anything. If you did want to start a build, increment
the Version:
field and try again.
Received a valid push; starting a build. Commits are:
5dfcad9 Increment version to trigger Bioc build * Despite...
Dear Package contributor,
This is the automated single package builder at bioconductor.org.
Your package has been built on Linux, Mac, and Windows.
On one or more platforms, the build results were: "WARNINGS". This may mean there is a problem with the package that you need to fix. Or it may mean that there is a problem with the build system itself.
Please see the following build report for more details:
http://bioconductor.org/spb_reports/annotatr_buildreport_20160818150457.html
Received a valid push; starting a build. Commits are:
9db543d Remove data/ and data-raw/ * The contents were mo...
Dear Package contributor,
This is the automated single package builder at bioconductor.org.
Your package has been built on Linux, Mac, and Windows.
On one or more platforms, the build results were: "WARNINGS". This may mean there is a problem with the package that you need to fix. Or it may mean that there is a problem with the build system itself.
Please see the following build report for more details:
http://bioconductor.org/spb_reports/annotatr_buildreport_20160818153852.html
Received a valid push; starting a build. Commits are:
8292e21 NAMESPACE alterations * Finally settled on which ...
Dear Package contributor,
This is the automated single package builder at bioconductor.org.
Your package has been built on Linux, Mac, and Windows.
Congratulations! The package built without errors or warnings on all platforms.
Please see the following build report for more details:
http://bioconductor.org/spb_reports/annotatr_buildreport_20160818170048.html
Thanks for your package and timely work on address the automatically generated issues. I want to ask for some clarification / make some preliminary recommendations before embarking on the formal technical review.
I'm glad that you appreciate the value of GenomicRanges for structuring data and allowing interoperability between packages. I also appreciate from your naming convention and return types of data objects that you're a fan of the tibble. These approaches aren't incompatible, and I'd like for you to provide a convenient way for your users to incorporate your output into their own Bioconductor work flows. I'm thinking either returning a GenomicRanges object (note that the mcols()
can contain a column which is itself a GRanges) or that can be easily coerced (via as(x, "GRanges')
, which indirectly invokes makeGRangesFromDataFrame()
, or via an appropriately named and prominently available bespoke function in your package) to the GenomicRanges infrastructure.
I took a brief look at some of your code and was disappointed to see another implementation of a bed file parser. It would be much better to re-used rtracklayer::import.bed()
, and to address any limitations in that package, rather than creating yet another parser with it's own set of idiosyncratic strengths / limitations.
Finally, I didn't really understand what the annotatr.data package provides, or rather how the data overlaps with / is different from the data available in the TxDb.* or ensembldb.* packages or AnnotationHub ensembl resources. Again, it would be great to consolidate these data resources as much as possible, to avoid duplication, offer users access to as rich a collection of resources as possible, and minimize the collective maintenance burden.
Hello,
Thanks for helping me move this process along with your initial comments. I will answer inline:
I'm glad that you appreciate the value of GenomicRanges for structuring data and allowing interoperability between packages. I also appreciate from your naming convention and return types of data objects that you're a fan of the tibble. These approaches aren't incompatible, and I'd like for you to provide a convenient way for your users to incorporate your output into their own Bioconductor work flows. I'm thinking either returning a GenomicRanges object (note that the mcols() can contain a column which is itself a GRanges) or that can be easily coerced (via as(x, "GRanges'), which indirectly invokes makeGRangesFromDataFrame(), or via an appropriately named and prominently available bespoke function in your package) to the GenomicRanges infrastructure.
I prioritized tibble output for use with the plot_*()
family of functions that call ggplot2
functions. Part of the way we're marketing the package is the summarization plots that are relatively easy to get.
As I understand your comment, in addition to the wide tibble output from annotate_regions()
that includes annotation information/locations, data locations, and gene information, you're suggesting a GenomicRanges object with the same information be output? Perhaps the user would decide when running annotate_regions()
which format to output so as not to consume unneeded memory?
I took a brief look at some of your code and was disappointed to see another implementation of a bed file parser. It would be much better to re-used rtracklayer::import.bed(), and to address any limitations in that package, rather than creating yet another parser with it's own set of idiosyncratic strengths / limitations.
I am essentially reusing the readr::read_delim()
function (with parameters in place) with an added layer of creating the GenomicRanges
object. I looked into rtracklayer::import.bed()
when I noticed it mentioned in another package submission thread. I decided not to pursue using it for the following reasons:
rtracklayer
documentation how I could handle that case.read.table
is quite slow in such cases. In developing the package I settled on readr::read_delim()
as a basis for the read function because it benchmarked to be much faster.col_types
parameter. While rtracklayer::import.bed()
has an extraCols
parameter, the documentation implies that it is only to say what extra columns are, rather than avoid reading in particular columns.Finally, I didn't really understand what the annotatr.data package provides, or rather how the data overlaps with / is different from the data available in the TxDb.* or ensembldb.* packages or AnnotationHub ensembl resources. Again, it would be great to consolidate these data resources as much as possible, to avoid duplication, offer users access to as rich a collection of resources as possible, and minimize the collective maintenance burden.
In brief, the basic annotations are the same, but annotatr.data
provides additional annotations not in TxDb.*
and more detailed annotations that are pre-computed to save runtime.
I had a look at the TxDb.*
packages and noticed that they consist of an SQLite
database and functions that I assume do some queries to give the results. Some of these extractions require computation for things such as UTRs. Is that correct?
The basic genic annotations (promoters, introns, exons, UTRs) are from the knownGene
table from UCSC, so they are the same as TxDb.*
. However, we pre-compute the following annotations that are not immediately in TxDb.*
to save the user runtime: first introns, first exons, intron/exon boundaries, exons and introns occurring only in CDSs, and exons and introns occurring only in 5'UTRs and 3'UTRs.
In addition to genic annotations, annotatr.data
includes CpG related annotations (island, shore, shelf, and interCGI), enhancer annotations, and intron/exon boundary annotations, which are not found in the TxDb.*
family of packages.
On 08/23/2016 09:46 PM, Raymond Cavalcante wrote:
Hello,
Thanks for helping me move this process along with your initial comments. I will answer inline:
I'm glad that you appreciate the value of GenomicRanges for structuring data and allowing interoperability between packages. I also appreciate from your naming convention and return types of data objects that you're a fan of the tibble. These approaches aren't incompatible, and I'd like for you to provide a convenient way for your users to incorporate your output into their own Bioconductor work flows. I'm thinking either returning a GenomicRanges object (note that the mcols() can contain a column which is itself a GRanges) or that can be easily coerced (via as(x, "GRanges'), which indirectly invokes makeGRangesFromDataFrame(), or via an appropriately named and prominently available bespoke function in your package) to the GenomicRanges infrastructure.
I prioritized tibble output for use with the |plot_*()| family of functions that call |ggplot2| functions. Part of the way we're marketing the package is the summarization plots that are relatively easy to get.
As I understand your comment, in addition to the wide tibble output from |annotate_regions()| that includes annotation information/locations, data locations, and gene information, you're suggesting a GenomicRanges object with the same information be output? Perhaps the user would decide when running |annotate_regions()| which format to output so as not to consume unneeded memory?
yes, and I guess following the 'endomorphism' lesson of dplyr (return the same data structure as went in) the idea would be to return the original GRanges with additional mcols(), with 1:many or 1:0 mappings leading to *List columns, e.g., GRangesList().
I took a brief look at some of your code and was disappointed to see another implementation of a bed file parser. It would be much better to re-used rtracklayer::import.bed(), and to address any limitations in that package, rather than creating yet another parser with it's own set of idiosyncratic strengths / limitations.
I am essentially reusing the |readr::read_delim()| function (with parameters in place) with an added layer of creating the |GenomicRanges| object. I looked into |rtracklayer::import.bed()| when I noticed it mentioned in another package submission thread. I decided not to pursue using it for the following reasons:
- The user's data may not be in BED format (or its variants), depending on what previous programs are used to generate the genomic regions. For example, some DNA methylation analysis programs output files with only one location column because they are CpGs. It is not clear from the |rtracklayer| documentation how I could handle that case.
I was thinking primarily of BED input
- The data might be very large (25 million CpGs), and |read.table| is quite slow in such cases. In developing the package I settled on |readr::read_delim()| as a basis for the read function because it benchmarked to be much faster.
modern versions of rtracklayer::import.bed are implemented in C; I haven't done a benchmark with readr, but I would be surprised if it wasn't competitive. In some ways this is an example of the benefit of a single import function -- users complain about something (speed in this case) and it gets fixed for everyone's benefit and not just for those using package X.
- Related to the data being very large, I wanted users to be able to filter out unnecessary columns at read in using the |col_types| parameter. While |rtracklayer::import.bed()| has an |extraCols| parameter, the documentation implies that it is only to say what extra columns are, rather than avoid reading in particular columns.
with readr I guess you're reading the data then making a copy to GRanges, so I'm not sure whether this is really an issue?
Finally, I didn't really understand what the annotatr.data package provides, or rather how the data overlaps with / is different from the data available in the TxDb.* or ensembldb.* packages or AnnotationHub ensembl resources. Again, it would be great to consolidate these data resources as much as possible, to avoid duplication, offer users access to as rich a collection of resources as possible, and minimize the collective maintenance burden.
In brief, the basic annotations are the same, but |annotatr.data| provides additional annotations not in |TxDb.*| and more detailed annotations that are pre-computed to save runtime.
I had a look at the |TxDb.*| packages and noticed that they consist of an |SQLite| database and functions that I assume do some queries to give the results. Some of these extractions require computation for things such as UTRs. Is that correct?
The basic genic annotations (promoters, introns, exons, UTRs) are from the |knownGene| table from UCSC, so they are the same as |TxDb.|. However, we pre-compute the following annotations that are not immediately in |TxDb.| to save the user runtime: first introns, first exons, intron/exon boundaries, exons and introns occurring only in CDSs, and exons and introns occurring only in 5'UTRs and 3'UTRs.
I'm really concerned that these speed optimizations come at a big price in terms of flexibility and duplication of effort. Here are the first exons
first_by_tx <- function(txdb) { exbytx = exonsBy(txdb, "tx") unlist(exbytx)[unlist(exbytx)$exon_rank == 1] }
system.time(first_by_tx(TxDb.Hsapiens.UCSC.hg19.knownGene)) user system elapsed 1.464 0.008 1.469
much of the time is in the data base query
system.time(exbytx <- exonsBy(TxDb.Hsapiens.UCSC.hg19.knownGene, "tx")) user system elapsed 1.324 0.004 1.329 system.time(ans <- unlist(exbytx)[unlist(exbytx)$exon_rank == 1]) user system elapsed 0.072 0.012 0.083
which can be shared across calculations, e.g., first and last.
In addition to genic annotations, |annotatr.data| includes CpG related annotations (island, shore, shelf, and interCGI), enhancer annotations, and intron/exon boundary annotations, which are not found in the |TxDb.*| family of packages.
yep, the new annotations are potentially interesting (are intron / exon boundaries easily derived from the exon coordinates in TxDb), and I'm all for making these available.
My preference, actually, would be to add these as resources in AnnotationHub, and actually
library(AnnotationHub) hub = AnnotationHub() query(hub, c("CpG Island", "hg19")) AnnotationHub with 1 record
snapshotDate(): 2016-08-15
names(): AH5086
$dataprovider: UCSC
$species: Homo sapiens
$rdataclass: GRanges
$title: CpG Islands
$description: GRanges object from UCSC track 'CpG Islands'
$taxonomyid: 9606
$genome: hg19
$sourcetype: UCSC track
$sourceurl:
rtracklayer://hgdownload.cse.ucsc.edu/goldenpath/hg19/database...
$sourcelastmodifieddate: NA
$sourcesize: NA
$tags: c("cpgIslandExt", "UCSC", "track", "Gene", "Transcript",
"Annotation")
retrieve record with 'object[["AH5086"]]'
hub[["AH5086"]] # first time downloads to local, persistent, cache
system.time(cpg <- hub[["AH5086"]]) loading from cache '/home/mtmorgan/.AnnotationHub/5086' user system elapsed 0.044 0.000 0.262 cpg GRanges object with 28691 ranges and 1 metadata column: seqnames ranges strand | name
| [1] chr1 [ 28736, 29810] \* | CpG:_116 [2] chr1 [135125, 135563] \* | CpG:_30 [3] chr1 [327791, 328229] \* | CpG:_29 [4] chr1 [437152, 438164] \* | CpG:_84 [5] chr1 [449274, 450544] \* | CpG:_99 ... ... ... ... . ... [28687] chr9_gl000201_random [ 15651, 15909] \* | CpG:_30 [28688] chr9_gl000201_random [ 26397, 26873] \* | CpG:_43 [28689] chr11_gl000202_random [ 16284, 16540] \* | CpG:_23 [28690] chr17_gl000204_random [ 54686, 57368] \* | CpG:_228 [28691] chr17_gl000205_random [117501, 117801] \* | CpG:_23 --- seqinfo: 93 sequences (1 circular) from hg19 genome
but in full disclosure
query(hub, c("CpG Island", "hg38")) AnnotationHub with 0 records
snapshotDate(): 2016-08-15
which just means that the hub should be updated, so all can benefit. The process for adding new resources to AnnotationHub is at http://bioconductor.org/packages/devel/bioc/vignettes/AnnotationHubData/inst/doc/IntroductionToAnnotationHubData.html though in general we are more than happy to help this along.
Martin
— You are receiving this because you were assigned. Reply to this email directly, view it on GitHub https://github.com/Bioconductor/Contributions/issues/87#issuecomment-241934869, or mute the thread https://github.com/notifications/unsubscribe-auth/AAHPGHvaeO2uTiIwh7x1RrnCNCGI8KIPks5qi6J7gaJpZM4Jnhh3.
This email message may contain legally privileged and/or confidential information. If you are not the intended recipient(s), or the employee or agent responsible for the delivery of this message to the intended recipient(s), you are hereby notified that any disclosure, copying, distribution, or use of this email message is prohibited. If you have received this message in error, please notify the sender immediately by e-mail and delete this email message from your computer. Thank you.
I guess I was mis-remembering import.bed and implementation in C, sorry about that.
Hi Martin,
I wanted to clarify our thread into an actionable list. Let me know what you think of the following:
rtracklayer::import.bed()
doesn't cover all the file formats annotatr
needs to be able to handle (this already occurs for us in a pipeline we're developing), is it acceptable to continue using annotatr::read_regions()
and not use rtracklayer::import.bed()
?annotatr::annotate_regions()
, I can play around with implementing GenomicRanges
output for that function. However, tibble output is still needed for the plot_*()
and summarize_*()
families of functions. It is a bit awkward to have the user apply a function to cast the GenomicRanges
object to tbl_df
in order to use the plot_*()
and summarize_*()
, but that is one solution. Is this an acceptable compromise?TxDb.*
, something I didn't mention before is I collapsed the annotations by genomic location and type, so that, for example, an exon that belongs to 20 isoforms (with identical start and end) appears only once in the exon annotation, but has 20 comma-delimited tx_id
s in the correct field. Representing the annotations in this way saves time at the intersection step by reducing the search space, and reduces the size of the output reporting the annotations. Perhaps that ends up being negligible in reality. I'd have to do some tests to see how that worked out.TxDb.*
, what I'm hearing is that I should calculate the annotations that we want on the fly, with TxDb.*
, when a user wants to do annotation. Do I understand correctly? As far as new annotations like intron/exon boundaries, are you thinking I would write a function to add to GenomicFeatures
similar to fiveUTRsByTranscript()
that queried the underlying SQLite database of TxDb.*
and constructed the appropriate annotation? We had wanted to differentiate between intron-into-exon and exon-into-intron, just as an aside. Would doing something similar work for the CpG shores, shelves, and interCGI space? And where would I put other annotations that perhaps don't work in TxDb.*
, like the enhancers? I'm concerned that without an annotatr.data
package, we would go over the Bioconductor package size limit.Again, thanks for your thoughts, and helping me through my first Bioconductor submission.
-Raymond
as=c("tbl", "GRanges")
and in the body of the code use match.arg()
; it'll default to tbl. The final line of annotate_region would coerce to GRanges (via a helper function of some kind, I guess) if appropriate.Something like?
gr = GRanges(rep(c("chr1", "chr2"), 5), IRanges(1, 5), id=c(1:5, 11:15))
ugr = unique(granges(gr))
mcols(ugr)$id = splitAsList(gr$id, match(gr, ugr))
We only start builds when the Version
field in the DESCRIPTION
file is incremented. For example, by changing
Version: 0.99.0
to
Version 0.99.1
If you did not intend to start a build, you don't need to
do anything. If you did want to start a build, increment
the Version:
field and try again.
We only start builds when the Version
field in the DESCRIPTION
file is incremented. For example, by changing
Version: 0.99.0
to
Version 0.99.1
If you did not intend to start a build, you don't need to
do anything. If you did want to start a build, increment
the Version:
field and try again.
I was waiting for a version bump and successful build of your package before reviewing...
Received a valid push; starting a build. Commits are:
82f6f41 Version bump + reduce R dependency
Received a valid push; starting a build. Commits are:
bd0feee Version bump + reduce R dependency
Hi Martin,
I bumped the version numbers and changed the R dependency to at least 3.2.0. Since the start of semester I haven't had the time to complete the changes you've suggested. I've especially been having difficulty with the TxDb.* databases and understanding how those changes will work.
As for adding hg38 CpG islands and gaps (and the same for mm10), which are going to be required for us, it's not clear to me from the documentation whether these objects need a recipe? Or if I need to create a database object for them?
Thanks, Raymond
Dear Package contributor,
This is the automated single package builder at bioconductor.org.
Your package has been built on Linux, Mac, and Windows.
On one or more platforms, the build results were: "WARNINGS". This may mean there is a problem with the package that you need to fix. Or it may mean that there is a problem with the build system itself.
Please see the following build report for more details:
http://bioconductor.org/spb_reports/annotatr_buildreport_20160916095503.html
Dear Package contributor,
This is the automated single package builder at bioconductor.org.
Your package has been built on Linux, Mac, and Windows.
On one or more platforms, the build results were: "WARNINGS". This may mean there is a problem with the package that you need to fix. Or it may mean that there is a problem with the build system itself.
Please see the following build report for more details:
http://bioconductor.org/spb_reports/annotatr.data_buildreport_20160916095745.html
It appears that these warnings relate to my allowing R 3.2.0. Is there a particular reason for this, or is that a general recommendation?
From the Bioconductor perspective, your package will be introduced and tested in the 'devel' branch, and then migrated to the release branch in October. The current devel / release branches are tested against R-3.3.1, so it is 'misleading advertising' and more-or-less irrelevant to say that your package works with R-3.2.0.
For the annotation resources, I think that you should find the corresponding hg19 resources in AnnotationHub, and verify that your code would work with comparable hg38 data. Then let us know precisely what the hg19 AnnotationHub resources are, so that we (bioconductor personell) can add the hg38 analogs. I think that if you specify this, we can review your package and work to add the appropriate annotation resources in parallel.
We only start builds when the Version
field in the DESCRIPTION
file is incremented. For example, by changing
Version: 0.99.0
to
Version 0.99.1
If you did not intend to start a build, you don't need to
do anything. If you did want to start a build, increment
the Version:
field and try again.
We only start builds when the Version
field in the DESCRIPTION
file is incremented. For example, by changing
Version: 0.99.0
to
Version 0.99.1
If you did not intend to start a build, you don't need to
do anything. If you did want to start a build, increment
the Version:
field and try again.
Received a valid push; starting a build. Commits are:
8bad13e Forgot to version bump
Received a valid push; starting a build. Commits are:
ae03d6c Forgot to version bump
I will go through the hg19 resources and indicate their analogues for hg38, thank you for helping me with that in parallel.
I have changed the dependency back to 3.3.0. I understand that from Bioconductor's perspective the build system is for the current devel version, and so they cannot verify backwards compatibility.
Dear Package contributor,
This is the automated single package builder at bioconductor.org.
Your package has been built on Linux, Mac, and Windows.
On one or more platforms, the build results were: "WARNINGS, ERROR". This may mean there is a problem with the package that you need to fix. Or it may mean that there is a problem with the build system itself.
Please see the following build report for more details:
http://bioconductor.org/spb_reports/annotatr_buildreport_20160916114912.html
Dear Package contributor,
This is the automated single package builder at bioconductor.org.
Your package has been built on Linux, Mac, and Windows.
Congratulations! The package built without errors or warnings on all platforms.
Please see the following build report for more details:
http://bioconductor.org/spb_reports/annotatr.data_buildreport_20160916115129.html
It looks like annotatr.data is fine, but for annotatr, it looks like the ERROR on zin1 is the mailing list again:
* Checking for bioc-devel mailing list subscription...
Error in curl::curl_fetch_memory(url, handle = handle) :
Server returned nothing (no headers, no data)
Calls: <Anonymous> ... request_fetch -> request_fetch.write_memory -> <Anonymous> -> .Call
As for the moscato WARNING, there was this problem, which never occurred before, and I've only changed dependency versions, which makes it a rather odd WARNING.
Error in loadNamespace(i, c(lib.loc, .libPaths()), versionCheck = vI[[i]]) :
there is no package called 'optparse'
Error: package 'annotatr.data' could not be loaded
Execution halted
A little late to the party, but referring to the discussion two weeks ago,
fortify()
method to support more complex data structures with ggplot2.exons()
on TxDb already yields the output you describe, i.e., unique regions that have a collapsed tx_id
column. I'm concerned that you mention CSV storage of the transcript IDs. Hopefully that only pertains to the on-disk representation.Hello,
With respect to BED parsing, as I've noted above, and Martin seemed to acquiesce, not all files giving genomic regions are BED or BED-like. In particular, bismark
outputs non-BED files. This package was partly developed to annotate that data without requiring a user to reformat output. As for 0- or 1-based regions, the annotations are all 0-based as they come from the UCSC genome browser, and I consider it the user's responsibility to know what they were doing as far as that goes.
With respect to your second item, I guess I thought of it as a way to check that the user comes to the plot_*()
functions with the result of annotate_regions()
. I don't doubt that perhaps that's not the best thing to do. Since I used dplyr::bind_rows()
instead of base::Reduce(rbind, ...)
(which appeared to be much faster in the tests I did), the tbl_df
was sort of coerced. I'm not familiar with the concept of fortify()
as it relates to ggplot2
, so that would explain why I didn't go that route.
The annotations are stored in data/
and are compressed RData
files. I have raw data in data-raw/
because I was under the impression that including it was good practice... Also, data-raw/
is in .Rbuildignore
, so it isn't included in the prepared package. As for exons()
, Martin illustrated what you described above.
I welcome your suggestions and comments, but I do ask that you consider the fact that our backgrounds are likely different, and that I will not always be aware of practices that you consider to be common knowledge.
Thanks.
col_names
. Maybe I missed something, but is there a convenience function for a standard format like BED or GFF? If not, you could add one, and maybe it could call rtracklayer for the standard formats, at least eventually.GRanges
but also accepting them as input. Btw, do you mean do.call(rbind, list)
instead of Reduce
? Using Reduce
would be ridiculously slow for long inputs. I do eventually want to fix the slowness of R connections. I encourage you to look into fortify()
as that's critical for interoperating between formal data structures and ggplot2.exons()
already does what you want, without using the code provided by Martin.[1]. Since the function in rtracklayer
doesn't support the ENCODE flavor of BED files (narrowPeak, broadPeak, etc), it seems a bit cumbersome to read those files with rtracklayer
(colClasses
needs to be employed, similar to what I did with read_regions()
). Is there any plan to support the ENCODE flavor of BEDs? This would be really convenient and I bet a lot of users would be happy about it. When you say convenience function for BED, GFF, etc. do you mean revert to the rtracklayer
function, or do something like extension detection and then autofill the col_names
and col_types
in read_regions()
? I suppose either is fine for me, I just want to clarify that point.
[2]. I didn't realize Reduce
was always slow for long inputs, so my mistake. From your comment, it seems that you would prefer GRanges
be accepted as input to the plot_*()
functions? In essence, the fortify()
method would need to flatten out the GRanges
for use in ggplot2
functions that are in the plot_*()
functions? That would seem to imply that annotate_regions()
would by default output a GRanges
object, and if a user wanted to write it, I'd need to spend the time flattening it into a table for them there too. As Martin and I discussed above, it would seem that giving the user the option of GRanges
or tbl_df
as outputs for annotate_regions()
would be the easiest route. As a side note, when looking into ?fortify
I get the following, which makes me wonder about the correct way forward:
Rather than using this function, I now recomend using the ‘broom’
package, which implements a much wider range of methods. ‘fortify’
may be deprecated in the future.
[3]. I'm working on the TxDb.*
transition, and may ask questions about things I don't understand about GRangesList
s later.
extraCols
argument. I don't think it supports specifying the data type, so there is some inefficiency in the type inference, but that might not matter.annotate_regions()
.Could you give an example of reading a narrowPeak file? I just downloaded a random narrowPeak from ENCODE and got:
# head ENCFF949LLJENCFF036YYM.raw.srt.filt.srt.nodup.PE2SE.regionPeak
chr19 49091824 49092345 . 0 . 889.660167307221 -1 3.78440330172555 289
chr19 3184015 3184611 . 0 . 776.503399491842 -1 3.78440330172555 258
chr22 19710615 19711165 . 0 . 747.097041388992 -1 3.78440330172555 331
chr19 54605508 54606001 . 0 . 719.503478674196 -1 3.78440330172555 288
class_names = c('name','score','strand','signalValue','pValue','qValue','peak')
class_type = c('character','integer','character','numerical','numerical','numerical','integer')
names(class_type) = class_names
#> class_type
# name score strand signalValue pValue qValue
#"character" "integer" "character" "numerical" "numerical" "numerical"
# peak
# "integer"
test = import.bed('ENCFF949LLJENCFF036YYM.raw.srt.filt.srt.nodup.PE2SE.regionPeak', extraCols = class_type)
#Error in methods::as(data[[i]], colClasses[i]) :
# no method or default for coercing “character” to “numerical”
Did I construct the extraCols
parameter correctly? I assumed extra refers to after column 3. Perhaps in a future release an example could be given in the documentation for the use of extraCols
.
I thought perhaps extra refers to after column 6, but I got the same error.
Continuing with this point, I wanted to put the two pieces of code side-by-side for comparison.
Here is rtracklayer
:
class_type = c('numerical','numerical','numerical','numerical')
class_names = c('signalValue','pValue','qValue','peak')
names(class_type) = class_names
test = import.bed('ENCFF949LLJENCFF036YYM.raw.srt.filt.srt.nodup.PE2SE.regionPeak',
genome='hg19',
extraCols = class_type)
And here is annotatr
:
col_names = c('chr','start','end','name','score','strand','signalValue','pValue','qValue','peak')
col_types = 'ciicicdddi'
test2 = read_regions('ENCFF949LLJENCFF036YYM.raw.srt.filt.srt.nodup.PE2SE.regionPeak',
genome = 'hg19',
col_names = col_names,
col_types = col_types)
To my eye, this is essentially the same approach, but there are two points that I would like to make.
import.bed()
. import.bed()
with genome
specified is around 10x slower than read_regions()
. When I do not include genome
in import.bed()
, it is about the same and slightly faster than read_regions()
.library(microbenchmark)
library(rtracklayer)
library(annotatr)
read_rtrack_genome = function(){
test = import.bed('test.bed', genome='hg19')
}
read_rtrack_nogenome = function(){
test = import.bed('test.bed')
}
read_annotatr = function(){
test = read_regions(file = 'test.bed', genome='hg19', stranded = FALSE, col_names = c('chr','start','end'))
}
test_read = microbenchmark(read_rtrack_genome(), read_rtrack_nogenome(), read_annotatr(), times = 20)
# Unit: milliseconds
# expr min lq mean median uq
# read_rtrack_genome() 11162.2130 11957.1516 12209.441 12067.947 12372.063
# read_rtrack_nogenome() 870.9019 959.1449 1078.028 1124.781 1194.777
# read_annotatr() 1107.8230 1151.1748 1354.751 1206.989 1467.281
# max neval cld
# 14257.982 20 b
# 1233.394 20 a
# 2259.304 20 a
While I appreciate the notion of reusing code, in this particular case, I do not see the advantage of using rtracklayer::import.bed()
over what I have implemented based on readr
.
In the thread above, @mtmorgan had agreed that this part of the package would not need to be changed, so I ask for your agreement in order to move the submission forward.
I am actively working on changing the basis for annotations to the TxDb.*
group of packages, and expect to push those changes within the next couple days.
Update the following URL to point to the GitHub repository of the package you wish to submit to Bioconductor
REPOSITORY UPDATED to reflect software-only contribution.
Confirm the following by editing each check box to '[x]'
I am familiar with the essential aspects of Bioconductor software management, including:
For help with submitting your package, please subscribe and post questions to the bioc-devel mailing list.