brainflowprobes package submission

aprice26 commented 5 years ago

Update the following URL to point to the GitHub repository of the package you wish to submit to Bioconductor

Repository: https://github.com/LieberInstitute/brainflowprobes

Confirm the following by editing each check box to '[x]'

[x] I understand that by submitting my package to Bioconductor, the package source and all review commentary are visible to the general public.
[x] I have read the Bioconductor Package Submission instructions. My package is consistent with the Bioconductor Package Guidelines.
[x] I understand that a minimum requirement for package acceptance is to pass R CMD check and R CMD BiocCheck with no ERROR or WARNINGS. Passing these checks does not result in automatic acceptance. The package will then undergo a formal review and recommendations for acceptance regarding other Bioconductor standards will be addressed.
[x] My package addresses statistical or bioinformatic issues related to the analysis and comprehension of high throughput genomic data.
[x] I am committed to the long-term maintenance of my package. This includes monitoring the support site for issues that users may have, subscribing to the bioc-devel mailing list to stay aware of developments in the Bioconductor community, responding promptly to requests for updates from the Core team in response to changes in R or underlying software.

I am familiar with the essential aspects of Bioconductor software management, including:

[x] The 'devel' branch for new packages and features.
[x] The stable 'release' branch, made available every six months, for bug fixes.
[x] Bioconductor version control using Git (optionally via GitHub).

For help with submitting your package, please subscribe and post questions to the bioc-devel mailing list.

(next line added by mtmorgan during package acceptance)

AdditionalPackage: https://github.com/LieberInstitute/GenomicState

bioc-issue-bot commented 5 years ago

Hi @aprice26

Thanks for submitting your package. We are taking a quick look at it and you will hear back from us soon.

The DESCRIPTION file for this package is:

Package: brainflowprobes
Type: Package
Title: Plots and annotation for choosing BrainFlow target probe sequence 
Version: 0.99.0
Authors@R: c(person("Amanda", "Price", email = "amanda.joy.price@gmail.com", 
  role = c("aut", "cre"), comment = c(ORCID = "0000-0001-7352-3732")),
  person("Leonardo", "Collado-Torres", role = c("ctb"), 
  email = "lcolladotor@gmail.com", comment = c(ORCID = "0000-0003-2140-308X")))
Description: Use these functions to characterize genomic regions for
  BrainFlow target probe design.
License: Artistic-2.0
Encoding: UTF-8
LazyData: true
Depends:
    R (>= 3.6.0)
Imports:
    Biostrings (>= 2.52.0),
    BSgenome.Hsapiens.UCSC.hg19 (>= 1.4.0),
    bumphunter (>= 1.26.0),
    cowplot (>= 1.0.0),
    derfinder (>= 1.18.1),
    derfinderPlot (>= 1.18.1),
    GenomicRanges (>= 1.36.0),
    ggplot2 (>= 3.1.1),
    RColorBrewer (>= 1.1),
    utils,
    grDevices
RoxygenNote: 6.1.1
Suggests: 
    BiocStyle,
    knitcitations,
    knitr,
    rmarkdown,
    sessioninfo,
    testthat (>= 2.1.0)
VignetteBuilder: knitr
URL: https://github.com/LieberInstitute/brainflowprobes
BugReports: https://support.bioconductor.org/t/brainflowprobes/
biocViews: Coverage, Visualization, ExperimentalDesign, Transcriptomics, 
    FlowCytometry, GeneTarget

Add SSH keys to your GitHub account. SSH keys are used to control access to accepted Bioconductor packages. See these instructions to add SSH keys to your GitHub account.

aprice26 commented 5 years ago

Dear Bioconductor package reviewer,

The brainflowprobes package passes R CMD check and R CMD BiocCheck except for a warning and an error related to the size of a file (and hence the package) as shown here. That is, the only failures come from the 5 MB size limit. The package functions rely on two objects that take about 40 and 200-500 seconds to re-make from scratch, which is why they are currently included in the data/ directory. One of them is 13 MB on disk, bringing the total installed package size to ~20 MB and triggering a warning and an error in BiocCheck.

We communicated this to Lori Shepherd who recommended we submit the package as-is and proceed with the review process.

Best, Amanda

(cc @lcolladotor)

bioc-issue-bot commented 5 years ago

A reviewer has been assigned to your package. Learn what to expect during the review process.

IMPORTANT: Please read the instructions for setting up a push hook on your repository, or further changes to your repository will NOT trigger a new build.

bioc-issue-bot commented 5 years ago

Dear Package contributor,

This is the automated single package builder at bioconductor.org.

Your package has been built on Linux, Mac, and Windows.

On one or more platforms, the build results were: "ERROR". This may mean there is a problem with the package that you need to fix. Or it may mean that there is a problem with the build system itself.

Please see the build report for more details.

bioc-issue-bot commented 5 years ago

Received a valid push; starting a build. Commits are:

ecd54f5 v0.99.2 -- testing the BioC SBP webhook that I jus...

bioc-issue-bot commented 5 years ago

Dear Package contributor,

This is the automated single package builder at bioconductor.org.

Your package has been built on Linux, Mac, and Windows.

On one or more platforms, the build results were: "ERROR". This may mean there is a problem with the package that you need to fix. Or it may mean that there is a problem with the build system itself.

Please see the build report for more details.

Liubuntu commented 5 years ago

Hi @aprice26 ,

Thanks for the updates! I'll proceed with the initial review. Let's see if we can clear the error or other issues when we look inside the package.

Best, Qian

Liubuntu commented 5 years ago

Hi @aprice26 ,

Please see below for the initial review of your package. Seek to address all or most of the issues and comment back here with any questions / updates, and when you are ready for a 2nd look.

Cheers, Qian

DESCRIPTION

BugReports: It's usually recommended to use the issue tracker on github for reporting issues.

NAMESPACE

Looks good!

R/

The elements of four_panels_example_cov are all lists with only one element each (a matrix), which contradicts the documentation describing them as matrices. Please explain whether this is intended, and make the documentation consistent.

> sapply(four_panels_example_cov, class)
   Sep    Deg   Cell   Sort
"list" "list" "list" "list"
> sapply(four_panels_example_cov, length)
 Sep  Deg Cell Sort
   1    1    1    1
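
If the one-element lists turn out to be unintended, unwrapping them could look roughly like the sketch below. The object `cov_example` is a made-up stand-in with the same shape as the output above, not the package data.

```r
# Hypothetical stand-in: each element is a list holding a single matrix
cov_example <- list(
    Sep = list(matrix(1:4, nrow = 2)),
    Deg = list(matrix(5:8, nrow = 2))
)

# Pull the matrix out of each one-element list
cov_flat <- lapply(cov_example, function(el) el[[1]])

# Each element should now be a matrix rather than a list
vapply(cov_flat, is.matrix, logical(1))
```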

four_panels.R:

L55. Use if(JUNCTIONS) instead to avoid the redundant comparison!
L58:64. The calculations seem to assume that "COVERAGE" is a list of lists. See the question above about the intended values.
L55:99. Highly duplicated code. The main difference is the calculation of regionCov when JUNCTIONS is TRUE, so do only that inside the if condition; the else statement is not needed.
The function definition is pretty long. Consider writing utility functions for single purposes and calling them inside the exported functions. NOTE that internal utility functions are usually prefixed with a dot (.). Refer here.

plot_coverage.R

L82:115 largely duplicate code in four_panels.R. Consider writing some internal utility functions (e.g., separate functions for checking the input file format and for certain R value assignments), and calling them inside these exported functions. This will make the code more robust and easier to maintain.

region_info.R

Duplicated code shared with the above two scripts was found. Use internal functions.
Avoid "xx == TRUE" comparisons in R scripts. Use xx or !xx instead.
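
As a rough illustration of both points (a shared internal helper, and testing a logical directly), something along these lines could work. The names `.check_flag` and `describe_input` are invented for this sketch and are not functions from brainflowprobes.

```r
# Hypothetical internal helper (dot-prefixed): validate a logical flag once
.check_flag <- function(x, name = "flag") {
    if (!is.logical(x) || length(x) != 1 || is.na(x)) {
        stop("'", name, "' must be TRUE or FALSE", call. = FALSE)
    }
    invisible(x)
}

# Hypothetical exported-style function that reuses the helper
describe_input <- function(JUNCTIONS = FALSE) {
    .check_flag(JUNCTIONS, "JUNCTIONS")
    # Test the logical directly instead of writing JUNCTIONS == TRUE
    if (JUNCTIONS) "junction input" else "region input"
}

describe_input(TRUE)
```

Each exported function can then call the same helper, so the input check lives in one place.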

man/

Need documentation for the package itself.

vignette

"Basics" section:

install BiocManager conditionally:

if (!requireNamespace("BiocManager", quietly = TRUE))
 install.packages("BiocManager")

It's suggested to also include the instruction for installing the development version of this package through github:
```
BiocManager::install("LieberInstitute/brainflowprobes")
```

Links included here do not work:

If you are asking yourself the question "How do I start using Bioconductor?" you might be interested in [this blog post](http://lcolladotor.github.io/2014/10/16/startBioC/#.VkOKbq6rRuU).

unit tests

ok.

NEWS, readme, etc

ok.

bioc-issue-bot commented 5 years ago

Received a valid push; starting a build. Commits are:

b38ce9c v0.99.3 -- address @Liubuntu's requests from https...

bioc-issue-bot commented 5 years ago

Received a valid push; starting a build. Commits are:

8b5d39f 0.99.4 -- fix the speed gains for plot_coverage() ...

lcolladotor commented 5 years ago

Hi @Liubuntu,

Thank you for your review of brainflowprobes! I took the liberty to address the suggestions you addressed to Amanda @aprice26. There are more unit tests, helper functions and other improvements.

Let us know if you have any questions!

Best, Leo

PS I also fixed the URL (it's all lowercase now) in several other R packages I maintain.

bioc-issue-bot commented 5 years ago

Dear Package contributor,

This is the automated single package builder at bioconductor.org.

Your package has been built on Linux, Mac, and Windows.

On one or more platforms, the build results were: "skipped, TIMEOUT, ERROR". This may mean there is a problem with the package that you need to fix. Or it may mean that there is a problem with the build system itself.

Please see the build report for more details.

bioc-issue-bot commented 5 years ago

Dear Package contributor,

This is the automated single package builder at bioconductor.org.

Your package has been built on Linux, Mac, and Windows.

On one or more platforms, the build results were: "ERROR". This may mean there is a problem with the package that you need to fix. Or it may mean that there is a problem with the build system itself.

Please see the build report for more details.

Liubuntu commented 5 years ago

Hi @lcolladotor ,

Thanks for the updates. The newly added documentation, utility functions and unit tests look great. Here is one more issue from the 2nd review. Please comment back with any questions or updates!

Best, Qian

R/four_panels.R

Currently this function uses the working directory as the default path to save the generated PDF file, which is not recommended! The examples in the help page and in the vignette will generate files in the home directory whenever R CMD build/check is called, e.g., on the Bioconductor build machines with their daily build and check. This fills the build machines' disk space quickly and requires manual cleaning.

If saving results in the working directory is not strictly required, it is recommended to add a new argument, e.g., outdir, for the path and use a temporary directory as the default, so that the files are cleaned up after each build. Users would also have the flexibility to specify their own path (also update the parameter documentation). The current PDF argument could stay unchanged and represent only the PDF file name. You may need to rebuild PDF early inside the script to keep your current code working:
```
PDF <- file.path(outdir, PDF)
```
Also update the documentation and examples accordingly.
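
A minimal sketch of how such an argument might fit into a plotting function, assuming a temporary directory as the default. The function name `save_plot_pdf` and the placeholder plot are invented for illustration; only the `file.path(outdir, PDF)` step comes from the suggestion above.

```r
# Hypothetical function: `outdir` defaults to the session temporary directory
save_plot_pdf <- function(PDF = "four_panels.pdf", outdir = tempdir()) {
    # Build the full output path early, as suggested above
    PDF <- file.path(outdir, PDF)
    grDevices::pdf(file = PDF)
    # Close the device even if plotting fails
    on.exit(grDevices::dev.off(), add = TRUE)
    plot(1:10, main = "placeholder plot")
    invisible(PDF)
}

out <- save_plot_pdf()
file.exists(out)
```

Users who want the file elsewhere can pass their own `outdir`, while examples and vignettes write only to the temporary directory.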

Liubuntu commented 5 years ago

Hi @aprice26 and @lcolladotor ,

I have just checked the data-raw/create_sysdata.R file, and found that you have actually created a TxDb object. Would you consider this data to be useful to the broader Bioconductor community? If yes, you may prepare it as an AnnotationHub package (@lshep could help with any questions), and import that package when creating the other data sets currently inside the data/ folder.

Qian

Liubuntu commented 5 years ago

Any updates on this? @aprice26 @lcolladotor The issue should be easily fixed.

bioc-issue-bot commented 5 years ago

Received a valid push; starting a build. Commits are:

b2472db v0.99.5 -- add the OUTDIR argument to check_pdf() ...

lcolladotor commented 5 years ago

Hi @Liubuntu,

brainflowprobes version 0.99.5 now has the OUTDIR argument which should resolve your comment on that subject https://github.com/Bioconductor/Contributions/issues/1191#issuecomment-530115005.

Regarding the Gencode version 31 lifted over from hg38 to hg19 TxDb object we make at https://github.com/LieberInstitute/brainflowprobes/blob/master/data-raw/create_sysdata.R#L1-L31 (filtered to the canonical chromosomes), I see that there are no AnnotationHub TxDb objects for that annotation. Which might be a case for creating one. However, for brainflowprobes we need the output of derfinder::makeGenomicState() which will take several minutes to run. That is part of https://github.com/LieberInstitute/brainflowprobes/blob/master/data-raw/create_sysdata.R#L34-L68 which is the data actually provided in brainflowprobes. So we would like to provide the gs object in this package anyway.

Thus an AnnotationHub package for brainflowprobes is not really required (if you are ok with the 21.8 Mb installed size of the package). If not, we can create an AnnotationHub package with the TxDb, gs and genes objects.

> library(AnnotationHub)
> ah <- AnnotationHub()
snapshotDate(): 2019-05-02
> query(ah, c('Gencode', 'v31', 'Homo sapiens'))
AnnotationHub with 0 records
# snapshotDate(): 2019-05-02 
> packageVersion('AnnotationHub')
[1] ‘2.16.1’

Best, Leo

bioc-issue-bot commented 5 years ago

Dear Package contributor,

This is the automated single package builder at bioconductor.org.

Your package has been built on Linux, Mac, and Windows.

On one or more platforms, the build results were: "ERROR". This may mean there is a problem with the package that you need to fix. Or it may mean that there is a problem with the build system itself.

Please see the build report for more details.

bioc-issue-bot commented 5 years ago

Received a valid push; starting a build. Commits are:

a0e584d v0.99.6 -- fix a link to the doc file for GRanges

bioc-issue-bot commented 5 years ago

Dear Package contributor,

This is the automated single package builder at bioconductor.org.

Your package has been built on Linux, Mac, and Windows.

On one or more platforms, the build results were: "ERROR". This may mean there is a problem with the package that you need to fix. Or it may mean that there is a problem with the build system itself.

Please see the build report for more details.

Liubuntu commented 5 years ago

Hi @aprice26 and @lcolladotor ,

We had a discussion with the Bioconductor team. Since some Gencode data are already available in other versions, we think it would be useful to have your Gencode v31 data on AnnotationHub.

Also, for "genes.rda" and "gs.rda", we suggest making them available on ExperimentHub, to avoid the build error from the very large tarball size. You'll need to:
1) Upload the AnnotationHub data (prepare the script for generating the data).
2) Upload the ExperimentHub data.
3) Create a software package "brainflowprobesData??" that includes scripts for generating the "genes" and "gs" datasets from the annotation resources and defines functions for retrieving the ExperimentHub data.
4) In this package, "brainflowprobes", depend on the above package and call the functions it defines to load the ExperimentHub data.
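
The retrieval function mentioned in step 3 could be sketched roughly as below. This is only an outline under assumptions: `.fetch_hub_resource` is a made-up name, the hub object would be created by the caller (for example with `ExperimentHub::ExperimentHub()`), and `pattern` stands for whatever search term ends up identifying the uploaded record.

```r
# Hypothetical helper for step 3: fetch one resource from a hub object.
# `hub` is supplied by the caller; `pattern` is a placeholder search term,
# not a real record name.
.fetch_hub_resource <- function(hub, pattern) {
    hits <- AnnotationHub::query(hub, pattern)
    if (length(hits) != 1) {
        stop("Expected exactly one hub record, found ", length(hits),
            call. = FALSE)
    }
    # Downloads the resource on first use, then reads it from the local cache
    hits[[1]]
}
```

Requiring exactly one match keeps the helper from silently picking the wrong record if several versions are uploaded later.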

@lshep, please feel free to modify the above steps if necessary. Questions about AH and EH could also be addressed to Lori.

Best, qian

lcolladotor commented 5 years ago

Hi @lshep @Liubuntu,

Regarding the AnnotationHub package, I looked at the current Gencode v23 files as shown below, which led me to GencodeGffImportPreparer.

library('AnnotationHub')
ah <- AnnotationHub()
# snapshotDate(): 2019-05-02
q <- query(ah, c('Gencode', 'v23', 'Homo sapiens'))
mcols(q)
# DataFrame with 9 rows and 15 columns
unique(q$preparerclass)
# [1] "GencodeGffImportPreparer"
packageVersion('AnnotationHub')
# [1] ‘2.16.1’

From the Google search results for GencodeGffImportPreparer I found https://rdrr.io/github/Bioconductor/AnnotationHubData/src/R/makeGencodeGFF.R which seems like it has everything it needs. I tweaked the code a little bit at https://gist.github.com/lcolladotor/bb8cacb7237a13c092911cf8f2ac7eac/revisions (I made it into a gist so you could see the diffs). I added AnnotationHubData::: to some calls just to test locally, but you could remove those. Basically, I modified .gencodeSourceUrls() such that it would detect the correct genome version for the files that have been lifted over. Then I also changed makeGencodeGFFsToAHMs() such that it would take parameters and pass them to .gencodeSourceUrls(). So this is how it looks with release = '23':

> .gencodeSourceUrls(species = 'Human', release = '23',
+         filetype = 'gff', justRunUnitTest = FALSE)
getting file info: gencode.v23.2wayconspseudos.gff3.gz
getting file info: gencode.v23.annotation.gff3.gz
getting file info: gencode.v23.basic.annotation.gff3.gz
getting file info: gencode.v23.chr_patch_hapl_scaff.annotation.gff3.gz
getting file info: gencode.v23.chr_patch_hapl_scaff.basic.annotation.gff3.gz
getting file info: gencode.v23.long_noncoding_RNAs.gff3.gz
getting file info: gencode.v23.polyAs.gff3.gz
getting file info: gencode.v23.primary_assembly.annotation.gff3.gz
getting file info: gencode.v23.tRNAs.gff3.gz
getting file info: gencode.v23lift37.annotation.gff3.gz
getting file info: gencode.v23lift37.basic.annotation.gff3.gz
getting file info: gencode.v23lift37.long_noncoding_RNAs.gff3.gz
getting file info: gencode.v23lift37.unmapped.gff3.gz
                                                                                                                           fileurl date size
1                           ftp://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_23/gencode.v23.2wayconspseudos.gff3.gz <NA>   NA
2                                ftp://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_23/gencode.v23.annotation.gff3.gz <NA>   NA
3                          ftp://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_23/gencode.v23.basic.annotation.gff3.gz <NA>   NA
4           ftp://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_23/gencode.v23.chr_patch_hapl_scaff.annotation.gff3.gz <NA>   NA
5     ftp://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_23/gencode.v23.chr_patch_hapl_scaff.basic.annotation.gff3.gz <NA>   NA
6                       ftp://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_23/gencode.v23.long_noncoding_RNAs.gff3.gz <NA>   NA
7                                    ftp://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_23/gencode.v23.polyAs.gff3.gz <NA>   NA
8               ftp://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_23/gencode.v23.primary_assembly.annotation.gff3.gz <NA>   NA
9                                     ftp://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_23/gencode.v23.tRNAs.gff3.gz <NA>   NA
10          ftp://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_23/GRCh37_mapping/gencode.v23lift37.annotation.gff3.gz <NA>   NA
11    ftp://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_23/GRCh37_mapping/gencode.v23lift37.basic.annotation.gff3.gz <NA>   NA
12 ftp://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_23/GRCh37_mapping/gencode.v23lift37.long_noncoding_RNAs.gff3.gz <NA>   NA
13            ftp://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_23/GRCh37_mapping/gencode.v23lift37.unmapped.gff3.gz <NA>   NA
                                                                               rdatapath
1                           Gencode_human/release_23/gencode.v23.2wayconspseudos.gff3.gz
2                                Gencode_human/release_23/gencode.v23.annotation.gff3.gz
3                          Gencode_human/release_23/gencode.v23.basic.annotation.gff3.gz
4           Gencode_human/release_23/gencode.v23.chr_patch_hapl_scaff.annotation.gff3.gz
5     Gencode_human/release_23/gencode.v23.chr_patch_hapl_scaff.basic.annotation.gff3.gz
6                       Gencode_human/release_23/gencode.v23.long_noncoding_RNAs.gff3.gz
7                                    Gencode_human/release_23/gencode.v23.polyAs.gff3.gz
8               Gencode_human/release_23/gencode.v23.primary_assembly.annotation.gff3.gz
9                                     Gencode_human/release_23/gencode.v23.tRNAs.gff3.gz
10          Gencode_human/release_23/GRCh37_mapping/gencode.v23lift37.annotation.gff3.gz
11    Gencode_human/release_23/GRCh37_mapping/gencode.v23lift37.basic.annotation.gff3.gz
12 Gencode_human/release_23/GRCh37_mapping/gencode.v23lift37.long_noncoding_RNAs.gff3.gz
13            Gencode_human/release_23/GRCh37_mapping/gencode.v23lift37.unmapped.gff3.gz
                                                                                                                                                                                                                                                                                                                                       description
1                                                                                                                                                                                                                                                   pseudogenes predicted by the Yale & UCSC pipelines, but not by Havana on reference chromosomes
2                                                                                                                                                                                                                                                                                           Gene annotations on reference chromosomes from Gencode
3                                                                                                                                                                                                                                                                                           Gene annotations on reference chromosomes from Gencode
4                                                                                                                                                                                                                                                               Gene annotation on reference-chromosomes/patches/scaffolds/haplotypes from Gencode
5                                                                                                                                                                                                                                                                                           Gene annotations on reference chromosomes from Gencode
6  sub-set of the main annotation files on the reference chromosomes. They contain only the lncRNA genes. Long non-coding RNA genes are considered the genes with any of those biotypes: 'processed_transcript', 'lincRNA', '3prime_overlapping_ncrna', 'antisense', 'non_coding', 'sense_intronic' , 'sense_overlapping' , 'TEC' , 'known_ncrna'.
7                                                                                                                                                                                                                        files contain polyA signals, polyA sites and pseudo polyAs manually annotated by HAVANA from only the refrence chromosome
8                                                                                                                                                                                                                                                                                           Gene annotations on reference chromosomes from Gencode
9                                                                                                                                                                                                                                                                                  tRNA structures predicted by tRNA-Scan on reference chromosomes
10                                                                                                                                                                                                                                                                                          Gene annotations on reference chromosomes from Gencode
11                                                                                                                                                                                                                                                                                          Gene annotations on reference chromosomes from Gencode
12 sub-set of the main annotation files on the reference chromosomes. They contain only the lncRNA genes. Long non-coding RNA genes are considered the genes with any of those biotypes: 'processed_transcript', 'lincRNA', '3prime_overlapping_ncrna', 'antisense', 'non_coding', 'sense_intronic' , 'sense_overlapping' , 'TEC' , 'known_ncrna'.
13                                                                                                                                                                                                                                                                                                                                                
                                                     tags      species taxid genome
1                        gencode,v23,2wayconspseudos,gff3 Homo sapiens  9606 GRCh38
2                             gencode,v23,annotation,gff3 Homo sapiens  9606 GRCh38
3                       gencode,v23,basic,annotation,gff3 Homo sapiens  9606 GRCh38
4        gencode,v23,chr_patch_hapl_scaff,annotation,gff3 Homo sapiens  9606 GRCh38
5  gencode,v23,chr_patch_hapl_scaff,basic,annotation,gff3 Homo sapiens  9606 GRCh38
6                    gencode,v23,long_noncoding_RNAs,gff3 Homo sapiens  9606 GRCh38
7                                 gencode,v23,polyAs,gff3 Homo sapiens  9606 GRCh38
8            gencode,v23,primary_assembly,annotation,gff3 Homo sapiens  9606 GRCh38
9                                  gencode,v23,tRNAs,gff3 Homo sapiens  9606 GRCh38
10                      gencode,v23lift37,annotation,gff3 Homo sapiens  9606 GRCh37
11                gencode,v23lift37,basic,annotation,gff3 Homo sapiens  9606 GRCh37
12             gencode,v23lift37,long_noncoding_RNAs,gff3 Homo sapiens  9606 GRCh37
13                        gencode,v23lift37,unmapped,gff3 Homo sapiens  9606 GRCh37
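
For readers who do not want to open the gist, the genome detection can be pictured with a sketch like the one below. It is not the code from the gist: `.guess_genome` and `.guess_rdatapath` are invented names, and the rule (a `lift37` tag in the file name means GRCh37, otherwise GRCh38) is only inferred from the v23 output above.

```r
# File names copied from the output above
files <- c(
    "gencode.v23.annotation.gff3.gz",
    "gencode.v23lift37.annotation.gff3.gz"
)

# Hypothetical helper: infer the genome build from the file name
.guess_genome <- function(filename) {
    ifelse(grepl("lift37", filename, fixed = TRUE), "GRCh37", "GRCh38")
}

# Hypothetical helper: lifted-over files sit in a GRCh37_mapping/ subfolder
.guess_rdatapath <- function(filename, release) {
    lifted <- grepl("lift37", filename, fixed = TRUE)
    subdir <- ifelse(lifted, "GRCh37_mapping/", "")
    paste0("Gencode_human/release_", release, "/", subdir, filename)
}

data.frame(
    file = files,
    genome = .guess_genome(files),
    rdatapath = .guess_rdatapath(files, release = "23")
)
```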

I think that it makes more sense for the Bioconductor maintainer account to run this for Gencode releases that are missing (like v31) instead of keeping it hardcoded to just v23. Then I can use these new AnnotationHub entries for the ExperimentHub package you described. That could be done with something like:

## Only for human here
makeGencodeGFFsToAHMs_multiple_human <- function() {
    ## Here just two for testing:
    releases <- c('23', '31')
    ## For all including v23
    # releases <- as.character(23:31)
    hubs <- lapply(releases, function(rel) makeGencodeGFFsToAHMs(release = rel))
    unlist(hubs)
}

## Manual check (showing the end of `unlist(hubs)` here):

# $...
# class: AnnotationHubMetadata
# AnnotationHubRoot: NA
# BiocVersion: 3.9
# Coordinate_1_based: TRUE
# DataProvider: Gencode
# DerivedMd5: NA
# Description: sub-set of the main annotation files on the reference chromosomes. They contain only the lncRNA genes. Long non-coding RNA genes are considered the genes with any of those biotypes:
#   'processed_transcript', 'lincRNA', '3prime_overlapping_ncrna', 'antisense', 'non_coding', 'sense_intronic' , 'sense_overlapping' , 'TEC' , 'known_ncrna'.
# DispatchClass: GFF3File
# Error: NA_character
# Genome: GRCh37
# HubRoot: NA
# Location_Prefix: ftp://ftp.ebi.ac.uk/pub/databases/gencode/
# Maintainer: Bioconductor Maintainer <maintainer@bioconductor.org>
# Notes: NA
# PreparerClass: NA
# RDataClass: GRanges
# RDataDateAdded: 2019-10-02
# RDataPath: Gencode_human/release_31/GRCh37_mapping/gencode.v31lift37.long_noncoding_RNAs.gff3.gz
# Recipe: NA
# SourceLastModifiedDate: NA
# SourceMd5: NA
# SourceSize: NA
# SourceType: GFF
# SourceUrl: ftp://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_31/GRCh37_mapping/gencode.v31lift37.long_noncoding_RNAs.gff3.gz
# SourceVersion: NA
# Species: Homo sapiens
# Tags: gencode v31lift37 long_noncoding_RNAs gff3
# TaxonomyId: 9606
# Title: gencode.v31lift37.long_noncoding_RNAs.gff3.gz
#
# [[26]]
# class: AnnotationHubMetadata
# AnnotationHubRoot: NA
# BiocVersion: 3.9
# Coordinate_1_based: TRUE
# DataProvider: Gencode
# DerivedMd5: NA
# Description:
# DispatchClass: GFF3File
# Error: NA_character
# Genome: GRCh37
# HubRoot: NA
# Location_Prefix: ftp://ftp.ebi.ac.uk/pub/databases/gencode/
# Maintainer: Bioconductor Maintainer <maintainer@bioconductor.org>
# Notes: NA
# PreparerClass: NA
# RDataClass: GRanges
# RDataDateAdded: 2019-10-02
# RDataPath: Gencode_human/release_31/GRCh37_mapping/gencode.v31lift37.unmapped.gff3.gz
# Recipe: NA
# SourceLastModifiedDate: NA
# SourceMd5: NA
# SourceSize: NA
# SourceType: GFF
# SourceUrl: ftp://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_31/GRCh37_mapping/gencode.v31lift37.unmapped.gff3.gz
# SourceVersion: NA
# Species: Homo sapiens
# Tags: gencode v31lift37 unmapped gff3
# TaxonomyId: 9606
# Title: gencode.v31lift37.unmapped.gff3.gz

makeAnnotationHubResource("GencodeGffImportPreparer",
                          makeGencodeGFFsToAHMs_multiple_human)

If there's a GitHub repository for AnnotationHubData I can submit my changes as a pull request.

How does this sound?

Best, Leo

Liubuntu commented 5 years ago

I'll follow up once @lshep is finished with the PR, and you have the ExperimentHub data up. @lcolladotor

lcolladotor commented 5 years ago

AdditionalPackage: https://github.com/LieberInstitute/GenomicState

lcolladotor commented 5 years ago

The above likely won't work since https://github.com/Bioconductor/Contributions/blob/master/CONTRIBUTING.md#submitting-related-packages specifies that the author of the issue has to make that comment. Anyway, just checking :P

aprice26 commented 5 years ago

AdditionalPackage: https://github.com/LieberInstitute/GenomicState

bioc-issue-bot commented 5 years ago

Hi @aprice26,

Starting build on additional package https://github.com/LieberInstitute/GenomicState.

IMPORTANT: Please read the instructions for setting up a push hook on your repository, or further changes to your additional package repository will NOT trigger a new build.

The DESCRIPTION file of this additional package is:

Package: GenomicState
Title: Build and access GenomicState objects for use with derfinder tools from
sources like Gencode
Version: 0.99.0
Date: 2019-10-4
Authors@R: 
person("Leonardo", "Collado-Torres", role = c("aut", "cre"), 
email = "lcolladotor@gmail.com", comment = c(ORCID = "0000-0003-2140-308X"))
Description: This package contains functions for building GenomicState objects
from different annotation sources such as Gencode. It also provides access
to these files at JHPCE.
License: Artistic-2.0
Encoding: UTF-8
LazyData: true
Imports: 
GenomicFeatures,
GenomeInfoDb,
rtracklayer,
bumphunter,
derfinder,
AnnotationDbi,
IRanges,
org.Hs.eg.db,
utils,
AnnotationHubData,
AnnotationHub
Roxygen: list(markdown = TRUE)
RoxygenNote: 6.1.1
Suggests: 
knitr,
rmarkdown,
BiocStyle,
knitcitations,
sessioninfo,
testthat (>= 2.1.0),
glue,
derfinderPlot
VignetteBuilder: knitr
URL: https://github.com/LieberInstitute/GenomicState
BugReports: https://github.com/LieberInstitute/GenomicState/issues
biocViews: Coverage, Transcriptomics, Homo_sapiens, TxDb, AnnotationHub
Remotes: lcolladotor/bumphunter@fix_namespace_genes

bioc-issue-bot commented 5 years ago

Dear Package contributor,

This is the automated single package builder at bioconductor.org.

Your package has been built on Linux, Mac, and Windows.

On one or more platforms, the build results were: "skipped, ERROR". This may mean there is a problem with the package that you need to fix. Or it may mean that there is a problem with the build system itself.

Please see the build report for more details.

bioc-issue-bot commented 5 years ago

Dear Package contributor,

This is the automated single package builder at bioconductor.org.

Your package has been built on Linux, Mac, and Windows.

On one or more platforms, the build results were: "skipped, ERROR". This may mean there is a problem with the package that you need to fix. Or it may mean that there is a problem with the build system itself.

Please see the build report for more details.

bioc-issue-bot commented 5 years ago

Received a valid push; starting a build. Commits are:

60b636e v0.99.1 -- fixed some minor bugs

bioc-issue-bot commented 5 years ago

Dear Package contributor,

This is the automated single package builder at bioconductor.org.

Your package has been built on Linux, Mac, and Windows.

On one or more platforms, the build results were: "ERROR". This may mean there is a problem with the package that you need to fix. Or it may mean that there is a problem with the build system itself.

Please see the build report for more details.

bioc-issue-bot commented 5 years ago

Received a valid push; starting a build. Commits are:

979be8c Remove GenomicState.Rproj to comply with BiocCheck...
b6f8f57 Getting ready for 0.99.2

bioc-issue-bot commented 5 years ago

Dear Package contributor,

This is the automated single package builder at bioconductor.org.

Your package has been built on Linux, Mac, and Windows.

On one or more platforms, the build results were: "skipped, ERROR". This may mean there is a problem with the package that you need to fix. Or it may mean that there is a problem with the build system itself.

Please see the build report for more details.

bioc-issue-bot commented 5 years ago

Received a valid push; starting a build. Commits are:

23b0fa1 v0.99.3 -- bump version for the bioc-issue-bot

bioc-issue-bot commented 5 years ago

Dear Package contributor,

This is the automated single package builder at bioconductor.org.

Your package has been built on Linux, Mac, and Windows.

On one or more platforms, the build results were: "ERROR". This may mean there is a problem with the package that you need to fix. Or it may mean that there is a problem with the build system itself.

Please see the build report for more details.

bioc-issue-bot commented 5 years ago

Received a valid push; starting a build. Commits are:

3b34f54 v0.99.4 -- attempt to resolve issues from http://b...

bioc-issue-bot commented 5 years ago

Dear Package contributor,

This is the automated single package builder at bioconductor.org.

Your package has been built on Linux, Mac, and Windows.

Congratulations! The package built without errors or warnings on all platforms.

Please see the build report for more details.

lcolladotor commented 5 years ago

Hi @Liubuntu (cc @lshep),

I added the GenomicState package (docs at http://research.libd.org/GenomicState/). Instead of an ExperimentHub package, I made it an AnnotationHub package since it really contains annotation files from Gencode in different formats: TxDb sqlite files, plus the output of bumphunter::annotateTranscripts(by = 'gene') and derfinder::makeGenomicState() using the TxDb objects. These are made by GenomicState::gencode_txdb(), GenomicState::annotated_genes() and GenomicState::gencode_genomic_state() respectively. I chose those function names since you could imagine adding more annotation sources later on. The only manipulation I do to the GTF files is to subset them to the canonical chromosomes (chrs 1 to 22, X, Y and M), which I believe should be reasonable for an AnnotationHub package.
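
The canonical-chromosome subsetting mentioned above can be illustrated with a minimal base R sketch. The `toy_gtf` data frame and its coordinates are made up for illustration; the actual code in GenomicState::gencode_txdb() works on the imported GTF and may differ.

```r
# Canonical human chromosomes as listed above: chr1..chr22, chrX, chrY, chrM
canonical_chrs <- paste0("chr", c(1:22, "X", "Y", "M"))

# Hypothetical toy table standing in for features imported from a GTF file
toy_gtf <- data.frame(
    seqnames = c("chr1", "chrX", "chrM", "chrUn_KI270742v1", "chr22_KI270879v1_alt"),
    start = c(100L, 200L, 300L, 1L, 1L),
    end = c(150L, 250L, 350L, 50L, 50L),
    stringsAsFactors = FALSE
)

# Keep only the features located on a canonical chromosome
toy_canonical <- toy_gtf[toy_gtf$seqnames %in% canonical_chrs, , drop = FALSE]
print(toy_canonical$seqnames)
```

For a GRanges object, `GenomeInfoDb::keepSeqlevels(gr, canonical_chrs, pruning.mode = "coarse")` would be one way to do the same filtering.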

The TxDb sqlite files are made using the GTF files from Gencode right now. That is, using https://github.com/LieberInstitute/GenomicState/blob/master/R/gencode_txdb.R which is based off the current code at https://github.com/LieberInstitute/brainflowprobes/blob/master/data-raw/create_sysdata.R. Though once https://github.com/Bioconductor/AnnotationHubData/pull/2 is live, GenomicState::gencode_txdb() could use the Gencode GFF files from AnnotationHub. Regardless of the status of https://github.com/Bioconductor/AnnotationHubData/pull/2 I think that having these files pre-computed would be useful since the three steps (TxDb building, annotated genes and genomic state) take a bit to run (for example this one took about 10 minutes https://github.com/LieberInstitute/GenomicState/blob/master/data-raw/logs/build_gencode_human_hg38.32.txt#L114).

The idea is that once the data from GenomicState is available through AnnotationHub, I could then change brainflowprobes to use that data through GenomicState::GenomicStateHub(). Currently, I made objects for human genomes hg38 and hg19 for Gencode versions 23 through 32 (the latest one). While brainflowprobes only needed the hg19 version 31 files (as originally made), we could make brainflowprobes more flexible so it can use any of the Gencode versions on hg19. Additionally, another member of our group needed these files for hg38 Gencode versions 25 and 29 (which is why I made GenomicState::local_metadata()) and would benefit from having the data available through AnnotationHub. This could also help with recountWorkflow, where I currently have users make one of these GenomicState objects https://github.com/LieberInstitute/recountWorkflow/blob/master/vignettes/recount-workflow.Rmd#L874 despite the computing time it requires. That is, it would benefit all derfinderPlot users (or whoever wants to build upon the GenomicState objects).
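
As a rough sketch of the scope described above, the combinations of Gencode version, genome and object type can be enumerated in base R. The `filetype` labels are illustrative names for the three object types (TxDb, annotated genes, genomic state), not necessarily the labels used in the package metadata.

```r
# Gencode versions and genomes mentioned in the comment above
versions <- 23:32
genomes <- c("hg19", "hg38")

# Illustrative labels for the three kinds of objects (assumed names)
filetypes <- c("TxDb", "AnnotatedGenes", "GenomicState")

# One row per object that would need to be built and uploaded
expected <- expand.grid(
    filetype = filetypes,
    genome = genomes,
    version = versions,
    stringsAsFactors = FALSE
)
print(nrow(expected))
print(head(expected))
```

A table like this could be compared against the hub metadata to check that no version/genome combination is missing.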

GenomicState depends on AnnotationHub instead of just importing it so users will have the rest of AnnotationHub functions on their search path as GenomicState::GenomicStateHub() returns the result of AnnotationHub::query().

Once the GenomicState data is live through AnnotationHub I can then finish the docs on the package and GenomicState::GenomicStateHub().

Let me know if you have any questions.

Best, Leo

Liubuntu commented 5 years ago

Hi @lshep ,

Please let me know when the data @lcolladotor has prepared are available on AH. I couldn't find anything yet when adding the "hg19" or "hg38" tag.

 > query(ah, pattern=c("gencode", "v31"))
 AnnotationHub with 13 records
 # snapshotDate(): 2019-10-08                                                                                                                                                                                      
 # $dataprovider: Gencode                                                                                                                                                                                          
 # $species: Homo sapiens                                                                                                                                                                                          
 # $rdataclass: GRanges                                                                                                                                                                                            
 # additional mcols(): taxonomyid, genome, description,                                                                                                                                                            
 #   coordinate_1_based, maintainer, rdatadateadded, preparerclass, tags,                                                                                                                                          
 #   rdatapath, sourceurl, sourcetype                                                                                                                                                                              
 # retrieve records with, e.g., 'object[["AH75118"]]'                                                                                                                                                              

             title
   AH75118 | gencode.v31.2wayconspseudos.gff3.gz
   AH75119 | gencode.v31.annotation.gff3.gz
   AH75120 | gencode.v31.basic.annotation.gff3.gz
   AH75121 | gencode.v31.chr_patch_hapl_scaff.annotation.gff3.gz
   AH75122 | gencode.v31.chr_patch_hapl_scaff.basic.annotation.gff3.gz
   ...       ...
   AH75126 | gencode.v31.tRNAs.gff3.gz
   AH75127 | gencode.v31lift37.annotation.gff3.gz
   AH75128 | gencode.v31lift37.basic.annotation.gff3.gz
   AH75129 | gencode.v31lift37.long_noncoding_RNAs.gff3.gz
   AH75130 | gencode.v31lift37.unmapped.gff3.gz                                                                                                                

 > query(ah, pattern=c("gencode", "v31", "hg19"))
 AnnotationHub with 0 records
 # snapshotDate(): 2019-10-08                                                                                                                                                                                      

> query(ah, pattern=c("gencode", "v31", "hg38"))
 AnnotationHub with 0 records
 # snapshotDate(): 2019-10-08

Liubuntu commented 5 years ago

Hi @lcolladotor ,

Here is a partial review of the GenomicState vignette:

Installation: The vignette should be written as if the package were already included in Bioconductor, so include something like:

1. Download the package from Bioconductor.
{r getPackage, eval=FALSE}
if (!requireNamespace("BiocManager", quietly = TRUE))
  install.packages("BiocManager")
BiocManager::install("pkgName")

Or install the development version of the package from GitHub.
{r, eval = FALSE}
BiocManager::install("githubUserName/pkgname")

2. Load the package into the R session.
{r Load, message=FALSE}
library(pkgName)

Citation: Zotero has just added support for Bioconductor packages. Zotero users can open the package landing page and click on the Zotero icon (a browser extension) to cite it as a software package, with title, author, version and Bioconductor version info. This is particularly useful if there is no journal publication yet, e.g.,
```
[1]L. Collado-Torres, A. E. Jaffe, and J. T. Leek, derfinderPlot: Plotting functions for derfinder. Bioconductor version: Release (3.9), 2019.
```

lshep commented 5 years ago

I don't know what data you are referring to or expecting in the hubs. I added the v31 files as requested.

lcolladotor commented 5 years ago

Hi @Liubuntu,

Thanks for the partial review of GenomicState!

I'll make the changes about installing GenomicState shortly.
I didn't understand what you were suggesting about the citation. Do you mean providing an inst/CITATION file for GenomicState? Or are you referring to the citations in the vignette, which I made with citation('pkgname')? Or maybe changing the inst/CITATION file for derfinderPlot?

Best, Leo

Hi @lshep,

Sorry for the confusion.

The PR to AnnotationHubData that you merged was about adding GFFs for Gencode version 31 to AnnotationHub (and in general any version from 24 to 32, since 23 was there already). brainflowprobes, though, requires objects processed from the annotation that take about 10 minutes to build. Instead of providing that data in the brainflowprobes package, @Liubuntu suggested providing it through Experiment/Annotation hub. That's where the new package submission GenomicState comes in, with data for several Gencode versions that I wish to submit to AnnotationHub (Gencode v23 to v32 for hg19 and hg38). That would be the data described by https://github.com/LieberInstitute/GenomicState/blob/master/inst/extdata/metadata_gencode_human.csv, namely: TxDb sqlite files + bumphunter::annotateTranscripts(by = 'gene', txdb = TxDb) + derfinder::makeGenomicState(txdb = TxDb), which I described in more detail in my previous comment.

You might say that the TxDb files and the AnnotationHubData GFF files are redundant since you can build the TxDb files from the GFF ones. Though that doesn't work exactly right out of the box as shown in https://github.com/LieberInstitute/GenomicState/blob/master/R/gencode_txdb.R and takes a bit of time to compute.

I do think that https://github.com/LieberInstitute/GenomicState/blob/master/R/gencode_txdb.R#L45-L49 could use the Gencode GFF files from AnnotationHub (or the GRanges built on the fly according to your latest comment on the PR https://github.com/Bioconductor/AnnotationHubData/pull/2#issuecomment-539988403) instead of the Gencode GTF files that it uses currently. I could make this change if you upload to AnnotationHub the GFFs (or GRanges) for Gencode versions 23 to 32 (the missing ones, I think, are 24 to 30 and 32).

The demand for more Gencode versions comes from outside of brainflowprobes as we have local users interested in different Gencode versions, which prompted me to think that Bioconductor users in general might want the other versions too.

Let me know if I can clarify anything else or if you want to have a skype chat about this.

Best, Leo

lshep commented 5 years ago

OK, so we will skip using/generating the GFFs in AnnotationHub and go ahead with adding the data from your data package: the TxDb sqlite and Rda files.
Please upload your data to S3 to continue. If you haven't been given credentials recently please email me to get access.

Liubuntu commented 5 years ago

Hi @lcolladotor ,

For the citation comment, there was nothing wrong with your current package.

I was just mentioning that Bioconductor is supported for direct citation using Zotero reference management software. See here for more details: https://support.bioconductor.org/p/124760/ Basically the returned bibliography includes bioc version (e.g., 3.9 / 3.10), year, DOI, etc. This might be useful to know for package maintainers when they publicly promote their package citation or for general users.

Best, Qian

Liubuntu commented 5 years ago

@lcolladotor ,

Please work with Lori @lshep on uploading the additional file so that we can move forward with these data packages. The release schedule indicates that the last day to accept new packages into Bioc 3.10 is next Wednesday, and we would like to have this data available in the new release. Thanks!

Qian

Wednesday October 23

Deadline to add new packages to the BiocC 3.10 manifest. Package submitted to tracker must have
completed the review processes and been accepted to be added to the manifest

lcolladotor commented 5 years ago

Yup I will. I’m at a conference right now and will be back on Monday. Best, Leo

lshep commented 5 years ago

@lcolladotor Data has been added to AnnotationHub

> ah = AnnotationHub()
  |======================================================================| 100%

snapshotDate(): 2019-10-22
> query(ah, "GenomicState")
AnnotationHub with 60 records
# snapshotDate(): 2019-10-22 
# $dataprovider: GENCODE
# $species: Homo sapiens
# $rdataclass: GRanges, TxDb, list
# additional mcols(): taxonomyid, genome, description,
#   coordinate_1_based, maintainer, rdatadateadded, preparerclass, tags,
#   rdatapath, sourceurl, sourcetype 
# retrieve records with, e.g., 'object[["AH75134"]]' 

            title                                              
  AH75134 | TxDb for Gencode v23 on hg19 coordinates           
  AH75135 | Annotated genes for Gencode v23 on hg19 coordinates
  AH75136 | GenomicState for Gencode v23 on hg19 coordinates   
  AH75137 | TxDb for Gencode v23 on hg38 coordinates           
  AH75138 | Annotated genes for Gencode v23 on hg38 coordinates
  ...       ...                                                
  AH75189 | Annotated genes for Gencode v32 on hg19 coordinates
  AH75190 | GenomicState for Gencode v32 on hg19 coordinates   
  AH75191 | TxDb for Gencode v32 on hg38 coordinates           
  AH75192 | Annotated genes for Gencode v32 on hg38 coordinates
  AH75193 | GenomicState for Gencode v32 on hg38 coordinates   
> TxDb = query(ah, "GenomicState")[[1]]
downloading 1 resources
retrieving 1 resource
  |======================================================================| 100%

> TxDb
TxDb object:
# Db type: TxDb
# Supporting package: GenomicFeatures
# Data source: ftp://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_23/GRCh37_mapping/gencode.v23lift37.annotation.gtf.gz
# Organism: Homo sapiens
# Taxonomy ID: 9606
# miRBase build ID: NA
# Genome: hg19
# transcript_nrow: 198269
# exon_nrow: 678347
# cds_nrow: 270269
# Db created by: GenomicFeatures package from Bioconductor
# Creation time: 2019-10-07 10:00:19 -0400 (Mon, 07 Oct 2019)
# GenomicFeatures version at creation time: 1.36.4
# RSQLite version at creation time: 2.1.2
# DBSCHEMAVERSION: 1.2
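
The example above takes the first record by position. As a hedged sketch, a specific record could instead be picked by its title, following the title pattern visible in the listing. The helper `genomicstate_title()` is hypothetical and not part of GenomicState; the hub lookup is only attempted in an interactive session with AnnotationHub installed, since it needs network access.

```r
# Hypothetical helper: build a title matching the pattern shown in the listing,
# e.g. "TxDb for Gencode v23 on hg19 coordinates"
genomicstate_title <- function(filetype, version, genome) {
    sprintf("%s for Gencode v%s on %s coordinates", filetype, version, genome)
}

wanted <- genomicstate_title("GenomicState", "31", "hg19")
print(wanted)

# The lookup needs the AnnotationHub package and network access
if (interactive() && requireNamespace("AnnotationHub", quietly = TRUE)) {
    ah <- AnnotationHub::AnnotationHub()
    hits <- AnnotationHub::query(ah, "GenomicState")
    # Find the AH identifier whose title matches exactly
    id <- names(hits)[hits$title == wanted]
    if (length(id) == 1L) {
        gs <- hits[[id]]
    }
}
```

Selecting by title rather than by index avoids depending on the order in which the records are returned.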

bioc-issue-bot commented 5 years ago

Received a valid push; starting a build. Commits are:

5176c80 v0.99.5 -- update docs now that data is live at An...

bioc-issue-bot commented 5 years ago

Received a valid push; starting a build. Commits are:

375fcf3 v0.99.6 -- update the docs on my laptop, since my ...
