Bioconductor / Contributions

Contribute Packages to Bioconductor
134 stars 33 forks source link

(inactive) CustomGenome #3224

Closed mtekman closed 3 months ago

mtekman commented 11 months ago

Update the following URL to point to the GitHub repository of the package you wish to submit to Bioconductor

Confirm the following by editing each check box to '[x]'

I am familiar with the essential aspects of Bioconductor software management, including:

For questions/help about the submission process, including questions about the output of the automatic reports generated by the SPB (Single Package Builder), please use the #package-submission channel of our Community Slack. Follow the link on the home page of the Bioconductor website to sign up.

bioc-issue-bot commented 11 months ago

Hi @mtekman

Thanks for submitting your package. We are taking a quick look at it and you will hear back from us soon.

The DESCRIPTION file for this package is:

Package: CustomGenome
Type: Package
Title: Append Custom FASTA Sequences to your Genome and Annotation Files for Downstream Mapping
Version: 0.99.0
Authors@R: 
    c(person(given="Mehmet", family="Tekman",
      email = "mtekman89@gmail.com",
      role = c("aut", "cre"), comment = c(ORCID = "0000-0002-4181-2676")),
      person(given="Sebastian", family="Arnold",
      email = "sebastian.arnold@pharmakol.uni-freiburg.de",
      role = c("fnd"), comment = c(ORCID = "0000-0002-2688-9210")))
Description: CustomGenome generates a new Ensembl reference genome
    sequence and annotation files with custom FASTA sequences (e.g.
    GFP, Tomato) and produces RNA-seq count matrices with them. Genome
    reference files are fetched and validated directly from Ensembl,
    user-sequences are then appended to both, and then the new
    reference files are written to disk. Downstream analysis is then
    facilitated via Rsubread function wrappers to index the new
    reference, align user FASTA sequences to it, and then perform
    quantification for desired features (e.g. exon, utr, etc.)
License: GPL-3
Encoding: UTF-8
Roxygen: list(markdown = TRUE)
RoxygenNote: 7.2.3
VignetteBuilder: knitr
biocViews: Annotation, RNASeq, Alignment, MultipleSequenceAlignment, QualityControl, Preprocessing, Sequencing, DataImport, GeneExpression
Imports: Rsubread (>= 2.14.2), Biostrings (>= 2.68.1), rtracklayer (>= 1.60.0), GenomicRanges (>= 1.52.0), tools, utils
Suggests:
    Rsamtools,
    BiocStyle,
    knitr,
    rmarkdown,
    testthat (>= 3.0.0)
URL: https://gitlab.com/mtekman/CustomGenome
BugReports: https://gitlab.com/mtekman/CustomGenome/issues
Config/testthat/edition: 3
jorainer commented 11 months ago

Hi, I'm just providing some thoughts/comments since I was also dealing a lot with Ensembl annotations.

1) you are downloading fasta and GTF sequences yourself from Ensembl - have you checked whether these would also be available in Bioconductor's AnnotationHub? Or, would it maybe make sense to use the BiocFileCache package for downloading/caching these resources? IMHO that would simplify this process.

The general idea to create custom annotations or add additional sequence/gene information to reference genomes is however really nice.

mtekman commented 11 months ago

Hello @jorainer, thanks for the comments!

The general idea to create custom annotations or add additional sequence/gene information to reference genomes is however really nice.

Exactly, I was surprised it wasn't already a package!


AnnotationHub

Ah that's why these reviews are golden. I wish I had heard about this before I implemented the backend for Ensembl downloads that I did.

library(AnnotationHub)
ah = AnnotationHub(cache = "/my/dest/cache")
ah[ah$species == "Mus musculus" &
     ah$dataprovider == "Ensembl" &
     ah$genome=="GRCm38" &
     ah$sourcetype %in% c("GTF", "ensembl")]

The problem I'm seeing is how to pair a given genome with a given GTF file? With Ensembl this is pretty unambiguous, but with AnnotationHub I'm overwhelmed with choice.

I think this is something I will implement into the next version of this tool, since right now this is very much geared towards Ensembl downloads.


BiocFileCache

I had never heard of this, and it sounds pretty useful for some future analyses. I have some angst related to how it checks for data consistency:

https://bioconductor.org/packages/release/bioc/vignettes/BiocFileCache/inst/doc/BiocFileCache.html

bfcneedsupdate() is a method that will check the local copy of the data’s etag and last_modifed time to the etag and last_modified time of the remote resource as well as an expires time. The cache saves this information when the web resource is initially added. The expires time is checked against the current Sys.time to see if the local resource has expired. If so the resource will deem need to be updated; if unavailable or not expired will check the etag and last_modified_time. The etag information is used definitively if it is available, if it is not available it checks the last_modified time. If the resource does not have a last_modified tag either, it is undetermined. If the resource has not been download yet, it is TRUE

It sounds like it doesn't do any consistency checks with remote checksums, which is a main indicator for me on whether a large file downloaded isn't truncated. Have you had any issues with this in your code?


Also, how does BiocFileCache interact with AnnotionHub? I have the impression that AnnotationHub has it's own caching.

lshep commented 10 months ago

BiocFileCache is used as the backend caching mechanism in AnnotationHub.

lshep commented 10 months ago

You'll need an inst/script data that describes how the data in inst/extdata was generated, and include any relevant source and licensing information for the data there.

bioc-issue-bot commented 10 months ago

Your package has been added to git.bioconductor.org to continue the pre-review process. A build report will be posted shortly. Please fix any ERROR and WARNING in the build report before a reviewer is assigned or provide a justification on why you feel the ERROR or WARNING should be granted an exception.

IMPORTANT: Please read this documentation for setting up remotes to push to git.bioconductor.org. All changes should be pushed to git.bioconductor.org moving forward. It is required to push a version bump to git.bioconductor.org to trigger a new build report.

Bioconductor utilized your github ssh-keys for git.bioconductor.org access. To manage keys and future access you may want to active your Bioconductor Git Credentials Account

bioc-issue-bot commented 10 months ago

Dear Package contributor,

This is the automated single package builder at bioconductor.org.

Your package has been built on the Bioconductor Single Package Builder.

On one or more platforms, the build results were: "WARNINGS, ERROR". This may mean there is a problem with the package that you need to fix. Or it may mean that there is a problem with the build system itself.

Please see the build report for more details.

The following are build products from R CMD build on the Single Package Builder: macOS 12.7.1 Monterey: CustomGenome_0.99.0.tar.gz Linux (Ubuntu 22.04.2 LTS): CustomGenome_0.99.0.tar.gz

Links above active for 21 days.

Remember: if you submitted your package after July 7th, 2020, when making changes to your repository push to git@git.bioconductor.org:packages/CustomGenome to trigger a new build. A quick tutorial for setting up remotes and pushing to upstream can be found here.

lshep commented 10 months ago

Please correct ERROR, Warning, Notes from the build report before a reviewer is assigned for in depth review.

mtekman commented 9 months ago

@lshep Thanks, I've fixed the issue! Sorry for the long delay, it was a busy month

lshep commented 9 months ago

Please push the changes to git.bioconductor.org to trigger a new build report. see https://github.com/Bioconductor/Contributions/issues/3224#issuecomment-1819369522

bioc-issue-bot commented 9 months ago

Received a valid push on git.bioconductor.org; starting a build for commit id: 5b002c807b9b8c1b5591e01bd3473c47e6eefb7d

bioc-issue-bot commented 9 months ago

Dear Package contributor,

This is the automated single package builder at bioconductor.org.

Your package has been built on the Bioconductor Single Package Builder.

On one or more platforms, the build results were: "WARNINGS, ERROR". This may mean there is a problem with the package that you need to fix. Or it may mean that there is a problem with the build system itself.

Please see the build report for more details.

The following are build products from R CMD build on the Single Package Builder: Linux (Ubuntu 22.04.2 LTS): CustomGenome_0.99.1.tar.gz macOS 12.7.1 Monterey: CustomGenome_0.99.1.tar.gz

Links above active for 21 days.

Remember: if you submitted your package after July 7th, 2020, when making changes to your repository push to git@git.bioconductor.org:packages/CustomGenome to trigger a new build. A quick tutorial for setting up remotes and pushing to upstream can be found here.

bioc-issue-bot commented 9 months ago

Received a valid push on git.bioconductor.org; starting a build for commit id: 17a2b45aa9cd1500f950ee794216c1253d217ad1

bioc-issue-bot commented 9 months ago

Dear Package contributor,

This is the automated single package builder at bioconductor.org.

Your package has been built on the Bioconductor Single Package Builder.

On one or more platforms, the build results were: "ERROR". This may mean there is a problem with the package that you need to fix. Or it may mean that there is a problem with the build system itself.

Please see the build report for more details.

The following are build products from R CMD build on the Single Package Builder: macOS 12.7.1 Monterey: CustomGenome_0.99.2.tar.gz Linux (Ubuntu 22.04.2 LTS): CustomGenome_0.99.2.tar.gz

Links above active for 21 days.

Remember: if you submitted your package after July 7th, 2020, when making changes to your repository push to git@git.bioconductor.org:packages/CustomGenome to trigger a new build. A quick tutorial for setting up remotes and pushing to upstream can be found here.

bioc-issue-bot commented 9 months ago

Received a valid push on git.bioconductor.org; starting a build for commit id: a20b98307ed8f4bd284a7a691d337a4e190a7fe5

bioc-issue-bot commented 9 months ago

Dear Package contributor,

This is the automated single package builder at bioconductor.org.

Your package has been built on the Bioconductor Single Package Builder.

On one or more platforms, the build results were: "WARNINGS, ERROR". This may mean there is a problem with the package that you need to fix. Or it may mean that there is a problem with the build system itself.

Please see the build report for more details.

The following are build products from R CMD build on the Single Package Builder: Linux (Ubuntu 22.04.2 LTS): CustomGenome_0.99.3.tar.gz macOS 12.7.1 Monterey: CustomGenome_0.99.3.tar.gz

Links above active for 21 days.

Remember: if you submitted your package after July 7th, 2020, when making changes to your repository push to git@git.bioconductor.org:packages/CustomGenome to trigger a new build. A quick tutorial for setting up remotes and pushing to upstream can be found here.

bioc-issue-bot commented 9 months ago

Received a valid push on git.bioconductor.org; starting a build for commit id: 4ad4318a3cef1df2d87ea278507046de3e0a0851

bioc-issue-bot commented 9 months ago

Dear Package contributor,

This is the automated single package builder at bioconductor.org.

Your package has been built on the Bioconductor Single Package Builder.

On one or more platforms, the build results were: "TIMEOUT, ERROR". This may mean there is a problem with the package that you need to fix. Or it may mean that there is a problem with the build system itself.

Please see the build report for more details.

The following are build products from R CMD build on the Single Package Builder: Linux (Ubuntu 22.04.2 LTS): CustomGenome_0.99.4.tar.gz macOS 12.7.1 Monterey: CustomGenome_0.99.4.tar.gz

Links above active for 21 days.

Remember: if you submitted your package after July 7th, 2020, when making changes to your repository push to git@git.bioconductor.org:packages/CustomGenome to trigger a new build. A quick tutorial for setting up remotes and pushing to upstream can be found here.

bioc-issue-bot commented 9 months ago

Received a valid push on git.bioconductor.org; starting a build for commit id: 41a09577a658c8b6e5783c78b3dad502377a79e2

bioc-issue-bot commented 9 months ago

Dear Package contributor,

This is the automated single package builder at bioconductor.org.

Your package has been built on the Bioconductor Single Package Builder.

On one or more platforms, the build results were: "WARNINGS". This may mean there is a problem with the package that you need to fix. Or it may mean that there is a problem with the build system itself.

Please see the build report for more details.

The following are build products from R CMD build on the Single Package Builder: macOS 12.7.1 Monterey: CustomGenome_0.99.5.tar.gz Linux (Ubuntu 22.04.2 LTS): CustomGenome_0.99.5.tar.gz

Links above active for 21 days.

Remember: if you submitted your package after July 7th, 2020, when making changes to your repository push to git@git.bioconductor.org:packages/CustomGenome to trigger a new build. A quick tutorial for setting up remotes and pushing to upstream can be found here.

bioc-issue-bot commented 9 months ago

Received a valid push on git.bioconductor.org; starting a build for commit id: 1d973511bf3aa102b4e5d390c63a4b2bf0b18ffa

bioc-issue-bot commented 9 months ago

Dear Package contributor,

This is the automated single package builder at bioconductor.org.

Your package has been built on the Bioconductor Single Package Builder.

On one or more platforms, the build results were: "TIMEOUT, ERROR". This may mean there is a problem with the package that you need to fix. Or it may mean that there is a problem with the build system itself.

Please see the build report for more details.

The following are build products from R CMD build on the Single Package Builder: Linux (Ubuntu 22.04.2 LTS): CustomGenome_0.99.6.tar.gz macOS 12.7.1 Monterey: CustomGenome_0.99.6.tar.gz

Links above active for 21 days.

Remember: if you submitted your package after July 7th, 2020, when making changes to your repository push to git@git.bioconductor.org:packages/CustomGenome to trigger a new build. A quick tutorial for setting up remotes and pushing to upstream can be found here.

bioc-issue-bot commented 9 months ago

Received a valid push on git.bioconductor.org; starting a build for commit id: 036cc2b9ac273b5ec21b03509e6bafce23782da3

bioc-issue-bot commented 9 months ago

Dear Package contributor,

This is the automated single package builder at bioconductor.org.

Your package has been built on the Bioconductor Single Package Builder.

On one or more platforms, the build results were: "WARNINGS, ERROR". This may mean there is a problem with the package that you need to fix. Or it may mean that there is a problem with the build system itself.

Please see the build report for more details.

The following are build products from R CMD build on the Single Package Builder: Linux (Ubuntu 22.04.2 LTS): CustomGenome_0.99.7.tar.gz macOS 12.7.1 Monterey: CustomGenome_0.99.7.tar.gz

Links above active for 21 days.

Remember: if you submitted your package after July 7th, 2020, when making changes to your repository push to git@git.bioconductor.org:packages/CustomGenome to trigger a new build. A quick tutorial for setting up remotes and pushing to upstream can be found here.

mtekman commented 9 months ago

@lshep I'm not sure how to get around these errors. The main problem with them is that sometimes a file downloads incomplete and then the checksum fails, but sometimes this works fine (as is the case here: https://github.com/Bioconductor/Contributions/issues/3224#issuecomment-1870657807)

What do I need to do to get this to pass, other than removing more test checks that work fine on my local devices?

lshep commented 9 months ago

Do you implement any sort of caching so that if a file works it doesnt have to be redownloaded the next time? See BiocFileCache

mtekman commented 9 months ago

I don't, but the code does check to see if the file has already been downloaded to the temp folder. I think the main issue is the sheer size of these files.

Using BiocFileCache wouldn't fix this problem if I understand correctly, since if I run this N times over N push/commits

path <- tempfile()
bfc <- BiocFileCache(path, ask = FALSE)
bfcadd(bfc, "genome", fpath="https://www.ensembl.com/big/file.fasta.gz")
bfcadd(bfc, "othergenome", fpath="https://www.ensembl.com/big/file2.fasta.gz")
## later:
mygenome1 <- bfcquery(bfc, "genome")
mygenome2 <- bfcquery(bfc, "othergenome")

it would still initialise N different independent caches and still have to initially download all files that the start of each test.

This doesn't change much from my initial code which downloads to a fixed folder for genomes and re-uses the resources over subsequent calls.

Is there anything else I can do?

lshep commented 8 months ago

If the size of the files are large then caching will be a necessity and likely a requirement by a reviewer. I would recommend a package specific directory that we general allow in cases like this. See UniProt.ws example:

    cache <- tools::R_user_dir("UniProt.ws", "cache")
    bfc <- BiocFileCache(cache, ask=FALSE)

But also be very clear to the user how much space they will need to run example data / vignettes before the start -- understanding that it can be variable once used with other data.

bioc-issue-bot commented 8 months ago

Received a valid push on git.bioconductor.org; starting a build for commit id: 864159a746289e7ca5b23bf12a1ea6b6c15b7115

bioc-issue-bot commented 8 months ago

Dear Package contributor,

This is the automated single package builder at bioconductor.org.

Your package has been built on the Bioconductor Single Package Builder.

On one or more platforms, the build results were: "WARNINGS". This may mean there is a problem with the package that you need to fix. Or it may mean that there is a problem with the build system itself.

Please see the build report for more details.

The following are build products from R CMD build on the Single Package Builder: Linux (Ubuntu 22.04.2 LTS): CustomGenome_0.99.8.tar.gz macOS 12.7.1 Monterey: CustomGenome_0.99.8.tar.gz

Links above active for 21 days.

Remember: if you submitted your package after July 7th, 2020, when making changes to your repository push to git@git.bioconductor.org:packages/CustomGenome to trigger a new build. A quick tutorial for setting up remotes and pushing to upstream can be found here.

bioc-issue-bot commented 8 months ago

Received a valid push on git.bioconductor.org; starting a build for commit id: a6a69c13a70649825f07ac8b37769633974ea3b1

bioc-issue-bot commented 8 months ago

Dear Package contributor,

This is the automated single package builder at bioconductor.org.

Your package has been built on the Bioconductor Single Package Builder.

On one or more platforms, the build results were: "TIMEOUT, WARNINGS". This may mean there is a problem with the package that you need to fix. Or it may mean that there is a problem with the build system itself.

Please see the build report for more details.

The following are build products from R CMD build on the Single Package Builder: Linux (Ubuntu 22.04.2 LTS): CustomGenome_0.99.9.tar.gz macOS 12.7.1 Monterey: CustomGenome_0.99.9.tar.gz

Links above active for 21 days.

Remember: if you submitted your package after July 7th, 2020, when making changes to your repository push to git@git.bioconductor.org:packages/CustomGenome to trigger a new build. A quick tutorial for setting up remotes and pushing to upstream can be found here.

bioc-issue-bot commented 8 months ago

Received a valid push on git.bioconductor.org; starting a build for commit id: 630ed6ce4fe1602ac605460188da79966a45eaf5

bioc-issue-bot commented 8 months ago

Dear Package contributor,

This is the automated single package builder at bioconductor.org.

Your package has been built on the Bioconductor Single Package Builder.

On one or more platforms, the build results were: "WARNINGS". This may mean there is a problem with the package that you need to fix. Or it may mean that there is a problem with the build system itself.

Please see the build report for more details.

The following are build products from R CMD build on the Single Package Builder: Linux (Ubuntu 22.04.2 LTS): CustomGenome_0.99.10.tar.gz macOS 12.7.1 Monterey: CustomGenome_0.99.10.tar.gz

Links above active for 21 days.

Remember: if you submitted your package after July 7th, 2020, when making changes to your repository push to git@git.bioconductor.org:packages/CustomGenome to trigger a new build. A quick tutorial for setting up remotes and pushing to upstream can be found here.

mtekman commented 8 months ago

This version implements BiocFileCache (within get_genome_files) and retires the manual checksum based methods of confirming a cached file.

I see some warnings, but otherwise I think it's ready to be reviewed again.

bioc-issue-bot commented 8 months ago

A reviewer has been assigned to your package for an indepth review. Please respond accordingly to any further comments from the reviewer.

PeteHaitch commented 7 months ago

Hi @mtekman,

I started to review CustomGenome but it soon became clear that it needs some improvements before the review can proceed.

I always start my package review by working through the vignette, like a new user would when they encounter your package. The current vignette has almost no evaluated code chunks, as noted by BiocCheck():

BiocCheck::BiocCheck('CustomGenome_0.99.10.tar.gz')
# Omitting some output
* Checking vignette directory...
    * WARNING: Evaluate more vignette chunks.
        10 out of 13 code chunks = 76% unevaluated
        10 non-exec code chunk(s) (e.g., '```r')
# Omitting some more output

Instead, the vignette has static copy+pasted code and output, and the problem with that is it can easily become out of sync with the actual code in the package. Even the evaluated code chunks, like the very first one, highlight that the documentation has fallen out-of-sync with the package code:

library(CustomGenome)
data("user_sequences")
#> Warning in data("user_sequences"): data set 'user_sequences' not found

Continuing on, the next chunk contains invalid code:

homo_sap = get_genome_files(species="homo_sapiens",
                            output_folder="genomes")
#> Error in get_genome_urls(...): unused argument (output_folder = "genomes")

I corrected that error (using cache_folder instead of output_folder) and was then surprised to find that the code started downloading the genome files for Mus musculus rather than Homo sapiens!

homo_sap = get_genome_files(species="homo_sapiens",
                            cache_folder="genomes")
#> Downloading: Mus_musculus.GRCm39.105.gtf.gz                              

At this point I stopped my review because it's clear the package is not yet ready.

A few other comments from what I saw when reading the first part of the vignette:

Please take a read of https://contributions.bioconductor.org/ to ensure that your package complies with this advice. I'm going to mark this as 'pending pre-review changes'. We can restart the review once you have addressed these issues and confirmed that you've read https://contributions.bioconductor.org/ to ensure that your package complies with the advice therein.

Cheers, Pete

mtekman commented 7 months ago

Hi @PeteHaitch

Thanks a lot for the review, I guess I need to work a lot on the vignette. I somehow assumed that a reproducible vignette was not a neccessity, especially for large datasets -- but I can see that I neglected it so long that it did indeed fall out of sync with the code.

Cheers to @lshep for the R_user_dir suggestion, I somehow overlooked that comment when I reimplemented the caching.

I'll get back on this soon, thanks again

bioc-issue-bot commented 3 months ago

This issue is being closed because there has been no progress for an extended period of time. You may reopen the issue when you have the time to actively participate in the review / submission process. Please also keep in mind that a package accepted to Bioconductor requires a commitment on your part to ongoing maintenance.

Thank you for your interest in Bioconductor.