EnsemblGenomes - GitHub issues

js2264 commented 1 month ago

Update the following URL to point to the GitHub repository of the package you wish to submit to Bioconductor

Repository: https://github.com/js2264/EnsemblGenomes

Confirm the following by editing each check box to '[x]'

[x] I understand that by submitting my package to Bioconductor, the package source and all review commentary are visible to the general public.
[x] I have read the Bioconductor Package Submission instructions. My package is consistent with the Bioconductor Package Guidelines.
[x] I understand Bioconductor Package Naming Policy and acknowledge Bioconductor may retain use of package name.
[x] I understand that a minimum requirement for package acceptance is to pass R CMD check and R CMD BiocCheck with no ERROR or WARNINGS. Passing these checks does not result in automatic acceptance. The package will then undergo a formal review and recommendations for acceptance regarding other Bioconductor standards will be addressed.
[x] My package addresses statistical or bioinformatic issues related to the analysis and comprehension of high throughput genomic data.
[x] I am committed to the long-term maintenance of my package. This includes monitoring the support site for issues that users may have, subscribing to the bioc-devel mailing list to stay aware of developments in the Bioconductor community, responding promptly to requests for updates from the Core team in response to changes in R or underlying software.
[x] I am familiar with the Bioconductor code of conduct and agree to abide by it.

I am familiar with the essential aspects of Bioconductor software management, including:

[x] The 'devel' branch for new packages and features.
[x] The stable 'release' branch, made available every six months, for bug fixes.
[x] Bioconductor version control using Git (optionally via GitHub).

For questions/help about the submission process, including questions about the output of the automatic reports generated by the SPB (Single Package Builder), please use the #package-submission channel of our Community Slack. Follow the link on the home page of the Bioconductor website to sign up.

bioc-issue-bot commented 1 month ago

Hi @js2264

Thanks for submitting your package. We are taking a quick look at it and you will hear back from us soon.

The DESCRIPTION file for this package is:

Package: EnsemblGenomes
Title: Rapid access to Ensembl-provided genome reference and annotation files
Description: EnsemblGenomes scrapes ensembl.org and ensemblgenomes.org FTP 
    servers to locate genome reference fasta files and genome annotation 
    gff3 files provided for species supported by Ensembl. As of July 2024, 
    this corresponds to more than 300 vertebrate, 300 metazoa, 200 protists, 
    150 plants, 1,000 fungi and 30,000 bacteria species. Rather than supporting
    `BiocFileCache`, EnsemblGenomes simply intends to retrieve and list URL 
    of fasta and gff3 files across Ensembl releases as a plain data frame. 
Version: 0.99.0
Date: 2024-07-26
Authors@R: 
    person("Jacques", "Serizay", , "jacquesserizay@gmail.com", role = c("aut", "cre"))
License: MIT + file LICENSE
URL: https://github.com/js2264/EnsemblGenomes
BugReports: https://github.com/js2264/EnsemblGenomes/issues
biocViews: 
    Software, 
    Sequencing
Encoding: UTF-8
Roxygen: list(markdown = TRUE)
RoxygenNote: 7.2.3
Depends: 
    R (>= 4.3.0)
Imports: 
    httr, 
    rvest, 
    glue, 
    tibble, 
    dplyr, 
    stringr, 
    cli
Suggests: 
    knitr,
    quarto,
    rmarkdown,
    sessioninfo,
    testthat (>= 3.0.0),
    BiocStyle,
    RefManageR
Config/testthat/edition: 3
VignetteBuilder: knitr
LazyData: false

js2264 commented 1 month ago

The package does not rely on any Bioconductor class per se, but is related to genomics projects. The idea here is (1) not to have to rely on AnnotationHub to recover references/annotations, but instead to scrape Ensembl FTP servers directly, and (2) not to provide any caching function, but instead to expose the URL directly so the file can be downloaded locally. I think this approach is helpful to people who are not necessarily versed in AnnotationHub or BiocFileCache, but just want to get a pair of files here and now. Do let me know if you think this does not belong in Bioconductor.

lshep commented 1 month ago

I still think it would be helpful to cache downloaded files in some way, shape or form. If the web service is down, which we have experienced before with Ensembl APIs, the package will fail, and the end user will not be able to work even if they had downloaded a resource previously. I like the on-demand idea but would still strongly suggest a caching mechanism.

It has also been noted that in certain environments, access to external resources via the FTP protocol can be blocked by institutions or firewalls. Is that a concern at all?

The vignette is very minimal. I ran your functions and got down to the last table:

> list_ensembl_files('amphiprion_percula', release = 'release-100')
ℹ Scanning taxons [release-100]...
✔ Taxon found: vertebrate [release-100]
# A tibble: 4 × 8
  date             release     taxon   collection species type  url   url_status
  <chr>            <chr>       <chr>   <lgl>      <chr>   <chr> <chr>      <int>
1 2020-03-04 22:28 release-100 verteb… NA         amphip… refe… http…        200
2 2020-03-04 22:28 release-100 verteb… NA         amphip… refe… http…        200
3 2020-03-04 22:28 release-100 verteb… NA         amphip… refe… http…        200
4 2020-03-16 00:31 release-100 verteb… NA         amphip… anno… http…        200

Now what? How do I actually download or retrieve the files?

And while the package itself may not rely on Bioconductor classes per se, it would be nice to show how to convert the downloaded files into Bioconductor class structures for those who are familiar with them, and how to integrate seamlessly into the Bioconductor ecosystem if desired.
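
For illustration, the conversion asked about here could look roughly like the sketch below. It is not part of the package: `guess_format()` and `import_ensembl_file()` are hypothetical helper names, the file is assumed to have been downloaded locally already, and Biostrings and rtracklayer are assumed to be installed.

```r
# Hypothetical helper: infer the file format from the file name,
# ignoring a trailing ".gz" if present.
guess_format <- function(path) {
  base <- sub("\\.gz$", "", basename(path), ignore.case = TRUE)
  ext <- tolower(tools::file_ext(base))
  if (ext %in% c("fa", "fasta", "fna")) {
    "fasta"
  } else if (ext %in% c("gff3", "gff")) {
    "gff3"
  } else {
    stop("Unrecognised file extension: ", path)
  }
}

# Hypothetical helper: read a local fasta or gff3 file into a
# Bioconductor object (DNAStringSet or GRanges respectively).
import_ensembl_file <- function(path) {
  fmt <- guess_format(path)
  pkg <- if (fmt == "fasta") "Biostrings" else "rtracklayer"
  if (!requireNamespace(pkg, quietly = TRUE)) {
    stop("Package '", pkg, "' is required to import ", fmt, " files")
  }
  if (fmt == "fasta") {
    Biostrings::readDNAStringSet(path)
  } else {
    rtracklayer::import(path)
  }
}
```

Dispatching on the file extension keeps the sketch independent of any column in the table above; a real implementation might prefer to rely on the `type` column instead.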

js2264 commented 1 month ago

Thanks for your feedback. I'll add support for caching through BiocFileCache, and add extra information on how to import the downloaded files into Bioconductor workflows.
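
To make the caching idea concrete, here is a minimal base-R sketch. It is not BiocFileCache and not the package's implementation: `cached_download()`, its arguments and the cache location are all assumptions chosen for illustration. The file is fetched only when no cached copy exists, so a previously downloaded resource stays usable if the server is down.

```r
# Hypothetical helper: download `url` once and reuse the local copy afterwards.
# `fetch` defaults to utils::download.file and can be swapped out for testing.
cached_download <- function(url,
                            cache_dir = tools::R_user_dir("EnsemblGenomes",
                                                          which = "cache"),
                            fetch = utils::download.file) {
  dir.create(cache_dir, recursive = TRUE, showWarnings = FALSE)
  # Key the cache on the full URL (sanitised) rather than the base name,
  # so that files with the same name from different releases do not collide.
  key <- gsub("[^A-Za-z0-9._-]", "_", url)
  dest <- file.path(cache_dir, key)
  if (!file.exists(dest)) {
    # Download to a temporary file first so that an interrupted transfer
    # never leaves a partial file under the final cache name.
    tmp <- tempfile(tmpdir = cache_dir)
    on.exit(unlink(tmp), add = TRUE)
    status <- fetch(url, tmp, mode = "wb", quiet = TRUE)
    if (!identical(as.integer(status), 0L)) {
      stop("Download failed: ", url)
    }
    file.rename(tmp, dest)
  }
  dest
}
```

Calling `cached_download()` twice with the same URL returns the same local path and only hits the network the first time.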

Regarding access via FTP protocol, I don't think this should be an issue. All the queries are done through HTTP requests, and the recovered URLs are https://....
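
As a quick way to check this on any result table, one could inspect the URL schemes directly. The sketch below uses a made-up data frame with placeholder `example.org` URLs; only the `url` and `url_status` column names come from the output shown earlier, and `url_scheme()` and `reachable_https()` are hypothetical helpers. It also assumes `url_status` holds an HTTP status code, as the value 200 in that output suggests.

```r
# Placeholder data, not real Ensembl URLs.
files <- data.frame(
  url = c("https://example.org/a.fa.gz",
          "https://example.org/a.gff3.gz",
          "ftp://example.org/b.fa.gz"),
  url_status = c(200L, 200L, NA),
  stringsAsFactors = FALSE
)

# Extract the scheme ("https", "ftp", ...) of each URL, NA when absent.
url_scheme <- function(urls) {
  has_scheme <- grepl("^[A-Za-z][A-Za-z0-9+.-]*://", urls)
  ifelse(has_scheme, tolower(sub("://.*$", "", urls)), NA_character_)
}

# Keep only rows served over HTTPS whose status check returned 200.
reachable_https <- function(files) {
  keep <- url_scheme(files$url) == "https" & files$url_status %in% 200L
  files[which(keep), , drop = FALSE]
}

table(url_scheme(files$url))
reachable_https(files)
```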

I'll ping you when I have implemented the mentioned features.

lshep commented 1 month ago

excellent. thank you

lshep commented 1 week ago

@js2264 just checking back in. Any progress?

js2264 commented 6 days ago

Hi @lshep, sorry, I haven't had any time to work on this. Is it a problem to leave the issue open until I have more time to spend on it? I hope to get some time to work on it in about a week to 10 days.

lshep commented 6 days ago

yes. thank you for the update

Bioconductor / Contributions

EnsemblGenomes #3499