Bioconductor / Contributions

Contribute Packages to Bioconductor
EnsemblGenomes #3499

Open js2264 opened 1 month ago

js2264 commented 1 month ago

Update the following URL to point to the GitHub repository of the package you wish to submit to Bioconductor

Confirm the following by editing each check box to '[x]'

I am familiar with the essential aspects of Bioconductor software management, including:

For questions/help about the submission process, including questions about the output of the automatic reports generated by the SPB (Single Package Builder), please use the #package-submission channel of our Community Slack. Follow the link on the home page of the Bioconductor website to sign up.

bioc-issue-bot commented 1 month ago

Hi @js2264

Thanks for submitting your package. We are taking a quick look at it and you will hear back from us soon.

The DESCRIPTION file for this package is:

Package: EnsemblGenomes
Title: Rapid access to Ensembl-provided genome reference and annotation files
Description: EnsemblGenomes scrapes ensembl.org and ensemblgenomes.org FTP 
    servers to locate genome reference fasta files and genome annotation 
    gff3 files provided for species supported by Ensembl. As of July 2024, 
    this corresponds to more than 300 vertebrate, 300 metazoa, 200 protists, 
    150 plants, 1,000 fungi and 30,000 bacteria species. Rather than supporting
    `BiocFileCache`, EnsemblGenomes simply intends to retrieve and list URL 
    of fasta and gff3 files across Ensembl releases as a plain data frame. 
Version: 0.99.0
Date: 2024-07-26
Authors@R: 
    person("Jacques", "Serizay", , "jacquesserizay@gmail.com", role = c("aut", "cre"))
License: MIT + file LICENSE
URL: https://github.com/js2264/EnsemblGenomes
BugReports: https://github.com/js2264/EnsemblGenomes/issues
biocViews: 
    Software, 
    Sequencing
Encoding: UTF-8
Roxygen: list(markdown = TRUE)
RoxygenNote: 7.2.3
Depends: 
    R (>= 4.3.0)
Imports: 
    httr, 
    rvest, 
    glue, 
    tibble, 
    dplyr, 
    stringr, 
    cli
Suggests: 
    knitr,
    quarto,
    rmarkdown,
    sessioninfo,
    testthat (>= 3.0.0),
    BiocStyle,
    RefManageR
Config/testthat/edition: 3
VignetteBuilder: knitr
LazyData: false

js2264 commented 1 month ago

The package does not rely on any Bioconductor class per se, but is related to genomics projects. The idea here is (1) not to have to rely on AnnotationHub to recover references/annotations, but instead to scrape Ensembl FTP servers directly, and (2) not to provide any caching function, but instead to expose the URL directly so the file can be downloaded locally. I think this approach is helpful to people who are not necessarily versed in AnnotationHub or BiocFileCache, but just want to get a pair of files here and now. Do let me know if you think this should not belong to Bioconductor.

lshep commented 1 month ago

I still think it might be helpful to cache files in some way, shape, or form when they are downloaded. If the web service is down, which we have experienced before with Ensembl APIs, the package will fail, and the end user will not be able to work even if they had downloaded a resource previously. I like the on-demand idea but would still strongly suggest a caching mechanism.
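A minimal sketch of the kind of caching suggested here, written in base R as a stand-in rather than with BiocFileCache. `cached_download` and its default cache location are hypothetical and not part of EnsemblGenomes. The idea is simply that a file fetched once is served from disk afterwards, so it stays usable when the remote server is down.

```r
# Hypothetical helper (not part of EnsemblGenomes): download `url` once and
# reuse the local copy on later calls.
cached_download <- function(url,
                            cache_dir = tools::R_user_dir("EnsemblGenomes", "cache")) {
    dir.create(cache_dir, recursive = TRUE, showWarnings = FALSE)
    # Assumption: the file name alone identifies the resource. Files that share
    # a name across releases would need the release added to the cache key.
    dest <- file.path(cache_dir, basename(url))
    if (!file.exists(dest)) {
        # Download under a temporary name first so that an interrupted transfer
        # never leaves a truncated file in the cache.
        tmp <- tempfile(tmpdir = cache_dir)
        on.exit(unlink(tmp), add = TRUE)
        status <- utils::download.file(url, tmp, mode = "wb", quiet = TRUE)
        if (status != 0L) stop("Download failed: ", url)
        file.rename(tmp, dest)
    }
    dest
}
```

A second call with the same URL returns the cached path without touching the network. A real implementation would probably also want expiry and metadata tracking, which is what a dedicated caching package provides.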

It has also been noted that in certain environments, access to external resources via the FTP protocol can be blocked by institutions or firewalls. Is that a concern at all?

The vignette is very minimal. I ran your functions and got down to the last table:

> list_ensembl_files('amphiprion_percula', release = 'release-100')
ℹ Scanning taxons [release-100]...
✔ Taxon found: vertebrate [release-100]
# A tibble: 4 × 8
  date             release     taxon   collection species type  url   url_status
  <chr>            <chr>       <chr>   <lgl>      <chr>   <chr> <chr>      <int>
1 2020-03-04 22:28 release-100 verteb… NA         amphip… refe… http…        200
2 2020-03-04 22:28 release-100 verteb… NA         amphip… refe… http…        200
3 2020-03-04 22:28 release-100 verteb… NA         amphip… refe… http…        200
4 2020-03-16 00:31 release-100 verteb… NA         amphip… anno… http…        200

Now what? How do I actually download or retrieve the files?
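One possible answer, sketched in base R: loop over the `url` column of the table and fetch every row whose `url_status` is 200. `download_listed_files` and the `local_path` column are hypothetical names, not functions or columns provided by the package. Only the `url` and `url_status` columns shown in the output above are assumed.

```r
# Hypothetical helper (not part of EnsemblGenomes): download every reachable
# file listed in `files` into `dest_dir` and record where each one was saved.
download_listed_files <- function(files, dest_dir = ".") {
    dir.create(dest_dir, recursive = TRUE, showWarnings = FALSE)
    # Keep only rows whose URL responded with HTTP 200; which() drops NA statuses.
    ok <- files[which(files$url_status == 200L), , drop = FALSE]
    dest <- file.path(dest_dir, basename(ok$url))
    for (i in seq_along(dest)) {
        # mode = "wb" keeps compressed files intact on every platform.
        utils::download.file(ok$url[i], dest[i], mode = "wb", quiet = TRUE)
    }
    ok$local_path <- dest
    ok
}

# Usage with the call shown above (requires network access, so left commented):
# files <- list_ensembl_files("amphiprion_percula", release = "release-100")
# local <- download_listed_files(files, dest_dir = "ensembl")
```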

And while the package itself may not rely on Bioconductor classes per se... it would be nice to show how to convert the downloaded files into a Bioconductor class structure for those that are familiar and to show how to seamlessly integrate into the Bioconductor ecosystem if desired.
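A hedged sketch of what such a conversion could look like, assuming the Biostrings and rtracklayer packages are installed (neither is listed in the DESCRIPTION above). `import_ensembl_pair` is a hypothetical helper name. It reads a FASTA file into a `DNAStringSet` and a GFF3 file into a `GRanges`.

```r
# Hypothetical helper (not part of EnsemblGenomes): import a downloaded
# FASTA/GFF3 pair as Bioconductor objects.
import_ensembl_pair <- function(fasta_path, gff3_path) {
    for (pkg in c("Biostrings", "rtracklayer")) {
        if (!requireNamespace(pkg, quietly = TRUE)) {
            stop("Package '", pkg, "' is required; install it with BiocManager::install().")
        }
    }
    list(
        # Sequences as a DNAStringSet, one element per FASTA record.
        sequences = Biostrings::readDNAStringSet(fasta_path),
        # Features as a GRanges; the format is inferred from the file extension.
        annotations = rtracklayer::import(gff3_path)
    )
}
```

From there, the two objects can be used with the usual Bioconductor tooling, for example subsetting `annotations` by feature type before extracting the matching sequences.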

js2264 commented 1 month ago

Thanks for your feedback. I'll add support for caching through BiocFileCache, and add extra information on how to import the downloaded files into Bioconductor workflows.

Regarding access via FTP protocol, I don't think this should be an issue. All the queries are done through HTTP requests, and the recovered URLs are https://....
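This can be checked on any listing with a few lines of base R. The sketch below uses a hypothetical `url_scheme` helper (not part of the package) to extract the scheme of each URL so that any `ftp://` entries stand out. The example URLs are illustrative placeholders.

```r
# Hypothetical helper: return the lower-cased scheme of each URL, or NA when
# the string does not start with a scheme.
url_scheme <- function(urls) {
    has_scheme <- grepl("^[A-Za-z][A-Za-z0-9+.-]*://", urls)
    ifelse(has_scheme, tolower(sub("://.*$", "", urls)), NA_character_)
}

# Illustrative input; with the package, `files$url` would be used instead.
urls <- c("https://example.org/pub/genome.fa.gz", "ftp://example.org/pub/genome.fa.gz")
url_scheme(urls)
all(url_scheme(urls) == "https")
```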

I'll ping you when I have implemented the mentioned features.

lshep commented 1 month ago

excellent. thank you

lshep commented 1 week ago

@js2264 just checking back in. Any progress?

js2264 commented 6 days ago

Hi @lshep, sorry, I haven't had any time to work on this. Is it a problem to leave the issue open until I have more time to spend on it? I hope to get to it in about a week to 10 days...

lshep commented 6 days ago

yes. thank you for the update