Bioconductor / AnnotationForge

Tools for building SQLite-based annotation data packages
https://bioconductor.org/packages/AnnotationForge
4 stars 9 forks source link

First pass at removing biomaRt #57

Open grimbough opened 3 months ago

grimbough commented 3 months ago

Every release we run into issue with the package failing to connect to the Ensembl Mart. Being realistic the Ensembl BioMart is not going to get any more reliable, and this is a continuing pain point (especially @lshep).

It looks like for the most past all it needs to do is download a mapping table between the Entrez IDs and Ensembl IDs. Ensembl provide the relevant tables on the FTP site for most organisms. This pull requests represents a first effort at removing the depdendence on biomaRt and just downloading the relevant files from FTP. It should be consider a draft right now and is barely tested, but I thought it might be a useful starting point. I can spend some more time refining it in the next few days.

grimbough commented 3 months ago

I've tidied up the code a bit now, and think this is now ready for review. Hopefully it's useful.

Given the current increase desire to move away from some of the legacy CRAN pacakges, I took the liberty of removing the depdence on the {RCurl} package, and replaced the relevant code with things from {httr2} or {curl}. You should be able to role that change back by ommitting https://github.com/Bioconductor/AnnotationForge/pull/57/commits/98fa828dcadc5a5c6aa5bdf9932de91b389ff042

Removing {XML} in favour of {xml2} is a more involved process as it appears quite a bit more, but I think could also be done without too much effort if desired.

hpages commented 3 months ago

Thanks Mike.

Is httr2::request(url) |> httr2::req_method("HEAD") |> httr2::req_error(is_error = ~ FALSE) |> httr2::req_perform() |> httr2::resp_is_error() basically equivalent to httr::status_code(httr::HEAD(url)) >= 400 or is the apparent additional complexity of the former trying to handle situations that the latter does not handle properly?

Personally not a big fan of the |> idiom in package code. Also httr2 brings with it a bunch of other tidyverse deps (magrittr, R6, cli, glue, rappdirs, lifecycle, rlang) whereas httr only brings itself and R6 to the party and seems to cover AnnotationForge's needs with simpler syntax (IMO).

grimbough commented 3 months ago

Yes, I think those two calls are pretty much equivalent. You can actually simplify the httr2 version a bit, if you're happy with httr2 converting the HTTP error code into an R error and reporting that. I added some of the additional complexity so it would keep the exiting 'url not found' style error message.

url <- 'https://httpstat.us/404'
httr2::request(url) |> 
  httr2::req_method("HEAD") |> 
  httr2::req_perform() 
#> Error in `httr2::req_perform()`:
#> ! HTTP 404 Not Found.

Personally I find building the request with httr2 much easier compared to the approach in httr of just sticking loads of things in the config and ... arguments but for a simple case like this it's probably overkill.

I was also motivated by the fact that the httr vignette now also reads "httr is no longer recommended for new code: we recommend httr2.", although I don't think it will be going away anytime soon with the number of dependents it has.

I must admit I don't spend too much time looking at the dependency list. I tend to assume most people have dplyr installed, and then you have a large chunk of the tidyverse already there. Also, since a core Biocondutor package (BiocCheck) uses httr2 then I always have all the depdencies installed pretty much from the moment I start a clean R library. Not true for all users though.

I'm also fine with using |> in a package, but this your codebase, so your style should apply.

All this goes to say that I'm not wedded to httr2, it's just my personal preference, and it'd be fine to use httr instread. Hopefully the strategy to use the files from the FTP site is sound, and then I can pretty easily make the switch to httr if that's your preference.

hpages commented 3 months ago

Yeah, if you don't mind switching to httr, that would be great. Pulling stuff from the FTP site is ok. Thanks again.

hpages commented 3 months ago

Thx Mike. Code is slightly simpler and more readable with httr. Also good to know about httr::http_error

About this

    if(length(datSet) == 0) {
        ## TODO: Is this a sensible return value?  Should it be an error?
        return(NULL)
    } else {

in the .getEnsemblData() function.

Not familiar with the code in makeOrgPackageFromNCBI.R but it seems that .getEnsemblData() is expected to return a data.frame otherwise bad things will happen downstream. So it should probably fail hard (possibly with a useful error message) instead of returning a NULL when length(datSet) == 0.

What do you think @lshep?

Thanks again for helping us with this @grimbough. Much appreciated!

H.

grimbough commented 3 months ago

Thanks @hpages I've updated that bit of code to throw an error with a (hopefully) informative message.

lshep commented 2 months ago

sorry I know this has been waiting on me. I'll try and look at it over the next week to test it out

grimbough commented 2 months ago

No rush, I was just going through and making sure all my repositories were up-to-date after the release.