Bioconductor / AnnotationForge

Tools for building SQLite-based annotation data packages
https://bioconductor.org/packages/AnnotationForge
4 stars 9 forks source link

Optimizations and allow specification of the desired Ensembl release #46

Closed knokknok closed 1 year ago

knokknok commented 1 year ago
knokknok commented 1 year ago

Any chance this could be pulled before 3.17? 🙏

lshep commented 1 year ago

Thanks for the PR. We will try to get to evaluating this as soon as possible. @jmacdon if you have any comments or want to review too that would be fantastic.

jmacdon commented 1 year ago

@knokknok and @lshep I do have comments. I meant to post last week but didn't have time.

This PR doesn't work correctly if an Ensembl version is not specified. What is meant to happen is if the release version is not specified, the helper function getFastaSpeciesDirs selects the current release version. However, the version isn't returned by that function. This could be fixed by changing the last line for getFastaSpeciesDirs to

return(list(res = res, release = release))

And then changing the beginning of available.FastaEnsemblSpecies

if (is.null(speciesDirs) || is.null(release) {
         vals <- getFastaSpeciesDirs(release = release)
         speciesDirs <- vals$res
         release <- vals$release
}

Otherwise it appears to work as intended.

knokknok commented 1 year ago

Dear @lshep and @jmacdon, Thanks for the comments. However, using version=NULL in useEnsembl automatically selects the current release. The actual value is only needed in getFastaSpeciesDirs because otherwise, the correct directory on the FTP sever is not known. So, if one wants the current release, it should not be necessary to return the value from getFastaSpeciesDirs.

jmacdon commented 1 year ago

@knokknok Oh, right. That's the other thing I meant to address. If release = NULL you also need to return the release version from available.ensembl.datasets. In .getEnsemblData you have this bit of code.

datSets <- available.ensembl.datasets(taxId, release=release)
    datSet <- datSets[names(datSets) %in% taxId]
    ens <- biomaRt::useEnsembl('ensembl', datSet, version=release)
    colName <- if (as.integer(release)>=97) "entrezgene_id" else "entrezgene"

And while useEnsembl doesn't care if you provide a NULL value for release, that if statement sure does.

knokknok commented 1 year ago

@jmacdon

And while useEnsembl doesn't care if you provide a NULL value for release, that if statement sure does.

Indeed! But it might be simpler to just replace with: colName <- if (is.null(release) || as.integer(release)>=97) "entrezgene_id" else "entrezgene" Since the current release will always be >=97.

jmacdon commented 1 year ago

@lshep I am OK with merging this PR. I'll look into putting a tryCatch around the call to useEnsembl to hopefully harden against failures so late in the build process.

lshep commented 1 year ago

@knokknok sorry for the delay on this. When I'm trying to merge I think there is a conflict with some of the recent changes that were included. Could you merge the current devel branch into yours and fix any merge conflicts?

knokknok commented 1 year ago

@Ishep It's weird since when I synced with Bioconductor:devel, I got no conflict... Can you try again?

lshep commented 1 year ago

Thank you that worked. and thank you again for this PR !

knokknok commented 1 year ago

🙏