Bioconductor / AnnotationForge

Tools for building SQLite-based annotation data packages
https://bioconductor.org/packages/AnnotationForge

Unnecessary queries? #45

Closed knokknok closed 1 year ago

knokknok commented 1 year ago

In .getEnsemblData why isn't the requested taxId passed to available.ensembl.datasets? That would avoid many long queries to .ensemblMapsToEntrezId when creating one package.

jmacdon commented 1 year ago

There are two steps to making an OrgDb package with makeOrgPackageFromNCBI. The first is to generate a SQLite database that contains the available mappings for all the data downloaded from NCBI. The second step is to run SQL queries against that database in order to generate the organism-specific package.

The benefit of doing it this way is that you can generate multiple OrgDb packages quickly, using the omnibus database that was generated. The downside is that it takes longer to make the omnibus database in the first place.

In this context, passing in just the one taxid to available.ensembl.datasets isn't useful because the goal is to generate a table that has all available NCBI Gene ID -> Ensembl Gene ID mappings in the omnibus database, not just for the one species.

knokknok commented 1 year ago

The issue is that available.ensembl.datasets frequently fails for me after a random number of TaxIDs:

TaxID: 80966
TaxID: 9646
TaxID: 61819
TaxID: 80972
Error in h(simpleError(msg, call)) :
  error in evaluating the argument 'table' in selecting a method for function '%in%': Timeout was reached: [www.ensembl.org:443] Operation timed out after 10001 milliseconds with 405911 bytes received

Am I wrong in saying that the Ensembl data is only added for the requested tax_id? If this is so, could the query be limited to that tax_id (and added to the cache/tables as needed)?
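
The idea of querying only the requested tax_id and caching the result could be sketched roughly as follows. This is an illustration only, not AnnotationForge code: `slow_ensembl_lookup`, `dataset_cache` and `get_dataset_for_taxid` are hypothetical names, and the lookup is a stub standing in for the real remote query.

```r
# Hypothetical stand-in for the slow remote lookup; the real code would
# query Ensembl here. It records each call so the effect is visible.
queried <- character(0)
slow_ensembl_lookup <- function(tax_id) {
  queried <<- c(queried, tax_id)
  paste0("dataset_for_", tax_id)
}

# Cache keyed by tax ID, so each ID is looked up at most once per session.
dataset_cache <- new.env(parent = emptyenv())

get_dataset_for_taxid <- function(tax_id, lookup = slow_ensembl_lookup) {
  key <- as.character(tax_id)
  if (!exists(key, envir = dataset_cache, inherits = FALSE)) {
    assign(key, lookup(key), envir = dataset_cache)
  }
  get(key, envir = dataset_cache, inherits = FALSE)
}

# Only the requested tax ID is looked up; the repeat request hits the cache.
get_dataset_for_taxid("9646")
get_dataset_for_taxid("9646")
```

With this shape, building one package triggers one lookup rather than one per TaxID, and later packages add to the cache as needed.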

jmacdon commented 1 year ago

You are wrong in saying the Ensembl data is only added for the requested taxid, which I believe is what I already explained previously. You can set options(timeout = 1e5) to eliminate the timeout issue.

knokknok commented 1 year ago

All the data from NCBI is put into the NCBI.sqlite database, but the data from Ensembl is stored only in the org.db package, and only the data for the requested tax_id is downloaded. The calls to available.ensembl.datasets for all taxids are therefore wasteful when only a subset of org.db packages is created.

(also, the "timeout" option is different from the one that triggers the error)
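
If the R-level "timeout" option does not govern the failing request, one generic workaround is to retry the failing call a few times. The sketch below uses base R only. `with_retries` and `flaky` are hypothetical names and not part of AnnotationForge, and `flaky` merely simulates a call that times out twice before succeeding.

```r
# Retry a function call a fixed number of times, pausing between attempts.
with_retries <- function(fun, ..., max_attempts = 3, wait_seconds = 1) {
  last_error <- NULL
  for (attempt in seq_len(max_attempts)) {
    result <- tryCatch(
      list(ok = TRUE, value = fun(...)),
      error = function(e) list(ok = FALSE, error = e)
    )
    if (result$ok) return(result$value)
    last_error <- result$error
    if (attempt < max_attempts) Sys.sleep(wait_seconds)
  }
  stop("all ", max_attempts, " attempts failed: ", conditionMessage(last_error))
}

# Simulated flaky call: fails on the first two attempts, then succeeds.
calls <- 0
flaky <- function(x) {
  calls <<- calls + 1
  if (calls < 3) stop("Timeout was reached")
  x * 2
}

answer <- with_retries(flaky, 21, wait_seconds = 0)
```

Retrying only papers over the failures. It does not reduce the number of queries, which is the underlying complaint here.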

jmacdon commented 1 year ago

Ah, yes you are correct. Given your familiarity with the code, please feel free to submit a patch.

knokknok commented 1 year ago

Ok, will look into it. Thanks!