Bioconductor / AnnotationHub

Client for the Bioconductor AnnotationHub web resource

Invertebrate Ensembl ID #9

Closed kokitsuyuzaki closed 5 years ago

kokitsuyuzaki commented 5 years ago

Hi,

Nice package! Thanks to it, I could remove the dependency on biomaRt, which is slow and unstable.

By the way, I found that when the species is not a vertebrate, Ensembl IDs cannot be retrieved from AnnotationHub.

For example, when the OrgDb is for Homo sapiens, the columns function returns "ENSEMBL", "ENSEMBLPROT", and "ENSEMBLTRANS".

library("AnnotationHub")
ah <- AnnotationHub()

# Vertebrate (Homo sapiens)
hs <- query(ah, c("OrgDb", "Homo sapiens"))[[1]]
columns(hs)

However, when the species is not a vertebrate, "ENSEMBL", "ENSEMBLPROT", and "ENSEMBLTRANS" are not available.

# EnsemblPlants: http://plants.ensembl.org/index.html
at <- query(ah, c("OrgDb", "Arabidopsis thaliana"))[[1]]
columns(at)

# EnsemblFungi : https://fungi.ensembl.org/index.html
sc <- query(ah, c("OrgDb", "Saccharomyces cerevisiae"))[[1]]
columns(sc)

# EnsemblMetazoa : https://metazoa.ensembl.org/index.html
ce <- query(ah, c("OrgDb", "Caenorhabditis elegans"))[[1]]
columns(ce)

# EnsemblProtists : https://protists.ensembl.org
lm <- query(ah, c("OrgDb", "Leishmania major"))[[1]]
columns(lm)

# EnsemblBacteria: https://bacteria.ensembl.org/index.html
pa <- query(ah, c("OrgDb", "Pseudomonas aeruginosa PAO1"))[[1]]
columns(pa)

Is this because these databases are maintained separately from the main Ensembl database?

lshep commented 5 years ago

I believe this is the case but I don't actually generate the data files that are included in the Hub. Perhaps @dvantwisk or @abelew could confirm.

abelew commented 5 years ago

Greetings,

TL;DR version first: for a few of the examples you provide (elegans, cerevisiae, and PAO1), if you explicitly ask for the Ensembl-sourced data, then you can ask for the Ensembl keytypes. For the others, the source data does not have them, but does have NCBI-specific keytypes.

Here is the long version: In my own work, when dealing with Ensembl, I pretty much exclusively use biomaRt; but also set the host to one of the archive servers in an attempt to avoid the oddities which sometimes crop up. I will try getting some Ensembl material from AnnotationHub and see what I get...

Here is a function which tries to handle the various biomaRt corner cases: https://github.com/abelew/hpgltools/blob/master/R/annotation_biomart.r

You will likely observe that much of it is a series of try()s to check for biomaRt shenanigans.

OK, so while typing this, I performed the queries as shown above and I think I see the answer to your initial question.

Looking first at the Arabidopsis example, you are explicitly asking for the first Arabidopsis OrgDb entry. On my computer I asked for all of them. Looking at the first A.t. entry, I see that the data came from a mix of TAIR, NCBI, and Ensembl; when I asked for the keys of that entry, they were all TAIR IDs.

Looking at cerevisiae, I pick up 3 entries. The first one looks like it came from Ensembl and has ENSEMBLTRANS, among a few other interesting keytypes. The second and third look like ones I generated from fungidb.org and use their keytypes (thus you may pick up the NCBI IDs via the key 'ANNOT_GENE_ENTREZ_ID', InterPro with 'ANNOT_INTERPRO_ID', Pfam with 'ANNOT_PFAM_ID', etc.).

elegans: one entry with a mix of IDs including the ensembl IDs.

major: I get 8 entries, 6 of which are various strains of L.major from tritrypdb.org, the other two look like they came from ncbi. The same keytypes are available as per the fungidb.org cerevisiae entry above.

strain PAO1 (hey, I am playing with PA14 right now!): I do not get any hits if I use the query as you typed it above. But if I simplify it to just "Pseudomonas aeruginosa" I get 2 hits. These do not explicitly state their source, but I would bet they came from NCBI given the keytypes returned.
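The entry-by-entry check described above can be sketched as a small helper. This is only a minimal sketch: has_ensembl_columns() is a hypothetical name, not part of AnnotationHub, and the column sets below are illustrative placeholders rather than real hub output. It assumes you have already collected the result of columns() for each record into a named list.

```r
# Report which OrgDb records expose Ensembl keytypes.
# `cols_by_record` is a named list of character vectors, one per record,
# e.g. the result of calling columns() on each OrgDb (assumed input shape).
has_ensembl_columns <- function(cols_by_record) {
  vapply(cols_by_record,
         function(cols) any(grepl("^ENSEMBL", cols)),
         logical(1))
}

# Illustrative column sets, NOT real hub output.
example_cols <- list(
  record_a = c("ENTREZID", "ENSEMBL", "ENSEMBLPROT", "ENSEMBLTRANS", "SYMBOL"),
  record_b = c("ENTREZID", "GO", "SYMBOL", "TAIR")
)
ensembl_flags <- has_ensembl_columns(example_cols)

# With a live hub, the input could be built roughly like this (needs network):
#   hits <- query(AnnotationHub(), c("OrgDb", "Saccharomyces cerevisiae"))
#   ids <- names(hits)
#   cols_by_record <- lapply(setNames(ids, ids),
#                            function(id) columns(hits[[id]]))
```

Records flagged TRUE are the ones worth loading if Ensembl IDs are needed; the others only offer whatever keytypes their source provided.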

Actually, this raises a question for @lshep: What should I add to my metadata to get nice SOURCEURL entries added to the eupathdb material, like in the Arabidopsis example? (I bet the answer is in AnnotationHubData; I will check there now.)

I hope this helps, atb

kokitsuyuzaki commented 5 years ago

OK, so you mean that whether Ensembl IDs can be retrieved depends heavily on how the query keywords are specified, and that if the source data does not contain Ensembl IDs, I cannot get them, right?

I have one more question. Recently, I found that the AnnotationHub ID (e.g. AH12345) for the same resource is not the same across R versions.

For example, with R 3.5.0 the OrgDb for Homo sapiens is registered as "AH66156", but after I updated R to 3.6.0 it is registered as "AH70572".

library("AnnotationHub")
ah <- AnnotationHub()
query(ah, c("OrgDb", "Homo sapiens"))

I want to reliably select the OrgDb for Homo sapiens in any environment, but I think using a hard-coded AnnotationHub ID like below is not safe.

hs <- AnnotationHub()[["AH70572"]]

Do you think using the query function like below is safer?

hs <- query(ah, c("OrgDb", "Homo sapiens"))

What is your best practice?

Koki

lshep commented 5 years ago

We regenerate the OrgDbs every release, so the most recent one will have a new AH ID. We plan to reuse AH IDs for different versions of the same file, but that is not implemented yet, so every six months or so there will be an updated object in the Hub. The query is therefore safer if you do not want to update your AH ID every release. I would probably add the title to the query as well, if you know it: query(ah, c("OrgDb", "Homo sapiens", "org.Hs.eg.db.sqlite"))
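If a query still returns more than one record, one option is to pick the most recently added one instead of hard-coding an AH ID. The following is a minimal sketch under assumptions: latest_record_id() is a hypothetical helper, the IDs and dates are made up, and the metadata is assumed to be a data frame with AH IDs as row names and an rdatadateadded column, which is what as.data.frame(mcols(hits)) is expected to give.

```r
# Pick the most recently added record from hub metadata.
# `meta` is a data.frame whose row names are AH ids and which has an
# `rdatadateadded` column (assumed layout of as.data.frame(mcols(hits))).
latest_record_id <- function(meta) {
  added <- as.Date(meta$rdatadateadded)
  rownames(meta)[which.max(added)]
}

# Illustrative metadata: ids and dates are placeholders, not real hub records.
example_meta <- data.frame(
  title = c("org.Hs.eg.db.sqlite", "org.Hs.eg.db.sqlite"),
  rdatadateadded = c("2018-10-01", "2019-05-01"),
  row.names = c("AH_old", "AH_new"),
  stringsAsFactors = FALSE
)
latest_id <- latest_record_id(example_meta)

# With a live hub this might be used as follows (needs network access):
#   hits <- query(ah, c("OrgDb", "Homo sapiens", "org.Hs.eg.db.sqlite"))
#   hs <- hits[[latest_record_id(as.data.frame(mcols(hits)))]]
```

Selecting by query plus date keeps scripts working across releases, at the cost of silently picking up a newer annotation build each time.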

kokitsuyuzaki commented 5 years ago

I learned a lot. Thanks!

guisantagui commented 4 years ago

Hi, I saw above that query(ah, c("OrgDb", "Pseudomonas aeruginosa PAO1")) or query(ah, c("OrgDb", "Pseudomonas aeruginosa")) returns some hits. I am getting none. I'm trying to run a functional enrichment with enrichGO on a differential gene expression analysis of P. aeruginosa genes that I did with DESeq2, but since I can't find any OrgDb for Pseudomonas aeruginosa I'm kind of stuck. Do you know where I can find an OrgDb for this organism, or whether there are alternative ways to do this? The gene list I have consists of gene symbols (gene names). Thanks!

lshep commented 4 years ago

OrgDbs are only valid for a given Bioconductor release. I have yet to add the non-standard OrgDbs into AnnotationHub for 3.11/3.12. I plan on working on this today/tomorrow. Hopefully they will appear once they are added. I will post back after the upload is complete.

For clarification and completeness, can you include your sessionInfo() as well?

guisantagui commented 4 years ago

Hi, here is my session info.

`sessionInfo()
R version 3.6.3 (2020-02-29)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 17134)

Matrix products: default

Random number generation:
 RNG: Mersenne-Twister
 Normal: Inversion
 Sample: Rounding

locale:
[1] LC_COLLATE=Catalan_Spain.1252 LC_CTYPE=Catalan_Spain.1252 LC_MONETARY=Catalan_Spain.1252
[4] LC_NUMERIC=C LC_TIME=Catalan_Spain.1252

attached base packages:
[1] stats4 parallel stats graphics grDevices utils datasets methods base

other attached packages:
[1] KEGGREST_1.26.1 MeSH.Pae.PAO1.eg.db_1.13.0 MeSHDbi_1.22.0 AnnotationHub_2.18.0
[5] BiocFileCache_1.10.2 dbplyr_1.4.3 clusterProfiler_3.14.3 org.Sc.sgd.db_3.10.0
[9] reactome.db_1.70.0 topGO_2.38.1 SparseM_1.78 GO.db_3.10.0
[13] AnnotationDbi_1.48.0 IRanges_2.20.2 S4Vectors_0.24.4 Biobase_2.46.0
[17] graph_1.64.0 BiocGenerics_0.32.0 gage_2.36.0 RCurl_1.98-1.1
[21] RGtk2_2.20.36 FGNet_3.20.0

loaded via a namespace (and not attached):
[1] fgsea_1.12.0 colorspace_1.4-1 hwriter_1.3.2 ellipsis_0.3.0
[5] ggridges_0.5.2 qvalue_2.18.0 XVector_0.26.0 rstudioapi_0.11
[9] farver_2.0.3 urltools_1.7.3 graphlayouts_0.7.0 ggrepel_0.8.2
[13] bit64_0.9-7 interactiveDisplayBase_1.24.0 fansi_0.4.1 xml2_1.3.2
[17] splines_3.6.3 R.methodsS3_1.8.0 GOSemSim_2.12.1 polyclip_1.10-0
[21] jsonlite_1.6.1 png_0.1-7 R.oo_1.23.0 shiny_1.4.0.2
[25] ggforce_0.3.1 BiocManager_1.30.10 compiler_3.6.3 httr_1.4.1
[29] rvcheck_0.1.8 fastmap_1.0.1 assertthat_0.2.1 Matrix_1.2-18
[33] cli_2.0.2 later_1.0.0 tweenr_1.0.1 htmltools_0.4.0
[37] prettyunits_1.1.1 tools_3.6.3 igraph_1.2.5 gtable_0.3.0
[41] glue_1.4.0 reshape2_1.4.4 DO.db_2.9 dplyr_0.8.5
[45] rappdirs_0.3.1 fastmatch_1.1-0 Rcpp_1.0.4.6 enrichplot_1.6.1
[49] vctrs_0.2.4 Biostrings_2.54.0 ggraph_2.0.2 stringr_1.4.0
[53] mime_0.9 lifecycle_0.2.0 XML_3.99-0.3 DOSE_3.12.0
[57] europepmc_0.3 zlibbioc_1.32.0 MASS_7.3-51.5 scales_1.1.0
[61] tidygraph_1.1.2 promises_1.1.0 hms_0.5.3 RColorBrewer_1.1-2
[65] yaml_2.2.1 curl_4.3 memoise_1.1.0 gridExtra_2.3
[69] ggplot2_3.3.0 triebeard_0.3.0 stringi_1.4.6 RSQLite_2.2.0
[73] BiocVersion_3.10.1 plotrix_3.7-8 BiocParallel_1.20.1 rlang_0.4.5
[77] pkgconfig_2.0.3 matrixStats_0.56.0 bitops_1.0-6 lattice_0.20-38
[81] purrr_0.3.4 cowplot_1.0.0 bit_1.1-15.2 tidyselect_1.0.0
[85] plyr_1.8.6 magrittr_1.5 R6_2.4.1 DBI_1.1.0
[89] pillar_1.4.3 tibble_3.0.0 crayon_1.3.4 viridis_0.5.1
[93] progress_1.2.2 grid_3.6.3 data.table_1.12.8 blob_1.2.1
[97] digest_0.6.25 xtable_1.8-5 httpuv_1.5.2 tidyr_1.0.2
[101] gridGraphics_0.5-0 R.utils_2.9.2 munsell_0.5.0 viridisLite_0.3.0
[105] ggplotify_0.0.5 `

abelew commented 4 years ago

Hello @guisantagui, there are a few workarounds I think you may employ as well:

  1. You may create OrgDb databases via AnnotationForge::makeOrgPackageFromNCBI(). For example:

    ## Note: this takes a while the first time it is run (I think it warns you).
    test <- AnnotationForge::makeOrgPackageFromNCBI(
        version = "0.1",
        author = "test <atb@test.org>",
        maintainer = "test <atb@test.org>",
        outputDir = ".",
        tax_id = "287",  # check that this taxonomy ID matches your strain
        genus = "Pseudomonas",
        species = "aeruginosa")

    At least on my computer (R 3.6.3) that provided an orgdb instance which is acceptable to clusterProfiler.

  2. I think you may still feed clusterProfiler arbitrary data frames of categories and genes via the enricher() function. I have not tested this though, so I may be wrong.

  3. You can definitely feed topGO, goseq, and GOstats arbitrary data frames of ontology information.

    The last two options may prove nice if you would like to use the annotations from sources like microbesonline.org; if you wish I can send you some example invocations of topGO and friends.
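As a rough illustration of the enricher() route, here is a minimal sketch. It assumes clusterProfiler is installed (the call is skipped otherwise), and every term and gene name is a placeholder rather than a real annotation. The TERM2GENE argument expects a data frame with the term in the first column and the gene in the second.

```r
# Build a TERM2GENE table: first column = term, second column = gene.
# All term and gene names below are placeholders, not real annotations.
term2gene <- data.frame(
  term = c(rep("TERM:0001", 8), rep("TERM:0002", 12)),
  gene = paste0("gene", 1:20),
  stringsAsFactors = FALSE
)

# Genes of interest, e.g. the significant genes from a DESeq2 analysis.
genes_of_interest <- paste0("gene", 1:5)

# Run the over-representation test only if clusterProfiler is available.
# Cutoffs are relaxed because this toy data set is tiny.
if (requireNamespace("clusterProfiler", quietly = TRUE)) {
  res <- clusterProfiler::enricher(
    gene = genes_of_interest,
    TERM2GENE = term2gene,
    minGSSize = 1,
    pvalueCutoff = 1,
    qvalueCutoff = 1
  )
  print(as.data.frame(res))
}
```

For real data, the gene identifiers in the TERM2GENE table must use the same naming scheme as the gene list (gene symbols, in the case discussed here), and the default cutoffs are usually more appropriate than the relaxed ones used for this toy input.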

lshep commented 4 years ago

I agree with the above solution as well. When we add non-standard OrgDbs into AnnotationHub we only include a subset of 1000 species, so there is no guarantee that a given species will remain in the hub. The above function is what we use to generate the OrgDbs we include, and it can be used for any organism in NCBI.

abelew commented 4 years ago

But please don't trust the tax_id I used (287): I just grabbed the first one I saw, and it might not be right for the strain you are actually using.