Bioconductor / AnnotationForge

Tools for building SQLite-based annotation data packages
https://bioconductor.org/packages/AnnotationForge

Problem with ENTREZID in org.Mxanthus.eg.db #8

Closed eduardoillueca closed 4 years ago

eduardoillueca commented 4 years ago

Hi,

I have created an OrgDb package for the organism Myxococcus xanthus DK1622 and I am now working to publish it in Bioconductor, but we have found a problem. In the OrgDb packages available in Bioconductor, the key column is called ENTREZID, but AnnotationForge::makeOrgPackage() requires the central column to be called GID. So the problem is that I cannot publish my package with the same terminology as the other OrgDb packages. Is there any way to solve this? Thanks very much,

Eduardo Illueca

hpages commented 4 years ago

@jmacdon, @dvantwisk, @Kayla-Morrell, @vobencha, @dtenenba, @mrjc42 Can someone with some experience with AnnotationForge::makeOrgPackage() please help Eduardo with this?

@eduardoillueca It will help a lot if you could provide a minimal reproducible example + sessionInfo(). Thanks!

jmacdon commented 4 years ago

There isn't anything here to be solved. makeOrgPackage is intended for making arbitrary OrgDb packages, for which the 'central ID' may not be an NCBI Gene ID. GID (which one could argue is confusing, given NCBI's GI number) is just an ID that is central to the annotation, and can be a Gene ID if you like.

In fact, the example for makeOrgPackage uses finch data, and uses Gene IDs as the GID:

> head(fSym)
     GID SYMBOL                                                 GENENAME
1 751582   SNCA synuclein, alpha (non A4 component of amyloid precursor)

eduardoillueca commented 4 years ago

Thanks @jmacdon for your answer. I think the problem is that Myxococcus xanthus is not an organism with a supported .db0 package. In addition, NCBI does not have the information needed to use makeOrgPackageFromNCBI. So this is the reason that the central column must be GID and the DB schema must be NOSCHEMA.

@hpages Yes, I thought the same, and I am preparing a script and some data that I will share on GitHub when I finish it.

jmacdon commented 4 years ago

@eduardoillueca I am not sure I understand what you mean. Yes, that species has no db0 (true for most species), and it may well be that NCBI doesn't have enough information to use makeOrgPackageFromNCBI. But that isn't why the central column is GID. The reason is that not all species are annotated by NCBI, and thus it makes no sense to force people to call the central ID ENTREZID, particularly if there are no NCBI Gene IDs for that species.

So GID is a generalized term for Gene ID that has no connotation that it is (what was formerly called) an Entrez Gene ID.
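To make this concrete, here is a minimal sketch of a gene_info-style table of the kind passed to makeOrgPackage. The identifiers, symbols, and gene names are hypothetical placeholders, not real Myxococcus xanthus annotations. The only things illustrated are that the first column is named GID and that rows are not duplicated. The makeOrgPackage call is left commented out because its argument values would have to be filled in.

```r
# Hypothetical central IDs: any stable, unique identifier can serve as the GID.
gene_info <- data.frame(
  GID      = c("gene0001", "gene0002", "gene0003"),
  SYMBOL   = c("symA", "symB", "symC"),
  GENENAME = c("hypothetical protein A",
               "hypothetical protein B",
               "hypothetical protein C"),
  stringsAsFactors = FALSE
)

# Sanity checks: first column is named GID, and no row is duplicated.
stopifnot(identical(colnames(gene_info)[1], "GID"))
stopifnot(!any(duplicated(gene_info)))

# With AnnotationForge installed, the table would be passed along these lines
# (every argument value below is a placeholder to fill in):
# AnnotationForge::makeOrgPackage(gene_info  = gene_info,
#                                 version    = "0.1",
#                                 maintainer = "Name <email>",
#                                 author     = "Name",
#                                 outputDir  = ".",
#                                 tax_id     = "<NCBI taxonomy id>",
#                                 genus      = "Myxococcus",
#                                 species    = "xanthus")
```

Nothing in the table says what kind of identifier the GID is, which is why the same layout works whether or not NCBI Gene IDs exist for the organism.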

eduardoillueca commented 4 years ago

Thanks @jmacdon

Yes, I understand. So I think we cannot change the GID column. This is the repository with a reproducible example and data:

https://github.com/eduardoillueca/ProbeAnnotationHub

hpages commented 4 years ago

@jmacdon But isn't there a strong expectation that an org.XXX.eg.db package has an ENTREZID column?

jmacdon commented 4 years ago

@hpages I think I am missing some of the back story here. Is this package intended to be available for download? I can't imagine why we would do such a thing. Instead this should be pushed to the AnnotationHub (which, all things being equal, all the other OrgDb packages should be as well).

lshep commented 4 years ago

Yes, it has been made into an AnnotationHub package, but it was going to mimic how the other OrgDb packages work. We require that all hub resources (annotation and experiment) have a base package associated with them, as a place to report errors and as a reference. Are you suggesting that maybe it shouldn't mimic how a standard OrgDb package works?

lshep commented 4 years ago

I guess a naive question too - if we expect org.xxx.eg.db to have ENTREZID - while it would be duplicated information, would it be possible to just add it in and have both GID and ENTREZID?
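A minimal sketch of what that duplication could look like, using the finch row quoted earlier in the thread, where the GID really is an NCBI Gene ID. Whether makeOrgPackage would then expose the extra column in the same way as the standard org.*.eg.db packages is an assumption, not something confirmed here.

```r
# One row from the finch example: the GID is an NCBI Gene ID in this case.
fSym <- data.frame(
  GID      = "751582",
  SYMBOL   = "SNCA",
  GENENAME = "synuclein, alpha (non A4 component of amyloid precursor)",
  stringsAsFactors = FALSE
)

# Duplicate the central ID under the familiar column name.
# This is only meaningful when the GID values are NCBI Gene IDs.
fSym$ENTREZID <- fSym$GID

# GID stays the first column, as makeOrgPackage requires.
stopifnot(identical(colnames(fSym)[1], "GID"))
stopifnot(identical(fSym$ENTREZID, fSym$GID))
```

If the GID is not an NCBI Gene ID, copying it into a column named ENTREZID would mislabel it.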

hpages commented 4 years ago

[Thanks Lori for chiming in. Was about to post the comment below when I saw your comment. Still posting it.]

I'm also lacking some context. The problem discussed here came up during the submission process: https://github.com/Bioconductor/Contributions/issues/1292

My understanding is that there was some previous discussion between Eduardo, Martin (@mtmorgan), and maybe Lori (@lshep) about the best way to contribute these annotations to Bioconductor. I don't know the reason for submitting this as a package and for naming the package org.Mxanthus.eg.db. Hopefully AnnotationHub experts Martin or Lori can chime in. I just noticed that it was an "eg" package with a central id that was not ENTREZID. My fault for causing so much trouble, sorry!

lshep commented 4 years ago

The discussion was to start recommending all packages to start using the hubs and to slowly start converting (some) of our traditional annotation packages to do the same as well...

jmacdon commented 4 years ago

@lshep There seem to be any number of OrgDb packages on the AnnotationHub that don't have any corresponding downloadable packages? I see around 2000:

> library(AnnotationHub)
> hub <- AnnotationHub()
> z <- query(hub, c("orgdb", "sqlite"))
> z
AnnotationHub with 2034 records

Are you saying that the submitter has to generate an installable package that is used only internally? If so, whether or not the GID is a Gene ID or something else is probably not relevant (where I am assuming that the relevance is due to end users being confused by installing an org.XXX.eg.db package that doesn't have an ENTREZID column). I mean, here's an OrgDb from the hub:

> rando <- hub[["AH74621"]]
downloading 0 resources
loading from cache
> rando
OrgDb object:
| DBSCHEMAVERSION: 2.1
| DBSCHEMA: NOSCHEMA_DB
| ORGANISM: Trypanosoma cruzi
| SPECIES: Trypanosoma cruzi
| CENTRALID: GID
| Taxonomy ID: 5693
| Db type: OrgDb
| Supporting package: AnnotationDbi

Please see: help('select') for usage information
> mcols(z)["AH74621","rdatapath"]
[1] "EuPathDB/OrgDb/3.9/org.Tcruzi.TCC.v42.eg.sqlite"
> head(keys(rando))
[1] "C3747_1005g2"  "C3747_100g10"  "C3747_100g100" "C3747_100g101"
[5] "C3747_100g102" "C3747_100g103"

That's definitely a non-standard OrgDb, and it's called org.Tcruzi.TCC.v42.eg.sqlite, where a neophyte might think there are ENTREZIDs in there somewhere, when there are not.

lshep commented 4 years ago

It's a newer policy for hub resources to be associated with a package (ExperimentHub always had this, and we think it's a good policy). These are the ones added by the core team at release. As we start to transition, I think the plan was to make one that would represent them. That is a good point: the non-standard OrgDbs do not necessarily have an ENTREZID, so it's not universal.

hpages commented 4 years ago

Neophytes don't have expectations, only long-time grumpy Bioconductor users ;-)

hpages commented 4 years ago

@lshep Why not let this one go as a standalone SQLite file until "we" (the core team, not me in particular) have made one? My feeling is that it will make things easier for everybody if "we" lead this transition, announce it, and provide documentation for it, rather than expecting contributors to figure out things by themselves.

jmacdon commented 4 years ago

There is a valid argument (that I am about to make) that we shouldn't actually have any specificity as to what the central ID is. Sure, most of them will be NCBI Gene IDs because we're 'mercans and stuff (well not @hpages but we still accept him ;-D), but particularly as we get more and more non-model organisms, maybe it's a better idea to adopt the idea of a GID that is simply a pertinent ID for that organism, and that we put what the GID is in the metadata table.

Using my example above, I have no idea what the GIDs in that DB are, without actually Googling one to see what pops up.

mtmorgan commented 4 years ago

My feeling is that the opportunities for versioning and documentation that come with a package wrapping access to AnnotationHub resources are a plus compared to an AnnotationHub resource without documentation. I also hear 'perfection is the enemy of progress' somewhere in my head. So my vote is still for creating a package.

If the .eg part of the name is problematic (sounds like it is!) then why not use gid or whatever the central identifier is, and document this in the show method and exploit the opportunity provided by package infrastructure ;) to delve into the details on a help page, etc? It could save Jim some googling.

Jim mentioned 'internal' use, but the package would be distributed as an annotation package and available to users, etc. and in particular for other package authors to indicate dependencies on particular resources.

eduardoillueca commented 4 years ago

Thanks @lshep @hpages @jmacdon @mtmorgan for your comments. I have read them all carefully and I will comment on the following points:

hpages commented 4 years ago

@mtmorgan Is the plan to produce 2034 org packages in the future? Or to wrap batches of org SQLite files in multi-org packages? Or...?

@eduardoillueca

"The GID is only a number that identifies each gene due to the fact that there isn't any identifier in the NCBI. It hasn't any meaning."

Good to know. So the package should definitely not be called org.Mxanthus.eg.db!

I believe this creates a precedent though. FWIW:

> org_pkgs <- BiocManager::available("org\\..*\\.db")
> matrix(unlist(strsplit(org_pkgs, ".", fixed=TRUE)), ncol=4, byrow=TRUE)
      [,1]  [,2]      [,3]     [,4]
 [1,] "org" "Ag"      "eg"     "db"
 [2,] "org" "At"      "tair"   "db"
 [3,] "org" "Bt"      "eg"     "db"
 [4,] "org" "Ce"      "eg"     "db"
 [5,] "org" "Cf"      "eg"     "db"
 [6,] "org" "Dm"      "eg"     "db"
 [7,] "org" "Dr"      "eg"     "db"
 [8,] "org" "EcK12"   "eg"     "db"
 [9,] "org" "EcSakai" "eg"     "db"
[10,] "org" "Gg"      "eg"     "db"
[11,] "org" "Hs"      "eg"     "db"
[12,] "org" "Mm"      "eg"     "db"
[13,] "org" "Mmu"     "eg"     "db"
[14,] "org" "Pf"      "plasmo" "db"
[15,] "org" "Pt"      "eg"     "db"
[16,] "org" "Rn"      "eg"     "db"
[17,] "org" "Sc"      "sgd"    "db"
[18,] "org" "Ss"      "eg"     "db"
[19,] "org" "Xl"      "eg"     "db"

with central id TAIR for org.At.tair.db and ORF for org.Pf.plasmo.db and org.Sc.sgd.db.

For the org.*.eg.db packages, the central id is either EG (according to the CENTRALID field) or ENTREZID (as reported by columns(org.Hs.eg.db)).
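The naming pattern in the table above can be summarized with a small base-R sketch. The package names are copied from that table into a hard-coded vector, so the snippet runs without BiocManager. It only tabulates the third name component and says nothing about what each package actually stores as its central ID.

```r
# Package names as listed in the table above.
org_pkgs <- c(
  "org.Ag.eg.db", "org.At.tair.db", "org.Bt.eg.db", "org.Ce.eg.db",
  "org.Cf.eg.db", "org.Dm.eg.db", "org.Dr.eg.db", "org.EcK12.eg.db",
  "org.EcSakai.eg.db", "org.Gg.eg.db", "org.Hs.eg.db", "org.Mm.eg.db",
  "org.Mmu.eg.db", "org.Pf.plasmo.db", "org.Pt.eg.db", "org.Rn.eg.db",
  "org.Sc.sgd.db", "org.Ss.eg.db", "org.Xl.eg.db"
)

# Split each name on "." and stack the pieces into a 4-column matrix.
parts <- do.call(rbind, strsplit(org_pkgs, ".", fixed = TRUE))

# The third component is the tag that hints at the central ID.
id_tag <- parts[, 3]

# How many packages use each tag.
tag_counts <- table(id_tag)
print(tag_counts)

# Packages whose tag is something other than "eg".
non_eg <- org_pkgs[id_tag != "eg"]
print(non_eg)
```

Of the 19 names, 16 carry the eg tag and three carry something else (tair, plasmo, sgd).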

So the situation is more complicated (dare I say "messy"?) than what I thought.

An alternative to org.Mxanthus.gid.db would be to just get rid of the 3rd part: org.Mxanthus.db. No need to make the name artificially longer if that doesn't convey additional information.

mtmorgan commented 4 years ago

I don't think there would be value in 2034 org packages, no.