Poorly annotated plants

Fred-White94 commented 3 years ago

I just wanted to point out that using clusterProfiler with OrgDb objects is not ideal for less well-annotated species. This applies when the OrgDb comes from AnnotationHub.

This includes rice, for example. The issue is that the OrgDb lacks translations from Entrez IDs to GO terms: ~75% of the input Entrez IDs do not map to GO terms through this method. Since the OrgDb object does not have an Ensembl keytype, I was forced to translate from Ensembl to Entrez using BioMart, which also loses some IDs. A direct translation from Ensembl to GO terms leaves only ~39% of genes unmapped. I am not aware of a method to update OrgDb objects with, for example, new keytypes, but I need to look into it, as this clusterProfiler method for GSEA is unusable for less well-annotated species.
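For anyone who wants to check this kind of percentage on their own resource, here is a minimal sketch in base R. The helper `unmapped_fraction` and the toy IDs are hypothetical and only illustrate the calculation; the assumption is that the real mapping table would come from something like `AnnotationDbi::select(orgdb, keys = ids, keytype = "ENTREZID", columns = "GO")`, which returns one row per key/GO pair and `NA` in the `GO` column for keys without annotation.

```r
# Fraction of input IDs that have no GO term in a key-to-GO mapping table.
# `mapping` is a data.frame with one row per (key, GO) pair; keys without
# annotation either carry GO = NA or are missing from the table entirely.
# (Hypothetical helper, not part of any Bioconductor package.)
unmapped_fraction <- function(keys, mapping,
                              key_col = "ENTREZID", go_col = "GO") {
  keys <- unique(as.character(keys))
  has_go <- !is.na(mapping[[go_col]])
  annotated <- unique(as.character(mapping[[key_col]][has_go]))
  mean(!(keys %in% annotated))
}

# Toy data standing in for the output of AnnotationDbi::select();
# the IDs are made up for illustration.
ids <- c("1001", "1002", "1003", "1004")
toy_mapping <- data.frame(
  ENTREZID = c("1001", "1001", "1002", "1003", "1004"),
  GO       = c("GO:0008150", "GO:0003674", NA, NA, NA),
  stringsAsFactors = FALSE
)

unmapped_fraction(ids, toy_mapping)  # 0.75
```

Running the same helper on an Ensembl-to-GO table (with `key_col` set to the Ensembl ID column) would give a comparable number for the direct route.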

I have not tried creating an OrgDb from NCBI, but I would not recommend using AnnotationHub for anything other than Arabidopsis/human.

lshep commented 3 years ago

The OrgDbs in AnnotationHub are generated from NCBI data. The main function used for generation is AnnotationForge::prepareDataFromNCBI.

lshep commented 3 years ago

We provide 1000+ OrgDbs for users; to say that their use should be limited to Arabidopsis/human is a little harsh. We also try not to duplicate data: there is a separate community-contributed package called AHEnsDbs that stores its objects in AnnotationHub and provides Ensembl-based annotation databases for all species.
AnnotationHub is mostly user-contributed, with only a handful of resources generated and automatically provided by the core team. We appreciate the feedback and can look into the OrgDb generation in the AnnotationForge package, or would gladly accept any contributed package containing the missing data. You also did not reference the resource you were trying to use, so it is unclear whether it was provided by the core team or by a contributed package; that information would have been helpful.

Bioconductor / AnnotationHub

Poorly annotated plants #23