biolab / orange3-bioinformatics

🍊🔬 Bioinformatics add-on for Orange3
GNU General Public License v3.0

Gene Annotations - data from other databases #218

Closed (Hrovatin closed 3 years ago)

Hrovatin commented 4 years ago

Missing annotations: Some genes have no description in Orange or NCBI Gene but do have one in the primary source. For example, DDB_G0278971 from Dictyostelium discoideum has no description in Orange/Entrez (at least not under its DDB ID) but has one in dictyBase, which NCBI Gene lists as the primary source for other genes. Additionally, some genes (such as DDB_G0294092 and DDB_G0294072) have descriptions in dictyBase and UniProt and are included in some gene ontologies (e.g. https://www.ebi.ac.uk/QuickGO/search/DDB_G0294092?geneProductId=O21049), but are not linked on NCBI Gene (this may be the record for DDB_G0294092: https://www.ncbi.nlm.nih.gov/gene/2193893) and are not annotated in Orange. So if individual organism databases would be too hard to incorporate, maybe there could be a UniProt annotator, since UniProt seems to contain some of the genes missing from NCBI. UniProt also offers descriptions based on homology.
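
A rough sketch of the fallback idea, just to make the proposal concrete: try NCBI Gene first and fall back to UniProt or an organism database only when no description is found. Everything here is hypothetical (the `dict_annotator` and `annotate` helpers, the `SOURCES` order, and the placeholder description strings); it is not how Orange currently works, and real annotators would query the actual databases instead of in-memory tables.

```python
from typing import Callable, Dict, List, Optional, Tuple

# Hypothetical annotator type: takes a gene ID, returns a description or None.
Annotator = Callable[[str], Optional[str]]


def dict_annotator(table: Dict[str, str]) -> Annotator:
    """Wrap an in-memory ID -> description table as an annotator."""
    def lookup(gene_id: str) -> Optional[str]:
        description = table.get(gene_id)
        # Treat an empty description the same as a missing record.
        return description or None
    return lookup


def annotate(
    gene_id: str, sources: List[Tuple[str, Annotator]]
) -> Optional[Tuple[str, str]]:
    """Return (source name, description) from the first source that has one."""
    for name, annotator in sources:
        description = annotator(gene_id)
        if description is not None:
            return name, description
    return None


# Placeholder tables: the descriptions are stand-ins, not real database content.
ncbi = dict_annotator({"DDB_G0278971": ""})
uniprot = dict_annotator({"DDB_G0294092": "<description from UniProt>"})
dictybase = dict_annotator({"DDB_G0278971": "<description from dictyBase>"})

# Assumed priority order: NCBI stays the primary source.
SOURCES = [("NCBI Gene", ncbi), ("UniProt", uniprot), ("dictyBase", dictybase)]

for gene in ("DDB_G0278971", "DDB_G0294092"):
    print(gene, annotate(gene, SOURCES))
```

Keeping NCBI first would leave existing annotations unchanged, so the extra sources would only fill gaps.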

Additional information that could be displayed: Since the study of noncoding RNAs is becoming increasingly popular, the Gene widget could show another category, "Gene type" from NCBI (e.g. protein coding), so that the user can see whether a gene encodes a protein.
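
As a minimal sketch of where this value could come from: NCBI's tab-delimited `gene_info` files carry a `type_of_gene` column next to `GeneID`. The snippet below reads that column from a small in-memory sample. The `read_gene_types` helper is hypothetical, and the sample rows are made-up placeholders with only a subset of the real columns, not actual NCBI records.

```python
import csv
import io
from typing import Dict, TextIO


def read_gene_types(handle: TextIO) -> Dict[str, str]:
    """Map GeneID -> type_of_gene from a gene_info-style TSV with a header row."""
    reader = csv.DictReader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    return {row["GeneID"]: row["type_of_gene"] for row in reader}


# Made-up sample rows; a real gene_info file has more columns than shown here.
SAMPLE = (
    "#tax_id\tGeneID\tSymbol\tdescription\ttype_of_gene\n"
    "0\t1\tgeneA\tplaceholder\tprotein-coding\n"
    "0\t2\tgeneB\tplaceholder\tncRNA\n"
)

gene_types = read_gene_types(io.StringIO(SAMPLE))
for gene_id, gene_type in gene_types.items():
    print(gene_id, gene_type)
```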

JakaKokosar commented 4 years ago

This is a very relevant issue, yet I think it is out of scope for this project unless we commit a lot of man-hours to it. The biggest issue, I think, is that we don't want to become yet another source of curated data from various databases. That adds a lot of maintenance overhead and requires domain knowledge. While it would be great to support this in full (contributions are welcome :P), I just don't see an easy way of doing this right now.

The reason we use NCBI as a source is that it simplifies things a lot for us: 1) We gather all relevant gene information. 2) GO annotations and GEO datasets are already mapped to Entrez IDs. 3) Panglao and cellMarker (marker gene databases) also use Entrez IDs. 4) Widgets expect and work with Entrez IDs to reduce code complexity, ...

There are downsides to this: the issue you describe, and also this one: https://github.com/biolab/orange3-bioinformatics/issues/119

When I started working on this add-on, I quickly realized that even a simple task like mapping gene symbols to a given ID can be a challenge if you want to do it properly. Personally, I would like to avoid this unless there is a general consensus on how we are going to implement and support it.

@BlazZupan and @mstrazar what are your thoughts on this subject?

BlazZupan commented 4 years ago

I agree with Jaka, and while this is an open issue, there are other issues in Orange development that more urgently need attention.