dhimmel / gene-ontology

User-friendly Gene Ontology annotations
https://git.dhimmel.com/gene-ontology/

Discrepancy Between 2015 Gene Ontology Source Data and hetio Biological Process Annotations #5

Open NegarJanani opened 3 weeks ago

NegarJanani commented 3 weeks ago

I'm seeking the source data for the Gene Ontology (Biological Process) in hetio for 2015. According to this manuscript, hetio contains 559,504 annotations linking Biological Processes to genes. The manuscript cites three data sources: Zenodo, the GOA database, and the 2015 Gene Ontology annotations available on GitHub. The paper also mentions that annotations were propagated.

After reviewing the data from these sources, I found the following:

  - GO_annotations/9606/biological process/inferred all evidence: 1,212,012 annotations
  - GO_annotations/9606/biological process/inferred experimental evidence: 405,342 annotations
  - GO_annotations/9606/biological process/direct all evidence: 127,874 annotations
  - GO_annotations/9606/biological process/direct experimental evidence: 31,676 annotations
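For reference, counts like these can be reproduced by summing the number of genes listed per term. The snippet is only a minimal sketch: it assumes a tab-separated layout with a `go_domain` column and a pipe-delimited `gene_ids` column, and the sample rows and the `count_annotations` helper are made up for illustration rather than taken from the real files.

```python
import csv
import io

# Made-up sample in the ASSUMED layout; the real files have more columns and rows.
SAMPLE_TSV = (
    "go_id\tgo_domain\tgene_ids\n"
    "GO:0000001\tbiological_process\t1|2|3\n"
    "GO:0000002\tbiological_process\t2|5\n"
    "GO:0000003\tmolecular_function\t7\n"
)


def count_annotations(handle, domain="biological_process"):
    """Sum the number of genes listed per term, restricted to one GO domain."""
    reader = csv.DictReader(handle, delimiter="\t")
    total = 0
    for row in reader:
        if row["go_domain"] != domain:
            continue
        # Each pipe-delimited gene ID counts as one gene-term annotation.
        gene_ids = [g for g in row["gene_ids"].split("|") if g]
        total += len(gene_ids)
    return total


bp_count = count_annotations(io.StringIO(SAMPLE_TSV))
print(bp_count)
```

To run it on a downloaded file, pass an open file handle in place of the `io.StringIO` object, adjusting the column names if they differ.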

None of these counts matches the 559,504 annotations reported for Biological Processes in hetio. Additionally, obsolete annotations such as GO:0000003 and GO:0000117 are present in the updated 2018 version of the Gene Ontology annotations, even though hetio, which is based on the 2015 manuscript, was created earlier.

@d33bs from @greenelab's software engineering team attempted to recover data from the Hetontology repository, but this effort failed due to the archived and abandoned state of the OLS software.

I'm curious about the origin of these discrepancies. I need the original Gene Ontology annotations for Biological Processes from 2015, as well as the most recent updated data, to test methods for predicting new annotations in hetio that were not present in the original dataset.

dhimmel commented 3 weeks ago

Copying the relevant sections of the manuscript below for reference. Your comment summarizes them well.

Biological processes, cellular components, and molecular functions were extracted from the Gene Ontology. Only terms with 2–1000 annotated genes were included. ... Gene–participates–Biological Process, Gene–participates–Cellular Component, and Gene–participates–Molecular Function edges are from Gene Ontology annotations [157]. As described in Intermediate resources, annotations were propagated [158,159].

Looking at cells 19-20 of integrate.ipynb, we can see how the output from dhimmel/gene-ontology was converted into nodes and edges. Hetionet v1.0 used this version of GO_annotations-9606-inferred-allev.tsv.

So, to the main question: why does the GO annotation count (1,212,012 according to your calculation) not match the edge count (559,504) in Hetionet for Gene–participates–Biological Process? It likely comes down to this code, where:

  1. we filter for protein-coding NCBI genes that are in the Hetionet gene vocabulary.
  2. we omit biological processes whose annotation counts are outside of 2-1000.

    genes = coding_genes & set(map(int, row.gene_ids.split('|')))
    if 2 > len(genes) or len(genes) > 1000:
        continue
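To see how these two filters shrink the count, here is a minimal, self-contained sketch. It is not the notebook's actual code: the `Row` tuple, the `count_edges` helper, the term labels, and the toy gene IDs are all hypothetical, and only the intersection and the 2-1000 size check mirror the snippet above.

```python
from collections import namedtuple

# Hypothetical stand-in for one row of the annotation table.
Row = namedtuple("Row", ["go_id", "gene_ids"])


def count_edges(rows, coding_genes, min_genes=2, max_genes=1000):
    """Count gene-term edges that survive both filters."""
    edges = 0
    for row in rows:
        # Filter 1: keep only genes in the (protein-coding) gene vocabulary.
        genes = coding_genes & set(map(int, row.gene_ids.split("|")))
        # Filter 2: skip terms with fewer than 2 or more than 1000 genes left.
        if len(genes) < min_genes or len(genes) > max_genes:
            continue
        edges += len(genes)
    return edges


# Toy data: gene IDs 1-4 play the role of the coding-gene vocabulary.
coding_genes = {1, 2, 3, 4}
rows = [
    Row("TERM_A", "1|2|99"),  # 2 coding genes remain: kept, contributes 2 edges
    Row("TERM_B", "3|98"),    # 1 coding gene remains: dropped by the size filter
    Row("TERM_C", "97|96"),   # 0 coding genes remain: dropped
]

raw_annotations = sum(len(r.gene_ids.split("|")) for r in rows)
edges = count_edges(rows, coding_genes)
print(raw_annotations, edges)
```

In this toy case 7 raw annotations reduce to 2 edges. The same mechanism, applied to the real file and gene vocabulary, is the likely reason the raw count exceeds the Hetionet edge count.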

"I need the original Gene Ontology annotations for Biological Processes from 2015, as well as the most recent updated data"

It would be valuable to update https://github.com/dhimmel/gene-ontology with the latest data, so I'm happy to help if you would like to take on that task.

NegarJanani commented 3 weeks ago

Thank you very much, @dhimmel, for your prompt and clear response.

The https://github.com/dhimmel/gene-ontology repository is very user-friendly, and I'd like to contribute by updating it. I appreciate your help. Please let me know where I can start.