Cellular-Semantics / CL_KG

Building a Cell Ontology Knowledge-Base from data, and LLMs
Apache License 2.0

GO annotation import pipeline #15

Open dosumis opened 3 months ago

dosumis commented 3 months ago

Aim: Import high-confidence GO annotations (GO term to gene) for GO terms that are directly linked to CL terms, in either direction, by an object property restriction.

The links from GO terms to Genes (proteins) should follow standard GO semantics. There are 3 possible sources:

  1. QuickGO API: Python script to generate a ROBOT template --> RDF
  2. AmiGO API: Python script to generate a ROBOT template --> RDF. https://amigo.geneontology.org/amigo (ask Chris/Seth about the API)
  3. OAK: apparently can generate an RDF file with GO annotations for any set of GO terms. This may be ideal if the library is robust enough and scales well. https://incatools.github.io/ontology-access-kit/guide/associations.html

Filter on species (mouse + human) + direct experimental evidence ECO:0000269

Example of GO annotation query with filters through QuickGO GUI: https://www.ebi.ac.uk/QuickGO/annotations?goUsage=descendants&goUsageRelationships=is_a,part_of,occurs_in&goId=GO:0097208&taxonId=9606&taxonUsage=descendants&evidenceCode=ECO:0000269&evidenceCodeUsage=descendants

https://www.ebi.ac.uk/QuickGO/api/index.html#!/annotations/downloadLookupUsingGET
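The GUI query above could be reproduced programmatically along these lines. This is a minimal sketch, not a tested client: it assumes the QuickGO annotation search service lives at the base URL shown and accepts the same parameter names as the GUI URL, and the function name `build_quickgo_url` is made up for illustration. It only builds the request URL; sending it (and paging through results) is left out.

```python
from urllib.parse import urlencode

# Assumed base URL of the QuickGO annotation search service; verify it against
# the API documentation linked above before relying on it.
QUICKGO_SEARCH = "https://www.ebi.ac.uk/QuickGO/services/annotation/search"


def build_quickgo_url(go_ids, taxon_ids=("9606", "10090"), evidence="ECO:0000269"):
    """Build a search URL mirroring the filters used in the GUI example.

    Parameter names are copied from the GUI URL in this issue; whether the
    REST service accepts all of them unchanged is an assumption.
    """
    params = {
        "goId": ",".join(go_ids),
        "goUsage": "descendants",
        "goUsageRelationships": "is_a,part_of,occurs_in",
        "taxonId": ",".join(taxon_ids),      # human + mouse by default
        "taxonUsage": "descendants",
        "evidenceCode": evidence,            # direct experimental evidence
        "evidenceCodeUsage": "descendants",
    }
    # Keep ':' and ',' readable so the URL looks like the GUI one.
    return f"{QUICKGO_SEARCH}?{urlencode(params, safe=':,')}"


print(build_quickgo_url(["GO:0097208"], taxon_ids=("9606",)))
```

Building the URL separately from sending it makes the filter set easy to inspect and compare against the GUI link.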

dosumis commented 1 month ago

This requires finding direct GO -> CL links. The simplest way to get these would be with Ubergraph queries.

ubyndr commented 1 month ago

Hi @dosumis,

I've made some initial progress on this issue. I retrieved the GO terms linked to cells in CL_KG using the following query:

MATCH (c:Cell)-[]-(g)
WHERE g.curie STARTS WITH 'GO:'
RETURN g.curie

I then queried the QuickGO annotation service with all the GO terms, except for GO:0030154 and GO:0005622 (which caused a timeout error that I'll investigate further).
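One possible way to keep individual requests small while the timeout is investigated is to set the two problem terms aside and send the remaining terms in batches. The sketch below assumes the Cypher query can return the same CURIE more than once (it has no DISTINCT) and that smaller requests are less likely to time out; the helper name `batch_go_terms` and the batch size are arbitrary choices, not anything from the existing script.

```python
def batch_go_terms(go_ids, skip=("GO:0030154", "GO:0005622"), batch_size=25):
    """Deduplicate GO CURIEs, drop the terms in `skip`, and split into batches.

    `skip` defaults to the two terms reported above as causing timeouts.
    The first-seen order of the remaining terms is preserved.
    """
    seen = set(skip)
    unique = []
    for go_id in go_ids:
        if go_id not in seen:
            seen.add(go_id)
            unique.append(go_id)
    return [unique[i:i + batch_size] for i in range(0, len(unique), batch_size)]


terms = ["GO:0097474", "GO:0042492", "GO:0097474", "GO:0030154", "GO:0002365"]
for batch in batch_go_terms(terms, batch_size=2):
    print(batch)
```

The skipped terms could later be queried on their own, one request each, once the cause of the timeout is known.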

Here’s the preview of the resulting table. How should we proceed from here?

| GO Name | Gene Product DB | Gene Product ID | Symbol | Qualifier | GO Term | GO Evidence Code | Reference | With/From | Taxon ID | Assigned By | Gene Product Name |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| retinal cone cell apoptotic process | UniProtKB | P37242 | Thrb | acts_upstream_of_or_within | GO:0097474 | IMP | PMID:20203194 | | 10090 | MGI | Thyroid hormone receptor beta |
| retinal cone cell apoptotic process | UniProtKB | Q91ZI8 | Dio3 | acts_upstream_of_or_within | GO:0097474 | IMP | PMID:20203194 | | 10090 | MGI | Thyroxine 5-deiodinase |
| gamma-delta T cell differentiation | UniProtKB | A7TZE6 | Skint1 | acts_upstream_of | GO:0042492 | IDA | PMID:21300860 | | 10090 | MGI | Selection and upkeep of intraepithelial T-cells protein 1 |
| gamma-delta T cell lineage commitment | UniProtKB | B3F5L4 | Scart2 | acts_upstream_of_or_within | GO:0002365 | IDA | PMID:18641307 | | 10090 | MGI | SRCR domain-containing protein |
| gamma-delta T cell lineage commitment | UniProtKB | B3F5L5 | Scart2 | acts_upstream_of_or_within | GO:0002365 | IDA | PMID:18641307 | | 10090 | MGI | SRCR domain-containing protein |
| gamma-delta T cell differentiation | UniProtKB | P06240 | Lck | acts_upstream_of | GO:0042492 | IMP | PMID:7807014 | MGI:MGI:1857211 | 10090 | MGI | Proto-oncogene tyrosine-protein kinase LCK |
| gamma-delta T cell differentiation | UniProtKB | P06800 | Ptprc | acts_upstream_of | GO:0042492 | IMP | PMID:7807014 | MGI:MGI:2181288 | 10090 | MGI | Receptor-type tyrosine-protein phosphatase C |
| CD8-positive, gamma-delta intraepithelial T cell differentiation | UniProtKB | P26715 | KLRC1 | involved_in | GO:0002305 | IDA | PMID:18064301 | | 9606 | UniProt | NKG2-A/NKG2-B type II integral membrane protein |
| gamma-delta T cell differentiation | UniProtKB | P42230 | Stat5a | acts_upstream_of | GO:0042492 | IGI | PMID:15294943 | MGI:MGI:103035 | 10090 | MGI | Signal transducer and activator of transcription 5A |
| gamma-delta T cell differentiation | UniProtKB | P42232 | Stat5b | acts_upstream_of | GO:0042492 | IGI | PMID:15294943 | MGI:MGI:103036 | 10090 | MGI | Signal transducer and activator of transcription 5B |
| gamma-delta T cell differentiation | UniProtKB | P48025 | Syk | acts_upstream_of | GO:0042492 | IMP | PMID:7477352 | MGI:MGI:2384078 | 10090 | MGI | Tyrosine-protein kinase SYK |
| gamma-delta T cell differentiation | UniProtKB | P48025 | Syk | acts_upstream_of | GO:0042492 | IMP | PMID:8790395 | MGI:MGI:1857421 | 10090 | MGI | Tyrosine-protein kinase SYK |
| gamma-delta T cell activation | UniProtKB | P97792 | Cxadr | involved_in | GO:0046629 | IDA | PMID:20813954 | | 10090 | UniProt | Coxsackievirus and adenovirus receptor homolog |
| gamma-delta T cell differentiation | UniProtKB | Q00417 | Tcf7 | involved_in | GO:0042492 | IMP | PMID:30413363 | | 10090 | UniProt | Transcription factor 7 |
| gamma-delta T cell activation | UniProtKB | Q03526 | Itk | involved_in | GO:0046629 | IMP | PMID:23562159 | | 10090 | UniProt | Tyrosine-protein kinase ITK/TSK |
| gamma-delta T cell differentiation | UniProtKB | Q04891 | Sox13 | involved_in | GO:0042492 | IMP | PMID:17218525 | | 10090 | UniProt | Transcription factor SOX-13 |
| gamma-delta T cell differentiation | UniProtKB | Q04891 | Sox13 | involved_in | GO:0042492 | IMP | PMID:30413363 | | 10090 | UniProt | Transcription factor SOX-13 |
| gamma-delta T cell activation | UniProtKB | Q29980 | MICB | involved_in | GO:0046629 | IDA | PMID:9497295 | | 9606 | UniProt | MHC class I polypeptide-related sequence B |
| gamma-delta T cell activation | UniProtKB | Q29983 | MICA | involved_in | GO:0046629 | IDA | PMID:9497295 | | 9606 | UniProt | MHC class I polypeptide-related sequence A |
| gamma-delta T cell proliferation | UniProtKB | Q3TSG4 | Alkbh5 | acts_upstream_of | GO:0046630 | IMP | PMID:35939687 | | 10090 | UniProt | RNA demethylase ALKBH5 |
| gamma-delta T cell activation | UniProtKB | Q80UL9 | Jaml | involved_in | GO:0046629 | IDA | PMID:20813954 | | 10090 | UniProt | Junctional adhesion molecule-like |
| gamma-delta T cell lineage commitment | UniProtKB | Q8C9T4 | Scart2 | acts_upstream_of_or_within | GO:0002365 | IDA | PMID:18641307 | | 10090 | MGI | SRCR domain-containing protein |
| CD8-positive, gamma-delta intraepithelial T cell differentiation | UniProtKB | Q8K1Z6 | Gpr18 | acts_upstream_of_or_within | GO:0002305 | IDA | PMID:25348153 | | 10090 | MGI | N-arachidonyl glycine receptor |
| CD8-positive, gamma-delta intraepithelial T cell differentiation | UniProtKB | Q8K1Z6 | Gpr18 | acts_upstream_of_or_within | GO:0002305 | IMP | PMID:25348153 | MGI:MGI:5708707 | 10090 | MGI | N-arachidonyl glycine receptor |
| gamma-delta T cell differentiation | UniProtKB | Q9QYE5 | Jag2 | involved_in | GO:0042492 | IMP | PMID:10383933 | | 10090 | UniProt | Protein jagged-2 |
| CD8-positive, gamma-delta intraepithelial T cell differentiation | UniProtKB | Q9WUT7 | Ccr9 | acts_upstream_of_or_within | GO:0002305 | IMP | PMID:25348153 | MGI:MGI:3654108 | 10090 | MGI | C-C chemokine receptor type 9 |
| erythrocyte maturation | UniProtKB | A8VU90 | Ankle1 | NOT\|involved_in | GO:0043249 | IMP | PMID:27010503 | | 10090 | UniProt | Ankyrin repeat and LEM domain-containing protein 1 |
| erythrocyte development | UniProtKB | F8VPQ2 | Arid4a | acts_upstream_of_or_within | GO:0048821 | IMP | PMID:18728284 | MGI:MGI:3687004 | 10090 | MGI | AT-rich interactive domain-containing protein 4A |
| erythrocyte development | UniProtKB | G5E8Q8 | Adgrf5 | acts_up

dosumis commented 1 month ago

Converting QuickGO output to RDF

| GO NAME | GENE PRODUCT DB | GENE PRODUCT ID | SYMBOL | QUALIFIER | GO TERM | GO EVIDENCE CODE | REFERENCE | WITH/FROM | TAXON ID | ASSIGNED BY | GENE PRODUCT NAME |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| retinal cone cell apoptotic process | UniProtKB | P37242 | Thrb | acts_upstream_of_or_within | GO:0097474 | IMP | PMID:20203194 | | 10090 | MGI | Thyroid hormone receptor beta |

(GO:Class)-[:<QUALIFIER> { has_reference: '<REFERENCE>', evidence: '<GO EVIDENCE CODE>' }]->(protein:Class { label: '<GENE PRODUCT NAME>', symbol: '<SYMBOL>' })-[]->(Species:Class)

IRI/CURIE mappings:

Annotation properties --> literals

label: rdfs:label
symbol: http://purl.obolibrary.org/obo/IAO_0000028
has_reference: http://purl.org/dc/terms/references

Classes

Protein: https://identifiers.org/uniprot: e.g. https://identifiers.org/uniprot:P0DP23

Species: http://purl.obolibrary.org/obo/ e.g. 10090 --> http://purl.obolibrary.org/obo/NCBITaxon_10090
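As an illustration of the mappings above, one QuickGO row could be turned into the IRIs and literals needed for the pattern roughly as follows. This is a sketch under stated assumptions, not the script from the repository: the function name `row_to_edge` and the output dictionary keys are made up, the GO IRI form (obo prefix plus `GO_<id>`) is the standard OBO PURL pattern rather than something stated in this thread, every gene product is assumed to be a UniProtKB accession, and skipping `NOT` qualifiers is a guess at the desired handling of negated annotations.

```python
OBO = "http://purl.obolibrary.org/obo/"
UNIPROT = "https://identifiers.org/uniprot:"


def row_to_edge(row):
    """Map one QuickGO row (dict keyed by column name) to IRIs and literals.

    Returns None for rows this sketch does not handle: negated annotations
    (qualifier starting with 'NOT') and non-UniProtKB gene products.
    """
    qualifier = row["QUALIFIER"]
    if qualifier.startswith("NOT") or row["GENE PRODUCT DB"] != "UniProtKB":
        return None
    return {
        "go": OBO + row["GO TERM"].replace(":", "_"),
        "qualifier": qualifier,
        "protein": UNIPROT + row["GENE PRODUCT ID"],
        "species": OBO + "NCBITaxon_" + row["TAXON ID"],
        # Annotation properties, kept as literals
        "label": row["GENE PRODUCT NAME"],
        "symbol": row["SYMBOL"],
        "has_reference": row["REFERENCE"],
        "evidence": row["GO EVIDENCE CODE"],
    }


example = {
    "GO NAME": "retinal cone cell apoptotic process",
    "GENE PRODUCT DB": "UniProtKB",
    "GENE PRODUCT ID": "P37242",
    "SYMBOL": "Thrb",
    "QUALIFIER": "acts_upstream_of_or_within",
    "GO TERM": "GO:0097474",
    "GO EVIDENCE CODE": "IMP",
    "REFERENCE": "PMID:20203194",
    "WITH/FROM": "",
    "TAXON ID": "10090",
    "ASSIGNED BY": "MGI",
    "GENE PRODUCT NAME": "Thyroid hormone receptor beta",
}
print(row_to_edge(example))
```

The resulting dictionaries could then be written out as ROBOT template rows; how the qualifier maps to an object property IRI is not covered here.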

ubyndr commented 3 weeks ago

I've added the templates and a script to the CellMark repository (see PR #3). We need to create a release for CLM and re-run the CL_KG pipeline to test it.