Cellular-Semantics / CL_KG

Building a Cell Ontology Knowledge-Base from data, and LLMs
Apache License 2.0

EPIC: Write Makefile to drive RDF-OWL generation for KG construction from CxG corpus #1

Closed dosumis closed 3 months ago

dosumis commented 6 months ago

The pipeline will generate RDF-OWL files by running Pandasaurus_CxG across CellXGene datasets accessed via the CxG Census.
Configuration of labelsets will come from:

Some work to extend pandasaurus_cxg may be needed.

TBD: should we be storing cell ID lists?

dosumis commented 6 months ago

MVP:

  1. Write Makefile that:

    • Reads CSV input detailing datasets and author_category fields (CSV to be provided by @dosumis)
    • Pulls AnnData file from Census
    • Runs Pandasaurus to generate RDF
  2. Run OBASK:

    • Import all output RDF files from step 1.
    • Ontologies: CL; GO; Uberon
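
The CSV-reading part of step 1 could be prototyped in Python before it is wired into the Makefile. This is a minimal sketch only: the CSV layout has not been specified yet, so the column names `dataset_id` and `author_category_fields`, the semicolon separator, and the `rdf/<dataset_id>.owl` naming scheme are all assumptions. The Census download and the Pandasaurus call are left out.

```python
import csv
import io


def read_dataset_config(handle):
    """Map each dataset ID to the author_category fields listed for it."""
    config = {}
    for row in csv.DictReader(handle):
        dataset_id = row["dataset_id"].strip()
        fields = config.setdefault(dataset_id, [])
        # Assumed format: several field names in one cell, separated by ';'.
        for name in row["author_category_fields"].split(";"):
            name = name.strip()
            if name and name not in fields:
                fields.append(name)
    return config


def planned_outputs(config, out_dir="rdf"):
    """Name one RDF output file per dataset (naming scheme is illustrative)."""
    return {dataset_id: f"{out_dir}/{dataset_id}.owl" for dataset_id in config}


# Illustrative input only; the dataset IDs are placeholders, not real Census datasets.
example_csv = io.StringIO(
    "dataset_id,author_category_fields\n"
    "dataset-a,author_cell_type;cluster\n"
    "dataset-b,subclass\n"
    "dataset-a,cluster;supercluster\n"
)
config = read_dataset_config(example_csv)
for dataset_id, path in planned_outputs(config).items():
    print(dataset_id, config[dataset_id], "->", path)
```

Each entry in `planned_outputs` would correspond to one Makefile target, so a dataset is only re-processed when its output file is missing.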

Next steps:

  1. Extend Pandasaurus to pull more ontology metadata from CxG AnnData & to include dataset metadata in RDF output.
    • Add dataset node. Populate with metadata in uns + UUID of dataset. <-- DO THIS FIRST.
  2. Write Python module to pull GO annotations from the QuickGO API for the GO terms with direct links to CL terms in graph. @dosumis to provide spec.
  3. Write import pipeline to pull in NS-Forest markers from Richard Scheuermann's analysis (markers should be looked up in AnnData var).
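
For step 2, a rough starting point for the QuickGO module might look like the following. The spec is still to come, so this is only a sketch: it assumes the QuickGO annotation search endpoint with the `goId`, `taxonId`, `limit` and `page` query parameters, which should be checked against the QuickGO API documentation. The function names are hypothetical, and the GO terms in the example are arbitrary rather than ones known to be linked to CL terms.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request, urlopen

# Assumed endpoint; verify against the QuickGO API documentation.
QUICKGO_SEARCH = "https://www.ebi.ac.uk/QuickGO/services/annotation/search"


def build_annotation_url(go_ids, taxon_id, limit=100, page=1):
    """Build a search URL for annotations to the given GO terms in one species."""
    params = {
        "goId": ",".join(sorted(set(go_ids))),  # de-duplicate, stable order
        "taxonId": taxon_id,
        "limit": limit,
        "page": page,
    }
    # Keep ':' and ',' unescaped so the URL stays readable.
    return QUICKGO_SEARCH + "?" + urlencode(params, safe=":,")


def fetch_annotations(go_ids, taxon_id, limit=100, page=1):
    """Fetch one page of results as parsed JSON (makes a network call)."""
    request = Request(
        build_annotation_url(go_ids, taxon_id, limit, page),
        headers={"Accept": "application/json"},
    )
    with urlopen(request, timeout=30) as response:
        return json.load(response)


# Example: build (but do not send) a request for two GO terms in human.
url = build_annotation_url(["GO:0008150", "GO:0006915", "GO:0008150"], taxon_id=9606)
print(url)
```

Keeping URL construction separate from the network call makes the query logic testable offline. Paging through results is left to the caller here.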

Challenges:

Reconciling/linking gene & protein IDs. These may come in as PRO, UniProt or Ensembl IDs, and we need to record species. Decision to make: aim for unified gene nodes that collect IDs (as ID-label pairs)?
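
To make the "unified gene node" option concrete, here is one possible shape for it, sketched in Python. It is not a decision, and the class names, the `same_as` argument and the CURIE prefixes are hypothetical. Each node is scoped to a species and collects ID-label pairs. Equivalence between IDs has to be supplied by the caller (for example from a mapping table), and merging two nodes that already exist is not handled.

```python
from dataclasses import dataclass, field


@dataclass
class GeneNode:
    species: str                             # e.g. an NCBITaxon CURIE
    ids: dict = field(default_factory=dict)  # CURIE -> label (ID-label pairs)


class GeneRegistry:
    """Collects IDs from different sources onto shared, per-species gene nodes."""

    def __init__(self):
        self._by_id = {}  # (species, CURIE) -> GeneNode

    def add(self, species, curie, label, same_as=None):
        """Attach an ID to the node of `same_as` if known, else start a new node."""
        node = self._by_id.get((species, curie))
        if node is None and same_as is not None:
            node = self._by_id.get((species, same_as))
        if node is None:
            node = GeneNode(species=species)
        node.ids[curie] = label
        self._by_id[(species, curie)] = node
        return node


# Illustrative IDs for human CD4; prefixes are assumptions about the eventual scheme.
registry = GeneRegistry()
human = "NCBITaxon:9606"
gene = registry.add(human, "ENSEMBL:ENSG00000010610", "CD4")
protein = registry.add(human, "UniProtKB:P01730", "CD4_HUMAN",
                       same_as="ENSEMBL:ENSG00000010610")
print(gene is protein, sorted(gene.ids))
```

Keying the registry on the species as well as the ID keeps orthologues in different species on separate nodes, which covers the requirement to record species.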