A1 - Core Brick Creation

tomlue commented 1 year ago

From the grant

In this deliverable a set of 6+ assets will be chosen to transform into local, isolated, knowledge graphs. Eventually every compatible BioBricks-AI asset will be added to the knowledge graph.

Let's go a step further and actually create those local knowledge graphs.

- [x] Tox21
- [x] ToxCast
- [x] HGNC @john-shaffer
- [x] MESH
- [x] ICE
- [x] build Uniprot-kg and integrate

zmughal commented 1 year ago

A possible route is to use RDF HDT format to create locally accessible graphs: https://www.rdfhdt.org/what-is-hdt/.

This can then be either used directly for queries on that particular graph using a query API such as:

or loaded into any graph database. An RDF database would be easiest, but a property graph database would also work, provided a schema can be supplied (this may be where Biolink comes in):

example with JanusGraph https://github.com/mpolonioli/rdf-to-janusgraph/blob/master/rdf-to-janusgraph/src/main/java/net/mpolonioli/rdftojanusgraph/RdfToJanusGraph.java.
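To make the property-graph route a bit more concrete, here is a minimal Python sketch of how a supplied schema could decide which RDF triples become node properties and which become edges. It is not taken from the linked JanusGraph loader; the `ex:` identifiers, the predicate names, and the `EDGE_PREDICATES` schema are all hypothetical and chosen only for illustration.

```python
# Hypothetical schema: predicates listed here become edges; every other
# predicate is stored as a property on the subject node.
EDGE_PREDICATES = {"ex:hasParent", "ex:interactsWith"}


def triples_to_property_graph(triples, edge_predicates=EDGE_PREDICATES):
    """Split (subject, predicate, object) triples into nodes and edges."""
    nodes = {}  # node id -> {property name: value}
    edges = []  # (source id, edge label, target id)
    for s, p, o in triples:
        props = nodes.setdefault(s, {})
        if p in edge_predicates:
            # The object of an edge predicate is itself a node.
            nodes.setdefault(o, {})
            edges.append((s, p, o))
        else:
            # Note: a repeated predicate overwrites the earlier value here.
            props[p] = o
    return nodes, edges


# Illustrative triples only; these are not real records from any brick.
example_triples = [
    ("ex:chem1", "ex:name", "Example chemical"),
    ("ex:chem1", "ex:hasParent", "ex:chemRoot"),
    ("ex:chemRoot", "ex:name", "Root term"),
]

nodes, edges = triples_to_property_graph(example_triples)
```

An RDF store needs none of this, since it keeps every triple as-is. The extra schema step is the cost of going to a property graph, and a real loader would also have to handle multi-valued properties and literal datatypes, which this sketch ignores.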

zmughal commented 1 year ago

Re: Biolink and connecting with #2 and #3,

For the property graph DB: https://biolink.github.io/biolink-model/about/mapping-neo4j
For some of the kinds of entities being modeled: https://biolink.github.io/biolink-model/docs/ChemicalEntity.html. I will need to look more into how specific identifiers are modeled.

zmughal commented 1 year ago

Just wanted to note this here since it will come up when these graphs are merged:

While I was specifically looking at ctdbase, I noticed that the ctdbase$CTD_chemicals$ChemicalID column is supposed to be

ChemicalID (MeSH identifier)

per https://ctdbase.org/downloads/#allchems, but as I was trying to characterise which MeSH terms are used, I noticed "MESH:D" in the data:

R code looking at `nchar` for each ChemicalID

```r
> ctdbase <- bbload('ctdbase'); ctdbase$CTD_chemicals |> mutate( len = nchar(ChemicalID) ) |> count(len) |> collect()
[04/Sep/2023 00:38:27] INFO - checking 7ff45fac905951febda25e2ab26e014990adcc0e for ctdbase
[1] "loading CTD_exposure_events"
[1] "loading CTD_diseases"
[1] "loading CTD_chemicals_diseases"
[1] "loading CTD_diseases_pathways"
[1] "loading CTD_chem_gene_ixn_types"
[1] "loading CTD_genes"
[1] "loading CTD_chem_pathways_enriched"
[1] "loading CTD_pathways"
[1] "loading CTD_pheno_term_ixns"
[1] "loading CTD_Phenotype-Disease_cellular_component_associations"
[1] "loading CTD_genes_diseases"
[1] "loading CTD_anatomy"
[1] "loading CTD_Phenotype-Disease_biological_process_associations"
[1] "loading CTD_Phenotype-Disease_molecular_function_associations"
[1] "loading CTD_chem_gene_ixns"
[1] "loading CTD_chemicals"
[1] "loading CTD_genes_pathways"
[1] "loading CTD_exposure_studies"
[1] "loading CTD_chem_go_enriched"
# A tibble: 3 × 2
    len      n
1    12 154507
2    15  21402
3     6      1
> ctdbase$CTD_chemicals |> filter(! grepl("^MESH:[CD].", ChemicalID ) ) |> collect()
# A tibble: 1 × 8
  ChemicalName ChemicalID CasRN Definition ParentIDs TreeNumbers
1 Chemicals    MESH:D     NA    NA         NA        D
# ℹ 2 more variables: ParentTreeNumbers, Synonyms
```
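For anyone who wants to run the same sanity check outside R, here is a rough Python equivalent of the two steps in the session above: count identifiers by string length, then list the ones that do not match `^MESH:[CD].`. The function name `summarize_ids` and the sample identifiers are made up for illustration (the strings were picked only to have lengths 12, 15, and 6); they are not pulled from the ctdbase brick.

```python
import re
from collections import Counter

# Same pattern as the grepl() call in the R session: "MESH:" followed by
# C or D and at least one more character.
MESH_ID = re.compile(r"^MESH:[CD].")


def summarize_ids(ids):
    """Return (count of ids per string length, ids not matching MESH_ID)."""
    length_counts = Counter(len(i) for i in ids)
    non_matching = [i for i in ids if not MESH_ID.search(i)]
    return length_counts, non_matching


# Illustrative identifiers only, not real rows from CTD_chemicals.
sample_ids = ["MESH:D000001", "MESH:C000000001", "MESH:D"]

length_counts, non_matching = summarize_ids(sample_ids)
print(dict(length_counts))
print(non_matching)
```

On the real table, the same two outputs would correspond to the `count(len)` tibble and the single `MESH:D` row shown above.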

MESH:D (if it existed in MeSH) would not be an instance of ChemicalEntity itself, but rather the concept/class. The same goes for many parent terms in the CTD Chemical Vocabulary. Again, I will need to look at how Biolink deals with this.
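One rough way to find the class-like candidates is to collect every identifier that appears in some other row's `ParentIDs`. The Python sketch below does that over plain dictionaries. It assumes `ParentIDs` is a `|`-delimited string (an assumption about the CTD export, not something checked against the brick), and the rows are invented for illustration. Whether a parent term should really be modeled as a class rather than an instance is still a Biolink modeling question; this only flags candidates.

```python
def parent_term_ids(rows, delimiter="|"):
    """Collect every identifier that is listed as a parent of another term."""
    parents = set()
    for row in rows:
        parent_field = row.get("ParentIDs")
        if not parent_field:  # None or empty string: no parents recorded
            continue
        parents.update(p for p in parent_field.split(delimiter) if p)
    return parents


# Invented rows; the identifiers are placeholders, not real CTD terms.
rows = [
    {"ChemicalID": "MESH:D", "ParentIDs": None},
    {"ChemicalID": "MESH:D0000A", "ParentIDs": "MESH:D"},
    {"ChemicalID": "MESH:D0000B", "ParentIDs": "MESH:D0000A"},
    {"ChemicalID": "MESH:D0000C", "ParentIDs": "MESH:D0000A|MESH:D"},
]

parents = parent_term_ids(rows)
# Terms that are never anyone's parent are the leaf-level candidates.
leaves = [r["ChemicalID"] for r in rows if r["ChemicalID"] not in parents]
```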

CTD's description of their Chemical Vocabulary indicates that it isn't exactly the MeSH vocabulary, but a version modified in some places (MESH:D is one example). I will have to see how other approaches, such as https://robokop.renci.org/api-docs/docs/automat/ctd, deal with this.

tomlue commented 1 year ago

MeSH terms aren't a great choice of chemical identifier. If ctdbase doesn't have something closer to a 1-1 mapping, then it might be better to start with a different source. When we do integrate ctdbase, we will need to associate chemicals with their MeSH term (or whatever identifier ctdbase is using).

PubChem annotations have this information, but crawling all of PubChem politely would take too long, and they don't offer a bulk download of annotations. I have been trying to reach out to PubChem about this for some time (https://twitter.com/pubchem/status/1686056545337917441), but I can increase my efforts. There is a PubChem brick right now, but it has bioassay and chemical SDF data, and only a small subset of the annotation data.

I'm open to any solutions you have, but just moving along to other assets might be best. ICE and ChEMBL are probably good choices.

tomlue commented 1 year ago

@zmughal Toxicokinetics was a highlighted topic at EUROTOX. John Wambaugh talked about how lack of data is a big concern in that space. The httk package currently distributes some toxicokinetics data, but I'm not sure where it comes from. Maybe we need an httk brick? Or a brick for the sources it pulls from? Adding toxicokinetics to OKG would probably be helpful for the tox community.

biobricks-ai / biobricks-okg

A1 - Core Brick Creation #1