biothings / biothings_explorer

TRAPI service for BioThings Explorer
https://explorer.biothings.io
Apache License 2.0

adding knowledge_level / agent_type (KL/AT) edge-attributes to all edges (Spring 2024 Translator feature) #792

Closed · colleenXu closed this issue 2 months ago

colleenXu commented 6 months ago

The Translator consortium wants knowledge_level / agent_type (KL/AT) edge-attributes added to all edges.

The format for the edge-attributes is something like this:

```json
[
  {
    "attribute_type_id": "biolink:knowledge_level",
    "value": "knowledge_assertion"
  },
  {
    "attribute_type_id": "biolink:agent_type",
    "value": "manual_agent"
  }
]
```
My interpretation of what KL/AT is and what the terms mean:

**`knowledge_level`**: in general terms, "where / how" was this knowledge generated?

* `knowledge_assertion`: asserted to be true. The Google doc says this is the default, since most statements curated from literature / from authoritative knowledgebases count as this
* `logical_entailment`: from logic (related to ontologies)
* `prediction`: more speculative "hypotheses or possible facts". The Google doc says creative-mode overarching edges count, as well as any KP's "predictions"
* `statistical_association`: using association/correlation predicates, from KPs working with EHR/omics data
* `observation`: "we report this is happening" (adverse events / clinical trials)
* `not_provided`: can't tell what to pick. Use for text-mined edges, since they aren't picking up those nuances

**`agent_type`**: in general terms, "who / what" generated or asserted the knowledge represented on the edge?

* `manual_agent`: a human decided and made the assertion
* `manual_validation_of_automated_agent`: a human reviewed/validated what an automated agent generated (very subtle distinction; not clear if we'll use it)
* `automated_agent`: software-generated; a human didn't decide/review the specific assertion. Can use this term directly, or one of its more-specific children:
  * `data_analysis_pipeline`: statistical association/correlation, using association/correlation predicates (not using rules/inference to say anything bigger/stronger about the relationship)
  * `computational_model`: using rules/inference to say something bigger/stronger about the relationship, or some kind of machine learning
  * `text_mining_agent`: used NLP to get the entities/relationship-type (ID, node category, edge predicate)
  * `image_processing_agent`: from images (like PFOCR)
* `not_provided`: can't tell what to pick
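
For reference while implementing, here's a minimal TypeScript sketch of these value sets as union types (type names are mine, not from the BTE codebase):

```typescript
// Hypothetical type definitions for the KL/AT value sets described above.
type KnowledgeLevel =
  | "knowledge_assertion"
  | "logical_entailment"
  | "prediction"
  | "statistical_association"
  | "observation"
  | "not_provided";

type AgentType =
  | "manual_agent"
  | "manual_validation_of_automated_agent"
  | "automated_agent" // or one of its more-specific children below
  | "data_analysis_pipeline"
  | "computational_model"
  | "text_mining_agent"
  | "image_processing_agent"
  | "not_provided";

// Shape of the TRAPI edge-attributes shown above (a stricter type could
// pair each attribute_type_id with its matching value set).
interface TrapiAttribute {
  attribute_type_id: "biolink:knowledge_level" | "biolink:agent_type";
  value: KnowledgeLevel | AgentType;
}
```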

Documentation:

What needs implementing

Our end

  1. Add `knowledge_level` and `agent_type` fields to x-bte annotation for Service-Provider-only APIs ➡️ transform them into TRAPI edge-attributes (see the sketch after this list). I can coordinate with another dev (probably Jackson @tokebe) on this.
  2. Add edge-attributes for the edges our tool generates (3 kinds?):
    • for subclass_of: we get these from ontologies/vocabs. Both service-provider and BTE return these kinds of edges.
    • for the "inferred" edges built from the subclass_of + KP edge: knowledge_level = logical_entailment, agent_type = automated_agent (according to Matt Brush, Translator Slack link). Both service-provider and BTE return these kinds of edges.
    • for the creative-mode "inferred" edge made from a template: knowledge_level = prediction, agent_type = computational_model. Only BTE returns these kinds of edges.
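
A minimal sketch of the step-1 transform, assuming the x-bte annotation exposes the two fields as plain strings (the actual field names/locations in the SmartAPI yamls may differ):

```typescript
// Sketch only: turn KL/AT fields from an x-bte operation into
// TRAPI edge-attributes, skipping whichever fields are absent.
interface XBteOperation {
  knowledge_level?: string;
  agent_type?: string;
}

interface TrapiAttribute {
  attribute_type_id: string;
  value: string;
}

function klAtAttributes(op: XBteOperation): TrapiAttribute[] {
  const attributes: TrapiAttribute[] = [];
  if (op.knowledge_level) {
    attributes.push({
      attribute_type_id: "biolink:knowledge_level",
      value: op.knowledge_level,
    });
  }
  if (op.agent_type) {
    attributes.push({
      attribute_type_id: "biolink:agent_type",
      value: op.agent_type,
    });
  }
  return attributes;
}
```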

For text-mining / multiomics: two possible options?

For TRAPI KPs, we ingest their edge-attributes (so we leave it to them to implement KL/AT on their edges).


Notes:

(1) there seems to be a hierarchy to the values (see automated_agent). We want to keep this in mind if we ever want to query these as QEdge.attribute_constraints (traverse this hierarchy?). We last discussed these kinds of constraints in https://github.com/biothings/biothings_explorer/issues/482#issuecomment-1691178761, but the hierarchy of terms only applied to qualifier stuff.
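
A hypothetical sketch of what constraint-matching over that hierarchy could look like (the child list is hard-coded for illustration; a real implementation would read the Biolink model):

```typescript
// Hypothetical: expand an agent_type attribute_constraint so it also
// matches the more-specific child terms of the constrained value.
const AGENT_TYPE_CHILDREN: Record<string, string[]> = {
  automated_agent: [
    "data_analysis_pipeline",
    "computational_model",
    "text_mining_agent",
    "image_processing_agent",
  ],
};

function matchesConstraint(edgeValue: string, constraintValue: string): boolean {
  if (edgeValue === constraintValue) return true;
  return (AGENT_TYPE_CHILDREN[constraintValue] ?? []).includes(edgeValue);
}

// matchesConstraint("text_mining_agent", "automated_agent") === true
```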

(2) To team: let's not include `attribute_source` fields in these edge-attributes (they existed in the examples). As confirmed by Matt Brush (Translator Slack link), these are optional fields holding the infores ID of "who assigned the KL/AT terms".

I think it's a little complicated to implement (notes below):

* what about the subclass-related edges, which show up in service-provider-team endpoint responses and BTE responses?
* `service-provider-trapi` in Service-Provider-only KP edges
* `biothings-explorer` for edges built from templates for creative-mode

(3) Matt Brush said subclass edges from CL, UBERON would also have agent_type=manual_agent (Translator Slack link). We don't support these yet.

colleenXu commented 5 months ago

Text-Mining / Multiomics KP situation

With Everaldo/Chunlei (Service Provider side), the CI instances of the following BioThings APIs will be updated with KL/AT edge-attributes

(no news yet: Drug response and text-mining targeted)

We'll watch to see if this works as expected (i.e., BTE ingests and displays these edge-attributes).

colleenXu commented 5 months ago
Notes on UMLS "subclass" relationships

The [node-expansion module](https://github.com/biothings/node-expansion) appears to be using a [parsed version of the Metathesaurus MRREL.RRF file](https://github.com/biothings/node-expansion/blob/e4bb7865135ffdf8b4fc30b8960888c27d29c337/src/config.js#L24C1-L24C59). However, it's not clear to me how the file was parsed. There are 2 types of relationships I'd expect to have been used:

* parent/child (REL = PAR/CHD)
* broader/narrower (REL = RB/RN)

My notes:

* MRREL contains immediate parent/child relationships (ref: [reference manual](https://www.ncbi.nlm.nih.gov/books/NBK9684/#ch02.sec2.4) 2.4.1.1)
* MRREL has a REL field that can be parent/child or broader/narrower (ref: [reference manual](https://www.ncbi.nlm.nih.gov/books/NBK9684/#ch02.sec2.4) 2.4.2, REL abbreviations table on this [page](https://www.nlm.nih.gov/research/umls/knowledge_sources/metathesaurus/release/abbreviations.html))
* parent/child comes from the source vocab, VS broader/narrower is added by UMLS editors (humans?) (ref: REL abbreviations table on this [page](https://www.nlm.nih.gov/research/umls/knowledge_sources/metathesaurus/release/abbreviations.html), 2nd page of this [paper](https://www.d.umn.edu/~tpederse/Pubs/naacl2013-umls-sim-demo.pdf))

Problems?

* ["NLM does not assert parent or child relationships between concepts."](https://documentation.uts.nlm.nih.gov/rest/relations/)?
* [issues in UMLS representing the hierarchy of the original source vocab](https://www.sciencedirect.com/science/article/pii/S1532046410001073)
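
For concreteness, a sketch of pulling out the two REL-based pair types from MRREL.RRF (pipe-delimited; REL is the 4th column per the UMLS reference manual — I haven't verified this against what node-expansion actually does):

```typescript
import { createReadStream } from "node:fs";
import { createInterface } from "node:readline";

// MRREL.RRF columns start CUI1|AUI1|STYPE1|REL|CUI2|..., so REL is
// index 3 and the two concept IDs are indexes 0 and 4.
async function extractHierarchyPairs(path: string) {
  const parentChild: Array<[string, string]> = [];
  const broaderNarrower: Array<[string, string]> = [];
  const rl = createInterface({
    input: createReadStream(path),
    crlfDelay: Infinity,
  });
  for await (const line of rl) {
    const [cui1, , , rel, cui2] = line.split("|");
    if (rel === "PAR" || rel === "CHD") parentChild.push([cui1, cui2]);
    else if (rel === "RB" || rel === "RN") broaderNarrower.push([cui1, cui2]);
  }
  return { parentChild, broaderNarrower };
}
```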

colleenXu commented 5 months ago

Note: the Monarch API plans to add KL/AT fields (https://github.com/monarch-initiative/monarch-app/issues/675). If we want to use these, we'd need to adjust our custom post-processing of their responses (a separate but related issue).

colleenXu commented 5 months ago

@tokebe

I've added knowledge_level and agent_type fields to all x-bte annotation that needs them. And just in case, I think we should add these two edge-attributes to our edge-hash (since we don't want edges created that have multiple KL/AT values from merges).
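
Something like this sketch (not our actual hashing code) is what I mean by folding KL/AT into the edge-hash:

```typescript
import { createHash } from "node:crypto";

// Sketch only: include KL/AT in the hash key so edges that differ in
// KL/AT values hash differently and won't merge into one edge.
interface EdgeForHash {
  subject: string;
  predicate: string;
  object: string;
  knowledge_level?: string;
  agent_type?: string;
}

function edgeHash(edge: EdgeForHash): string {
  const key = [
    edge.subject,
    edge.predicate,
    edge.object,
    edge.knowledge_level ?? "",
    edge.agent_type ?? "",
  ].join("|");
  return createHash("md5").update(key).digest("hex");
}
```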

colleenXu commented 5 months ago

And my notes on the curation process:

I've annotated the yamls (this work is still in progress):

(1) There may be typos in the field names or values, because I manually added these w/o any automated validation to help >.<. I already fixed one typo (knowledge_type -> knowledge_level).

(2) There were many cases where I wasn't sure what terms to pick:

Trouble assigning both values

* **AGR disease-gene associations**: I'm picking these based on my guesses of what's going on…
  * if it wasn't "via orthology": using `knowledge_assertion` / `manual_agent`
  * if it's "via orthology": using `logical_entailment` / `manual_validation_of_automated_agent`
* **DISEASES**:
  * knowledge_level: I picked `not_provided`. Right now, it's a mix because we don't separate by evidence value:
    * text-mined -> `not_provided`
    * experiments -> `statistical_association`
    * knowledge -> `knowledge_assertion`
  * agent_type: I picked `automated_agent` since I assumed there's an automated pipeline for processing all the sources, regardless of evidence type. But the papers aren't super clear on this ([2022](https://doi.org/10.1093/database/baac019), [2015](https://www.sciencedirect.com/science/article/pii/S1046202314003831)).
* **MGIgene2pheno**: I'm picking `knowledge_assertion` / `manual_agent` based on my guesses of what's going on. I've skimmed this [FAQ](https://www.informatics.jax.org/userhelp/disease_connection_help.shtml#hdp_results)
* **MyChem**:
  * [aeolus](https://www.nature.com/articles/sdata201626): picked `observation` / `manual_agent`. Seems like humans originally made the reports, but an automated pipeline was used to assign IDs.
  * chembl: seems to be manual curation, so I picked `knowledge_assertion` / `manual_agent`. But it's a lot of reading to understand exactly where the data is coming from ([paper](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4607714/) linked by [recent update article](https://doi.org/10.1093/nar/gkad1004))
  * drugcentral: using this [paper](https://doi.org/10.1093/nar/gkac1085) as reference
    * bioactivity: seems to be a mix of manual curation and automatic ingest from other resources ("Current data" -> "Bioactivities and drug targets" section)
    * contraindications, drug use, off-label: manually curated according to the first paragraph of the intro
    * adverse events: from faers. Same issue as aeolus, so I picked the same values.
  * fda-orphan-drug-db: picked `observation` / `not_provided`, since it's a database of applications for designations/approvals…
* **MyDisease**:
  * what to do for disgenet ([paper](https://doi.org/10.1093/nar/gkz1021), [website](https://www.disgenet.org/dbinfo)):
    * knowledge_level: I picked `not_provided`. Right now, it's a mix because we don't separate by underlying source
    * agent_type: I picked `automated_agent` since I assumed there's an automated pipeline for processing/integrating all the sources
  * what to do for disease-pheno from hpo-annotations: I'm picking `knowledge_assertion` / `manual_agent` based on assumptions. But in the [evidence part of "phenotype.hpoa format"](https://hpo.jax.org/app/data/annotation-format), it's implied that some info comes from parsing the omim data, and I'm not sure how that affects this.
* **MyGene**:
  * what to do for ConsensusPathDB/cpdb ([paper](https://doi.org/10.1093/nar/gkab1128), [website](http://cpdb.molgen.mpg.de/CPDB)), an aggregator:
    * knowledge_level: I picked `knowledge_assertion`. But I don't know whether it depends on what cpdb is doing or what the underlying sources are doing (KEGG, wikipathways, biocarta)
    * agent_type: I picked `automated_agent` since I assume cpdb is using an automated pipeline to process/integrate all the sources
  * ncbi-gene: same issues as cpdb, it's an aggregator. Picked the same knowledge_level / agent_type as above
    * [how it gets its go-annotations](https://www.ncbi.nlm.nih.gov/books/NBK3840/#genefaq.GO_terms)
  * panther (orthologs): picked `knowledge_assertion` / `computational_model`. [Paper](https://onlinelibrary.wiley.com/doi/10.1002/pro.4218) Figure 4 seems to show that an automated pipeline creates orthologs, without much manual curation.
* **repoDB**:
  * approved drug indications are basically downloaded from drugcentral/drugbank, so I picked `knowledge_assertion` / `automated_agent`. But maybe another term based on drugcentral/drugbank methods would be better?
  * non-approved drug info comes from parsing/cleaning clinicaltrials.gov data, so I picked `observation` / `automated_agent`

Issues assigning agent_type

Picked `not_provided`:

* foodb: can't find any info on their process. No publication; the [website](https://foodb.ca/) says "(obtained from literature)". I can find [cases](https://foodb.ca/compounds/FDB000004) where the food component content is from a different database (phenol explorer)
* fooddata central: can't tell if their process involves human/manual effort vs automated effort. Seems to report experimental data. Ref: [data sources](https://fdc.nal.usda.gov/data-documentation.html), [FAQ](https://fdc.nal.usda.gov/faq.html#q1)
* hpo gene-to-pheno: can't find any info on their process. Info in the [webpage's](https://hpo.jax.org/app/data/annotation-format) "genes_to_phenotype.txt format" section is vague.
* monarch: not sure what to pick; depends on the underlying primary source? [And they may add their own KL/AT assignments](https://github.com/biothings/biothings_explorer/issues/792#issuecomment-2044133771)…
* pharmgkb: when the relationship wasn't listed [here](https://www.pharmgkb.org/page/faqs#what-data-are-manually-curated) as manually-curated, I couldn't tell how the assertion was made

Unsure:

* [bindingdb](https://www.bindingdb.org/): could count as `manual_agent` (the website shows ~half the data is "curated")? But I picked `manual_validation_of_automated_agent` based on this line in the ["Data Collection" section](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4702793/):
  > Data imported from other databases, such as PubChem and ChEMBL, are automatically checked for completeness and certain easily detected errors, and any data flagged by these procedures are reviewed manually and corrected if needed.
* dgidb: seems to use an automated pipeline to ingest many resources (ref: [2021 paper](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7778926/); the [2024 paper](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10767982/) is more vague). So I picked `automated_agent`...
* ebi-proteins uniprot-to-rhea: I'm assuming we are primarily using Swiss-Prot entries, which are human-curated ([ref](https://www.rhea-db.org/help/enzyme%2Dcatalyzing%2Dreaction)). But Trembl would be `automated_agent`...
* iPTMnet: some info seems to be text-mined, vs imported from curated databases (ref: [paper materials and methods](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5753337/#SEC2title)). So I put `automated_agent`
* pharmgkb: assuming `manual_agent`. But it's unclear what info in pharmgkb isn't manually curated; there is a [list](https://www.pharmgkb.org/page/faqs#what-data-are-manually-curated) of what is
* rampDB: I put `automated_agent`. Currently we're only looking at pathway info, which seems to come from an automated pipeline importing from multiple resources (HMDB, KEGG, WikiPathways, Reactome), plus some manual curation for chemical/metabolite ID mappings.

colleenXu commented 5 months ago

@tokebe

I've updated the posts above since all the x-bte annotation work is done.

The rest of step 1 (ingesting/formatting the x-bte fields) + step 2 are yours?

colleenXu commented 4 months ago

@tokebe

For the KL/AT edge-attributes from x-bte annotation...

The format for the constructed edges looks correct/good. I saw examples of all 3 cases.

(Based on a quick review only)

tokebe commented 4 months ago

Latest commits should fix these.

tokebe commented 4 months ago

Related: https://github.com/biothings/biothings_explorer/issues/715 could be done after this issue is reasonably done.

colleenXu commented 3 months ago

Update on Monarch (earlier comment in this issue):

I've updated the KL/AT assignments for Monarch API operations, using the info provided in https://github.com/monarch-initiative/monarch-app/issues/675#issuecomment-2138298559. So we're good for now!

colleenXu commented 2 months ago

The code was deployed today to Prod as part of the Octopus release. I tested and it's live.

I'm closing this issue because our side of the work is done. However, note that Text-Mining/Multiomics haven't yet updated all instances of their BioThings APIs to provide KL/AT edge-attributes (I was keeping notes in a comment here).