biothings / biothings_explorer

TRAPI service for BioThings Explorer
https://explorer.biothings.io
Apache License 2.0

adding knowledge_level / agent_type (KL/AT) edge-attributes to all edges (Spring 2024 Translator feature) #792

Closed · colleenXu closed this issue 2 months ago

colleenXu commented 6 months ago

The Translator consortium wants knowledge_level / agent_type (KL/AT) edge-attributes added to all edges.

The format for the edge-attributes is something like this:

```json
[
  {
    "attribute_type_id": "biolink:knowledge_level",
    "value": "knowledge_assertion"
  },
  {
    "attribute_type_id": "biolink:agent_type",
    "value": "manual_agent"
  }
]
```
My interpretation of what KL/AT is and what the terms mean:

**`knowledge_level`**: in general terms, "where / how" was this knowledge generated?

* `knowledge_assertion`: asserted to be true. The Google doc says this is the default, since most statements curated from literature / from authoritative knowledgebases count as this
* `logical_entailment`: from logic (related to ontologies)
* `prediction`: more speculative "hypotheses or possible facts". The Google doc says creative-mode overarching edges count, as well as any KP's "predictions"
* `statistical_association`: using association/correlation predicates, from KPs working with EHR/omics data
* `observation`: "we report this is happening" (adverse events / clinical trials)
* `not_provided`: can't tell what to pick. Use for text-mined edges, since they aren't picking up those nuances

**`agent_type`**: in general terms, "who / what" generated or asserted the knowledge represented on the edge?

* `manual_agent`: a human decided and made the assertion
* `manual_validation_of_automated_agent`: a human reviewed/validated what an automated agent generated (very subtle distinction; not clear if we'll use it)
* `automated_agent`: software-generated; a human didn't decide/review the specific assertion. Can use this term directly, or one of its more-specific children:
  * `data_analysis_pipeline`: statistical association/correlation, using association/correlation predicates (not using rules/inference to say anything bigger/stronger about the relationship)
  * `computational_model`: using rules/inference to say something bigger/stronger about the relationship, or some kind of machine learning
  * `text_mining_agent`: used NLP to get the entities/relationship-type (ID, node category, edge predicate)
  * `image_processing_agent`: from images (like PFOCR)
* `not_provided`: can't tell what to pick
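
For reference while implementing, here's a minimal TypeScript sketch of these value sets as union types (type names are mine, not from the BTE codebase):

```typescript
// Hypothetical type definitions for the KL/AT value sets described above.
type KnowledgeLevel =
  | "knowledge_assertion"
  | "logical_entailment"
  | "prediction"
  | "statistical_association"
  | "observation"
  | "not_provided";

type AgentType =
  | "manual_agent"
  | "manual_validation_of_automated_agent"
  | "automated_agent" // or one of its more-specific children below
  | "data_analysis_pipeline"
  | "computational_model"
  | "text_mining_agent"
  | "image_processing_agent"
  | "not_provided";

// Shape of the TRAPI edge-attributes shown above (a stricter type could
// pair each attribute_type_id with its matching value set).
interface TrapiAttribute {
  attribute_type_id: "biolink:knowledge_level" | "biolink:agent_type";
  value: KnowledgeLevel | AgentType;
}
```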

Documentation:

What needs implementing

Our end

  1. Add `knowledge_level` and `agent_type` fields to x-bte annotation for Service-Provider-only APIs ➡️ transform them into TRAPI edge-attributes (see the sketch after this list). I can coordinate with another dev (probably Jackson @tokebe) on this.
  2. Add edge-attributes for the edges our tool generates (3 kinds?):
    • for subclass_of: we get these from ontologies/vocabs. Both service-provider and BTE return these kinds of edges.
    • for the "inferred" edges built from the subclass_of + KP edge: knowledge_level = logical_entailment, agent_type = automated_agent (according to Matt Brush, Translator Slack link). Both service-provider and BTE return these kinds of edges.
    • for the creative-mode "inferred" edge made from a template: knowledge_level = prediction, agent_type = computational_model. Only BTE returns these kinds of edges.
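
A minimal sketch of the step-1 transform, assuming the x-bte annotation exposes the two fields as plain strings (the actual field names/locations in the SmartAPI yamls may differ):

```typescript
// Sketch only: turn KL/AT fields from an x-bte operation into
// TRAPI edge-attributes, skipping whichever fields are absent.
interface XBteOperation {
  knowledge_level?: string;
  agent_type?: string;
}

interface TrapiAttribute {
  attribute_type_id: string;
  value: string;
}

function klAtAttributes(op: XBteOperation): TrapiAttribute[] {
  const attributes: TrapiAttribute[] = [];
  if (op.knowledge_level) {
    attributes.push({
      attribute_type_id: "biolink:knowledge_level",
      value: op.knowledge_level,
    });
  }
  if (op.agent_type) {
    attributes.push({
      attribute_type_id: "biolink:agent_type",
      value: op.agent_type,
    });
  }
  return attributes;
}
```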

For text-mining / multiomics: two possible options?

For TRAPI KPs, we ingest their edge-attributes (so we leave it to them to implement KL/AT on their edges).


Notes:

(1) there seems to be a hierarchy to the values (see automated_agent). We want to keep this in mind if we ever want to query these as QEdge.attribute_constraints (traverse this hierarchy?). We last discussed these kinds of constraints in https://github.com/biothings/biothings_explorer/issues/482#issuecomment-1691178761, but the hierarchy of terms only applied to qualifier stuff.
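
A hypothetical sketch of what constraint-matching over that hierarchy could look like (the child list is hard-coded for illustration; a real implementation would read the Biolink model):

```typescript
// Hypothetical: expand an agent_type attribute_constraint so it also
// matches the more-specific child terms of the constrained value.
const AGENT_TYPE_CHILDREN: Record<string, string[]> = {
  automated_agent: [
    "data_analysis_pipeline",
    "computational_model",
    "text_mining_agent",
    "image_processing_agent",
  ],
};

function matchesConstraint(edgeValue: string, constraintValue: string): boolean {
  if (edgeValue === constraintValue) return true;
  return (AGENT_TYPE_CHILDREN[constraintValue] ?? []).includes(edgeValue);
}

// matchesConstraint("text_mining_agent", "automated_agent") === true
```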

(2) To team: let's not include `attribute_source` fields in these edge-attributes (they existed in the examples). As confirmed by Matt Brush (Translator Slack link), these are optional fields holding the infores ID of "who assigned the KL/AT terms".

I think it's a little complicated to implement (notes below):

* what about the subclass-related edges, which show up in service-provider-team endpoint responses and BTE responses?
* `service-provider-trapi` in Service-Provider-only KP edges
* `biothings-explorer` for edges built from templates for creative-mode

(3) Matt Brush said subclass edges from CL, UBERON would also have agent_type=manual_agent (Translator Slack link). We don't support these yet.

colleenXu commented 5 months ago

Text-Mining / Multiomics KP situation

With Everaldo/Chunlei (Service Provider side), the CI instances of the following BioThings APIs will be updated with KL/AT edge-attributes

(no news yet: Drug response and text-mining targeted)

We'll watch to see if this works as expected (i.e., BTE ingests and displays these edge-attributes).

colleenXu commented 5 months ago
Notes on UMLS "subclass" relationships

The [node-expansion module](https://github.com/biothings/node-expansion) appears to be using a [parsed version of the Metathesaurus MRREL.RRF file](https://github.com/biothings/node-expansion/blob/e4bb7865135ffdf8b4fc30b8960888c27d29c337/src/config.js#L24C1-L24C59). However, it's not clear to me how the file was parsed. There are 2 types of relationships I'd expect to have been used:

* parent/child (REL = PAR/CHD)
* broader/narrower (REL = RB/RN)

My notes:

* MRREL contains immediate parent/child relationships (ref: [reference manual](https://www.ncbi.nlm.nih.gov/books/NBK9684/#ch02.sec2.4) 2.4.1.1)
* MRREL has a REL field that can be parent/child or broader/narrower (ref: [reference manual](https://www.ncbi.nlm.nih.gov/books/NBK9684/#ch02.sec2.4) 2.4.2, REL abbreviations table on this [page](https://www.nlm.nih.gov/research/umls/knowledge_sources/metathesaurus/release/abbreviations.html))
* parent/child comes from the source vocab, VS broader/narrower is added by UMLS editors (humans?) (ref: REL abbreviations table on this [page](https://www.nlm.nih.gov/research/umls/knowledge_sources/metathesaurus/release/abbreviations.html), 2nd page of this [paper](https://www.d.umn.edu/~tpederse/Pubs/naacl2013-umls-sim-demo.pdf))

Problems?

* ["NLM does not assert parent or child relationships between concepts."](https://documentation.uts.nlm.nih.gov/rest/relations/)?
* [issues in UMLS representing the hierarchy of the original source vocab](https://www.sciencedirect.com/science/article/pii/S1532046410001073)
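
For concreteness, a sketch of pulling out the two REL-based pair types from MRREL.RRF (pipe-delimited; REL is the 4th column per the UMLS reference manual — I haven't verified this against what node-expansion actually does):

```typescript
import { createReadStream } from "node:fs";
import { createInterface } from "node:readline";

// MRREL.RRF columns start CUI1|AUI1|STYPE1|REL|CUI2|..., so REL is
// index 3 and the two concept IDs are indexes 0 and 4.
async function extractHierarchyPairs(path: string) {
  const parentChild: Array<[string, string]> = [];
  const broaderNarrower: Array<[string, string]> = [];
  const rl = createInterface({
    input: createReadStream(path),
    crlfDelay: Infinity,
  });
  for await (const line of rl) {
    const [cui1, , , rel, cui2] = line.split("|");
    if (rel === "PAR" || rel === "CHD") parentChild.push([cui1, cui2]);
    else if (rel === "RB" || rel === "RN") broaderNarrower.push([cui1, cui2]);
  }
  return { parentChild, broaderNarrower };
}
```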

colleenXu commented 5 months ago

Note: the Monarch API plans to add KL/AT fields (https://github.com/monarch-initiative/monarch-app/issues/675). If we want to use these, we'd need to adjust our custom post-processing of their responses (a separate but related issue).

colleenXu commented 5 months ago

@tokebe

I've added knowledge_level and agent_type fields to all x-bte annotation that needs them. And just in case, I think we should add these two edge-attributes to our edge-hash (since we don't want edges created that have multiple KL/AT values from merges).
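
Something like this sketch (not our actual hashing code) is what I mean by folding KL/AT into the edge-hash:

```typescript
import { createHash } from "node:crypto";

// Sketch only: include KL/AT in the hash key so edges that differ in
// KL/AT values hash differently and won't merge into one edge.
interface EdgeForHash {
  subject: string;
  predicate: string;
  object: string;
  knowledge_level?: string;
  agent_type?: string;
}

function edgeHash(edge: EdgeForHash): string {
  const key = [
    edge.subject,
    edge.predicate,
    edge.object,
    edge.knowledge_level ?? "",
    edge.agent_type ?? "",
  ].join("|");
  return createHash("md5").update(key).digest("hex");
}
```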

colleenXu commented 5 months ago

And my notes on the curation process:

I've annotated the yamls (this work is still in progress):

(1) There may be typos in the field names or values, because I manually added these w/o any automated validation to help >.<. I already fixed one typo (knowledge_type -> knowledge_level).

(2) There were many cases where I wasn't sure what terms to pick:

Trouble assigning both values

* **AGR disease-gene associations**: I'm picking these based on my guesses of what's going on…
  * if it wasn't "via orthology": using `knowledge_assertion` / `manual_agent`
  * if it's "via orthology": using `logical_entailment` / `manual_validation_of_automated_agent`
* **DISEASES**:
  * knowledge_level: I picked `not_provided`. Right now, it's a mix because we don't separate by evidence value:
    * text-mined -> `not_provided`
    * experiments -> `statistical_association`
    * knowledge -> `knowledge_assertion`
  * agent_type: I picked `automated_agent` since I assumed there's an automated pipeline for processing all the sources, regardless of evidence type. But the papers aren't super clear on this ([2022](https://doi.org/10.1093/database/baac019), [2015](https://www.sciencedirect.com/science/article/pii/S1046202314003831)).
* **MGIgene2pheno**: I'm picking `knowledge_assertion` / `manual_agent` based on my guesses of what's going on. I've skimmed this [FAQ](https://www.informatics.jax.org/userhelp/disease_connection_help.shtml#hdp_results)
* **MyChem**:
  * [aeolus](https://www.nature.com/articles/sdata201626): picked `observation` / `manual_agent`. Seems like humans originally made the reports, but an automated pipeline was used to assign IDs.
  * chembl: seems to be manual curation, so I picked `knowledge_assertion` / `manual_agent`. But it's a lot of reading to understand exactly where the data is coming from ([paper](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4607714/) linked by [recent update article](https://doi.org/10.1093/nar/gkad1004))
  * drugcentral: using this [paper](https://doi.org/10.1093/nar/gkac1085) as reference
    * bioactivity: seems to be a mix of manual curation and automatic ingest from other resources ("Current data" -> "Bioactivities and drug targets" section)
    * contraindications, drug use, off-label: manually curated according to the first paragraph of the intro
    * adverse events: from faers. Same issue as aeolus, so I picked the same values.
  * fda-orphan-drug-db: picked `observation` / `not_provided`, since it's a database of applications for designations/approvals…
* **MyDisease**:
  * what to do for disgenet ([paper](https://doi.org/10.1093/nar/gkz1021), [website](https://www.disgenet.org/dbinfo)):
    * knowledge_level: I picked `not_provided`. Right now, it's a mix because we don't separate by underlying source
    * agent_type: I picked `automated_agent` since I assumed there's an automated pipeline for processing/integrating all the sources
  * what to do for disease-pheno from hpo-annotations: I'm picking `knowledge_assertion` / `manual_agent` based on assumptions. But in the [evidence part of "phenotype.hpoa format"](https://hpo.jax.org/app/data/annotation-format), it's implied that some info comes from parsing the omim data, and I'm not sure how that affects this.
* **MyGene**:
  * what to do for ConsensusPathDB/cpdb ([paper](https://doi.org/10.1093/nar/gkab1128), [website](http://cpdb.molgen.mpg.de/CPDB)), an aggregator:
    * knowledge_level: I picked `knowledge_assertion`. But I don't know whether it depends on what cpdb is doing or what the underlying sources are doing (KEGG, wikipathways, biocarta)
    * agent_type: I picked `automated_agent` since I assume cpdb is using an automated pipeline to process/integrate all the sources
  * ncbi-gene: same issues as cpdb, it's an aggregator. Picked the same knowledge_level / agent_type as above
    * [how it gets its go-annotations](https://www.ncbi.nlm.nih.gov/books/NBK3840/#genefaq.GO_terms)
  * panther (orthologs): picked `knowledge_assertion` / `computational_model`. [Paper](https://onlinelibrary.wiley.com/doi/10.1002/pro.4218) Figure 4 seems to show that an automated pipeline creates orthologs, without much manual curation.
* **repoDB**:
  * approved drug indications are basically downloaded from drugcentral/drugbank, so I picked `knowledge_assertion` / `automated_agent`. But maybe another term based on drugcentral/drugbank methods would be better?
  * non-approved drug info comes from parsing/cleaning clinicaltrials.gov data, so I picked `observation` / `automated_agent`

Issues assigning agent_type

Picked `not_provided`:

* foodb: can't find any info on their process. No publication; the [website](https://foodb.ca/) says "(obtained from literature)". I can find [cases](https://foodb.ca/compounds/FDB000004) where the food component content is from a different database (phenol explorer)
* fooddata central: can't tell if their process involves human/manual effort vs automated effort. Seems to report experimental data. Ref: [data sources](https://fdc.nal.usda.gov/data-documentation.html), [FAQ](https://fdc.nal.usda.gov/faq.html#q1)
* hpo gene-to-pheno: can't find any info on their process. Info in the [webpage's](https://hpo.jax.org/app/data/annotation-format) "genes_to_phenotype.txt format" section is vague.
* monarch: not sure what to pick; depends on the underlying primary source? [And they may add their own KL/AT assignments](https://github.com/biothings/biothings_explorer/issues/792#issuecomment-2044133771)…
* pharmgkb: when the relationship wasn't listed [here](https://www.pharmgkb.org/page/faqs#what-data-are-manually-curated) as manually-curated, I couldn't tell how the assertion was made

Unsure:

* [bindingdb](https://www.bindingdb.org/): could count as `manual_agent` (the website shows ~half the data is "curated")? But I picked `manual_validation_of_automated_agent` based on this line in the ["Data Collection" section](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4702793/):
  > Data imported from other databases, such as PubChem and ChEMBL, are automatically checked for completeness and certain easily detected errors, and any data flagged by these procedures are reviewed manually and corrected if needed.
* dgidb: seems to use an automated pipeline to ingest many resources (ref: [2021 paper](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7778926/); the [2024 paper](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10767982/) is more vague). So I picked `automated_agent`...
* ebi-proteins uniprot-to-rhea: I'm assuming we are primarily using Swiss-Prot entries, which are human-curated ([ref](https://www.rhea-db.org/help/enzyme%2Dcatalyzing%2Dreaction)). But Trembl would be `automated_agent`...
* iPTMnet: some info seems to be text-mined, vs imported from curated databases (ref: [paper materials and methods](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5753337/#SEC2title)). So I put `automated_agent`
* pharmgkb: assuming `manual_agent`. But it's unclear what info in pharmgkb isn't manually curated; there is a [list](https://www.pharmgkb.org/page/faqs#what-data-are-manually-curated) of what is
* rampDB: I put `automated_agent`. Currently we're only looking at pathway info, which seems to come from an automated pipeline importing from multiple resources (HMDB, KEGG, WikiPathways, Reactome), plus some manual curation for chemical/metabolite ID mappings.

colleenXu commented 5 months ago

@tokebe

I've updated the posts above since all the x-bte annotation work is done.

The rest of step 1 (ingesting/formatting the x-bte fields) + step 2 are yours?

colleenXu commented 4 months ago

@tokebe

For the KL/AT edge-attributes from x-bte annotation...

The format for the constructed edges looks correct/good. I saw examples of all 3 cases.

(Based on a quick review only)

tokebe commented 4 months ago

Latest commits should fix these.

tokebe commented 4 months ago

Related: https://github.com/biothings/biothings_explorer/issues/715 could be done after this issue is reasonably done.

colleenXu commented 3 months ago

Update on Monarch (earlier comment in this issue):

I've updated the KL/AT assignments for Monarch API operations, using the info provided in https://github.com/monarch-initiative/monarch-app/issues/675#issuecomment-2138298559. So we're good for now!

colleenXu commented 2 months ago

The code was deployed today to Prod as part of the Octopus release. I tested and it's live.

I'm closing this issue because our side of the work is done. However, note that Text-Mining/Multiomics haven't yet updated all instances of their BioThings APIs to provide KL/AT edge-attributes (I was keeping notes in a comment here).