biothings / biothings_explorer

TRAPI service for BioThings Explorer
https://explorer.biothings.io
Apache License 2.0

x-bte-refactoring: multiple input/output ID namespaces #748

Open colleenXu opened 11 months ago

colleenXu commented 11 months ago

The motivation

(This is a specific problem that involves large-scale x-bte refactoring. Let's discuss the "large issues" of x-bte refactoring one at a time.)

When writing x-bte operations, there have been many cases where the core difference is the input and/or output ID namespace, while everything else in the "metatriple" stays the same (subject semantic type, predicate, qualifier-set, source, object semantic type). This causes repetition, with operations that differ only slightly, and in some cases a combinatorial explosion in the number of operations to write or maintain.

This affects ALL kinds of APIs we write x-bte annotation for

* all core BioThings APIs (ex: [MyDisease disease <-> pheno](https://github.com/NCATS-Tangerine/translator-api-registry/blob/478c079d20cebb8396a244f45809f2eebbe22a64/mydisease.info/smartapi.yaml#L321), [MyGene pathway <-> gene](https://github.com/NCATS-Tangerine/translator-api-registry/blob/478c079d20cebb8396a244f45809f2eebbe22a64/mygene.info/openapi_full.yml#L357), [MyChem chembl-drug-mech](https://github.com/NCATS-Tangerine/translator-api-registry/blob/478c079d20cebb8396a244f45809f2eebbe22a64/mychem.info/openapi_full.yml#L333), [MyVariant clinvar disease namespaces](https://github.com/biothings/biothings_explorer/issues/548))
* big issue for some pending BioThings APIs:
  * [semmeddb may have hundreds of added operations to cover ncbigene (some of which should have matching umls operations)](https://github.com/biothings/biothings_explorer/issues/644#issuecomment-1669124387)
* a big issue for two Multiomics KPs ([Wellness](https://github.com/Hadlock-Lab/multiomics_wellness_kp/blob/6f550fdc17498198bde8e8bcdd7bd1fc1814323e/multiomics_wellness.yaml#L3105) and [EHR Risk](https://github.com/Hadlock-Lab/clinical_risk_kp/blob/090992831d4309d942970f86238b287acb27c452/ehr_risk_kp.yaml#L361)), causing hundreds of operations to generate and maintain
* happens for some external APIs too
  * [CTD](https://github.com/biothings/biothings_explorer/issues/585) (although not live right now because there are multiple things going on in that issue)
  * [RaMP](https://github.com/biothings/biothings_explorer/issues/705#issuecomment-1692284797) (although not written yet because of other stuff in the issue)

(came out of many discussions, some documented in https://github.com/biothings/biothings_explorer/issues/656)

colleenXu commented 11 months ago

Initial thoughts

This isn't as simple as listing all the ID namespaces:

Stuff not included in this proposal, but I'm thinking over

A. The following proposal handles both "multiple input" and "multiple output" namespaces, but I wonder if handling just 1 (only "multiple outputs"?) is easier

B. I wonder if the "multiple output namespaces" can be handled in post-processing, so only 1 sub-query is needed to get the info for all output namespaces

C. One feature that may be nice for both inputs and outputs is a flag to toggle "multiple namespace prioritization" behavior. This is an issue with BioThings TTD output namespaces, currently handled by using different request info (see example 4 below). But I don't know how possible this is... (it may be more possible if B is implemented...)

* prioritize order that namespaces are listed in operation. Example for outputs: for each response hit, look for which output namespace field is in the response "hit" (in the order of namespaces in the operation). Once a namespace field is found, stop.
* VS no priority: all namespaces should be queried (input) / looked for in the response (output)
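To make the two behaviors in point C concrete, here is a minimal Python sketch of output-side handling. It is not BTE code: the function names, the `prioritize` flag, and the sample hit are all hypothetical, and it assumes each output namespace carries a `prefix` and a dotted `id_field` path into a nested response hit (lists in the middle of a path are not handled).

```python
def get_field(hit, dotted_path):
    """Follow a dotted path such as 'object.mondo' through nested dicts."""
    current = hit
    for key in dotted_path.split("."):
        if not isinstance(current, dict) or key not in current:
            return None
        current = current[key]
    return current


def extract_output_ids(hit, namespaces, prioritize):
    """Collect output CURIEs from one response hit.

    `namespaces` keeps the order in which they are listed in the operation.
    With prioritize=True, stop at the first namespace whose field is present;
    with prioritize=False, collect IDs from every namespace that is present.
    """
    curies = []
    for ns in namespaces:
        value = get_field(hit, ns["id_field"])
        if value is None:
            continue
        values = value if isinstance(value, list) else [value]
        curies.extend(f"{ns['prefix']}:{v}" for v in values)
        if prioritize:
            break  # first namespace found wins
    return curies


# Hypothetical namespaces and hit, loosely modeled on the TTD case.
OUTPUT_NAMESPACES = [
    {"prefix": "MONDO", "id_field": "object.mondo"},
    {"prefix": "ICD11", "id_field": "object.icd11"},
]
sample_hit = {"object": {"mondo": "0005148", "icd11": "5A11"}}

print(extract_output_ids(sample_hit, OUTPUT_NAMESPACES, prioritize=True))
print(extract_output_ids(sample_hit, OUTPUT_NAMESPACES, prioritize=False))
```

If something like this ran in post-processing (idea B), a single sub-query could serve every output namespace, and the flag would only change how each hit is read.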

First proposal

  1. unnest operation contents. Right now every x-bte-kgs-operation object is a one-element array, where that element is the main contents. I don't see a need for an array here. (fairly minor change?)
  2. inputs: make this an object, with semantic (-> semanticType) field held separately from id (-> namespaces) info. Also include input_name info (-> inputs.namespaces.name_field)
  3. requestInfo: includes all sub-query-construction info (requestBody, requestBodyObject, parameters). Uses tags / structure to support when the "multiple input namespaces have different sub-query info" OR the "multiple output namespaces have different sub-query info". But not supporting both in the same operation
  4. outputs: same as inputs, except also including response field for the output ID (-> outputs.namespaces.id_field) and output_name (-> outputs.namespaces.name_field)
  5. response-mapping: only holds edge-attributes / trapi_sources handling. So it's now an optional field (sometimes we don't have that info)
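As a rough illustration of item 3, the sketch below shows how a consumer might pick the sub-query info for one input/output namespace pair. It assumes the field names used in this proposal (`differsByInputNamespace`, `differsByOutputNamespace`, `byInputNamespace`, `byOutputNamespace`); the function `resolve_request_info` and the trimmed-down operation are hypothetical, and `$ref` resolution is assumed to have happened already.

```python
SHARED_KEYS = ("requestBody", "requestBodyObject", "parameters")


def resolve_request_info(operation, input_prefix, output_prefix):
    """Return the sub-query info to use for one input/output namespace pair."""
    info = operation["requestInfo"]
    by_input = info.get("differsByInputNamespace", False)
    by_output = info.get("differsByOutputNamespace", False)
    if by_input and by_output:
        # The proposal does not support both in the same operation.
        raise ValueError("requestInfo may differ by input OR output namespace, not both")
    if by_input:
        return info["byInputNamespace"][input_prefix]
    if by_output:
        return info["byOutputNamespace"][output_prefix]
    # Neither flag set: one shared set of sub-query info.
    return {key: info[key] for key in SHARED_KEYS if key in info}


# Trimmed-down, hypothetical operation in the proposed layout.
operation = {
    "requestInfo": {
        "differsByInputNamespace": True,
        "differsByOutputNamespace": False,
        "byInputNamespace": {
            "OMIM": {"requestBody": {"body": {"q": "{{ queryInputs }}", "scopes": "hpo.omim"}}},
            "ORPHANET": {"requestBody": {"body": {"q": "{{ queryInputs }}", "scopes": "hpo.orphanet"}}},
        },
    }
}

print(resolve_request_info(operation, "ORPHANET", "HP")["requestBody"]["body"]["scopes"])
```

Keeping the two boolean flags explicit makes the unsupported "both differ" case easy to reject up front, at the cost of some redundancy with the presence of the `by...Namespace` keys.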

Examples

1: multiple input namespaces that can be queried together (not different sub-query info)

* has different input `name_field`s
* doesn't have response-mapping
* can query the ID namespaces together because they have unique "patterns for local unique identifiers"
  * KEGG.PATHWAY:hsa00120 ([looks like three lower-case letters, then numeric?](https://bioregistry.io/registry/kegg.pathway))
  * WIKIPATHWAYS:WP2034 ([looks like WP then numeric?](https://bioregistry.io/registry/wikipathways))
  * BIOCARTA:raspathway (looks like all lowercase? [bioregistry](https://bioregistry.io/registry/biocarta.pathway) entry looks different)

(Based on MyGene's [`PathwayHasGene2`](https://github.com/NCATS-Tangerine/translator-api-registry/blob/478c079d20cebb8396a244f45809f2eebbe22a64/mygene.info/openapi_full.yml#L743), `PathwayHasGene3`, `PathwayHasGene4` operations. Not including `PathwayHasGene1` because it has a different source)

```
x-bte-kgs-operations:
  cpdb-PathwayHasGene:
    supportBatch: true
    useTemplating: true
    inputs:
      semanticType: Pathway
      namespaces:
        - prefix: "KEGG.PATHWAY"
          name_field: pathway.kegg.name
        - prefix: WIKIPATHWAYS
          name_field: pathway.wikipathways.name
        - prefix: BIOCARTA
          name_field: pathway.biocarta.name
    requestInfo:
      differsByInputNamespace: false
      differsByOutputNamespace: false
      requestBody:
        body:
          q: "{{ queryInputs }}"
          scopes: pathway.kegg.id,pathway.wikipathways.id,pathway.biocarta.id
      parameters:
        fields: entrezgene,pathway.kegg.name,pathway.wikipathways.name,pathway.biocarta.name
        species: human
        size: 1000
    outputs:
      semanticType: Gene
      namespaces:
        - prefix: NCBIGene
          id_field: entrezgene
    predicate: has_participant
    source: "infores:cpdb"
    ## NO RESPONSE MAPPING: no edge-attributes
```
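The "can these namespaces be queried together?" question can be checked mechanically. The sketch below is only an illustration: the regular expressions are simplified guesses based on the descriptions in this thread, not Bioregistry's official patterns, and the helper names and sample IDs are hypothetical. For contrast it also includes two namespaces whose local IDs are both purely numeric (OMIM and ORPHANET).

```python
import re

# Assumed, simplified local-ID patterns (guesses from this thread,
# NOT the official Bioregistry regular expressions).
PATTERNS = {
    "KEGG.PATHWAY": r"[a-z]{3}\d+",
    "WIKIPATHWAYS": r"WP\d+",
    "BIOCARTA": r"[a-z]+",
    "OMIM": r"\d+",
    "ORPHANET": r"\d+",
}


def matching_namespaces(local_id, prefixes):
    """List every namespace whose assumed pattern fully matches the local ID."""
    return [p for p in prefixes if re.fullmatch(PATTERNS[p], local_id)]


def can_query_together(prefixes, sample_ids):
    """True if each sample ID matches exactly one of the given namespaces."""
    return all(len(matching_namespaces(i, prefixes)) == 1 for i in sample_ids)


pathway_prefixes = ["KEGG.PATHWAY", "WIKIPATHWAYS", "BIOCARTA"]
print(can_query_together(pathway_prefixes, ["hsa00120", "WP2034", "raspathway"]))

# "100100" is an arbitrary numeric sample; it matches both patterns.
print(can_query_together(["OMIM", "ORPHANET"], ["100100"]))
```

A sample-based check like this can only show that a clash exists; it cannot prove that two patterns never overlap, so it is best treated as a sanity check while writing an operation.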

2: multiple input namespaces that are queried separately (diff sub-query info)

* Must query separately because [OMIM](https://bioregistry.io/registry/omim) and [ORPHANET](https://bioregistry.io/registry/orphanet) IDs can be mistaken for each other: they have the same "pattern for local unique identifiers" (numeric)
* no input name field or output name field info
* still has response-mapping for edge-attributes

(Based on MyDisease's [`disease-phenotype`](https://github.com/NCATS-Tangerine/translator-api-registry/blob/478c079d20cebb8396a244f45809f2eebbe22a64/mydisease.info/smartapi.yaml#L774), `disease-phenotype2`)

```
x-bte-kgs-operations:
  disease-phenotype:
    supportBatch: true
    useTemplating: true
    inputs:
      semanticType: Disease
      namespaces:
        - prefix: OMIM
        - prefix: ORPHANET
    requestInfo:
      differsByInputNamespace: true
      differsByOutputNamespace: false
      byInputNamespace:
        OMIM:
          requestBody:
            body:
              q: "{{ queryInputs }}"
              scopes: hpo.omim
          ## using $ref to make less repetitive
          parameters:
            "$ref": "#/components/x-bte-refs/disease-phenotype-parameters"
        ORPHANET:
          requestBody:
            body:
              q: "{{ queryInputs }}"
              scopes: hpo.orphanet
          parameters:
            "$ref": "#/components/x-bte-refs/disease-phenotype-parameters"
    outputs:
      semanticType: PhenotypicFeature
      namespaces:
        - prefix: HP
          id_field: hpo.phenotype_related_to_disease.hpo_id
    predicate: has_phenotype
    source: "infores:hpo-annotations"
    response_mapping:
      "$ref": "#/components/x-bte-response-mapping/disease-phenotype"
x-bte-refs:
  disease-phenotype-parameters:
    fields: >-
      hpo.phenotype_related_to_disease.hpo_id,
      hpo.phenotype_related_to_disease.pmid_refs,
      hpo.phenotype_related_to_disease.isbn_refs,
      hpo.phenotype_related_to_disease.website_refs,
      hpo.phenotype_related_to_disease.numeric_freq,
      hpo.phenotype_related_to_disease.hp_freq,
      hpo.phenotype_related_to_disease.freq_numerator,
      hpo.phenotype_related_to_disease.freq_denominator
x-bte-response-mapping:
  disease-phenotype:
    ref_pmid: hpo.phenotype_related_to_disease.pmid_refs
    ref_isbn: hpo.phenotype_related_to_disease.isbn_refs
    ref_url: hpo.phenotype_related_to_disease.website_refs
    "biolink:has_quotient": hpo.phenotype_related_to_disease.numeric_freq
    "biolink:frequency_qualifier": hpo.phenotype_related_to_disease.hp_freq
    "biolink:has_count": hpo.phenotype_related_to_disease.freq_numerator
    "biolink:has_total": hpo.phenotype_related_to_disease.freq_denominator
```

3: multiple output namespaces (not different sub-query info)

* Reverse of example 1
* `name_field` used for outputs
* doesn't have response-mapping

(Based on MyGene's [`involvedInPathway2`](https://github.com/NCATS-Tangerine/translator-api-registry/blob/478c079d20cebb8396a244f45809f2eebbe22a64/mygene.info/openapi_full.yml#L844C5-L844C23), `involvedInPathway3`, `involvedInPathway4` operations. Not including `involvedInPathway1` because it has a different source)

```
x-bte-kgs-operations:
  cpdb-involvedInPathway:
    supportBatch: true
    useTemplating: true
    inputs:
      semanticType: Gene
      namespaces:
        - prefix: NCBIGene
    requestInfo:
      differsByInputNamespace: false
      differsByOutputNamespace: false
      requestBody:
        body:
          q: "{{ queryInputs }}"
          scopes: entrezgene
      parameters:
        fields: >-
          pathway.kegg.id,pathway.kegg.name,
          pathway.wikipathways.id,pathway.wikipathways.name,
          pathway.biocarta.id,pathway.biocarta.name
        species: human
        size: 1000
    outputs:
      semanticType: Pathway
      namespaces:
        - prefix: "KEGG.PATHWAY"
          id_field: pathway.kegg.id
          name_field: pathway.kegg.name
        - prefix: WIKIPATHWAYS
          id_field: pathway.wikipathways.id
          name_field: pathway.wikipathways.name
        - prefix: BIOCARTA
          id_field: pathway.biocarta.id
          name_field: pathway.biocarta.name
    predicate: participates_in
    source: "infores:cpdb"
    ## NO RESPONSE MAPPING
```

4: multiple input namespaces AND output namespaces, different sub-query info for outputs

* 2 input namespaces: "PUBCHEM.COMPOUND" and "TTD.DRUG". They have different "patterns for local unique identifiers":
  * PUBCHEM.COMPOUND:139600308 ([numeric](https://bioregistry.io/registry/pubchem.compound))
  * TTD.DRUG:DZJ3D5 (has letters and numbers, [bioregistry](https://bioregistry.io/registry/ttd.drug) entry looks different...)
* 2 output namespaces: MONDO and ICD11
* using [post_filter](https://github.com/biothings/biothings_explorer/issues/726) so sub-query info won't differ by input namespace
* but output namespaces do have different sub-query info (parameters)
* doesn't have response-mapping

(Based on BioThings TTD operations: [`pubchem_treats_mondo`](https://github.com/NCATS-Tangerine/translator-api-registry/blob/478c079d20cebb8396a244f45809f2eebbe22a64/ttd/smartapi.yaml#L735), `pubchem_treats_icd11`, `ttd_drug_id_treats_mondo`, `ttd_drug_id_treats_icd11`)

```
x-bte-kgs-operations:
  chemical-treats-disease:
    supportBatch: true
    useTemplating: true
    inputs:
      semanticType: SmallMolecule
      namespaces:
        - prefix: "PUBCHEM.COMPOUND"
          name_field: subject.name
        - prefix: "TTD.DRUG"
          name_field: subject.name
    requestInfo:
      differsByInputNamespace: false
      differsByOutputNamespace: true
      byOutputNamespace:
        MONDO:
          requestBody:
            "$ref": "#/components/x-bte-refs/requestBody-chemTreatsDisease"
          parameters:
            fields: object.mondo,object.name,subject.name
            post_filter: 'association.predicate:"biolink:treats"'
            size: 1000
        ICD11:
          requestBody:
            "$ref": "#/components/x-bte-refs/requestBody-chemTreatsDisease"
          parameters:
            fields: object.icd11,object.name,subject.name
            post_filter: 'association.predicate:"biolink:treats" AND (NOT _exists_:object.mondo)'
            size: 1000
    outputs:
      semanticType: Disease
      namespaces:
        - prefix: MONDO
          id_field: object.mondo
          name_field: object.name
        - prefix: ICD11
          id_field: object.icd11
          name_field: object.name
    predicate: treats
    source: "infores:ttd"
    ## NO RESPONSE MAPPING
x-bte-refs:
  requestBody-chemTreatsDisease:
    body:
      q: "{{ queryInputs }}"
      scopes: subject.pubchem_compound,subject.ttd_drug_id
```
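In example 4 the MONDO-over-ICD11 priority is written by hand into each `post_filter`. As a sketch of how that could be derived from the namespace order instead, the hypothetical helper below adds one `NOT _exists_:` clause per higher-priority field. The function name is an assumption for illustration; the filter syntax is copied from the example above.

```python
def build_priority_filters(base_filter, ordered_id_fields):
    """Build one post_filter per output ID field, in priority order.

    The filter for each field excludes hits that already have any
    higher-priority field, so every hit is returned by at most one sub-query.
    """
    filters = {}
    for position, field in enumerate(ordered_id_fields):
        exclusions = [
            f"(NOT _exists_:{earlier})" for earlier in ordered_id_fields[:position]
        ]
        filters[field] = " AND ".join([base_filter] + exclusions)
    return filters


filters = build_priority_filters(
    'association.predicate:"biolink:treats"',
    ["object.mondo", "object.icd11"],
)
for field, post_filter in filters.items():
    print(field, "->", post_filter)
```

With the two fields from example 4, this reproduces the two hand-written `post_filter` values; a third namespace would simply accumulate a second exclusion clause.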

5: multiple input namespaces AND output namespaces, different sub-query info for inputs

* 3 input ID namespaces: HP, NCIT, SNOMEDCT
  * must query separately because [HP](https://bioregistry.io/registry/hp) and [SNOMEDCT](https://bioregistry.io/registry/snomedct) IDs can be mistaken for each other (both are numeric) -> different request bodies
  * using [post_filter](https://github.com/biothings/biothings_explorer/issues/726) for cleaner text
* 3 output ID namespaces: MONDO, NCIT, SNOMEDCT

Based on 6 current Multiomics EHR Risk operations (since 3 combinations don't actually exist in the data...)

* [`PhenoHP_increased_DiseaseMONDO`](https://github.com/Hadlock-Lab/clinical_risk_kp/blob/090992831d4309d942970f86238b287acb27c452/ehr_risk_kp.yaml#L396C50-L396C80)
* `PhenoHP_increased_DiseaseNCIT`
* `PhenoHP_increased_DiseaseSNOMEDCT`
* `PhenoNCIT_increased_DiseaseMONDO`
* `PhenoSNOMEDCT_increased_DiseaseMONDO`
* `PhenoSNOMEDCT_increased_DiseaseSNOMEDCT`

```
x-bte-kgs-operations:
  pheno-increased-disease:
    supportBatch: true
    useTemplating: true
    inputs:
      semanticType: PhenotypicFeature
      namespaces:
        - prefix: HP
          name_field: subject.name
        - prefix: NCIT
          name_field: subject.name
        - prefix: SNOMEDCT
          name_field: subject.name
    requestInfo:
      differsByInputNamespace: true
      differsByOutputNamespace: false
      byInputNamespace:
        HP:
          requestBody:
            "$ref": "#/components/x-bte-refs/requestInfo_HP"
          parameters:
            "$ref": "#/components/x-bte-refs/params_phenoIncreasedDisease"
        NCIT:
          requestBody:
            "$ref": "#/components/x-bte-refs/requestInfo_NCIT"
          parameters:
            "$ref": "#/components/x-bte-refs/params_phenoIncreasedDisease"
        SNOMEDCT:
          requestBody:
            "$ref": "#/components/x-bte-refs/requestInfo_SNOMEDCT"
          parameters:
            "$ref": "#/components/x-bte-refs/params_phenoIncreasedDisease"
    outputs:
      semanticType: Disease
      namespaces:
        - prefix: MONDO
          id_field: object.MONDO
          name_field: object.name
        - prefix: NCIT
          id_field: object.NCIT
          name_field: object.name
        - prefix: SNOMEDCT
          id_field: object.SNOMEDCT
          name_field: object.name
    predicate: has_real_world_evidence_of_association_with
    qualifiers:
      object_direction_qualifier: increased
      object_aspect_qualifier: likelihood
    response_mapping:
      "$ref": "#/components/x-bte-response-mapping/edge-info"
x-bte-response-mapping:
  edge-info:
    edge-attributes: association.edge_attributes
    trapi_sources: source.edge_sources
x-bte-refs:
  ## each requestInfo_* ref is used under a requestBody key, so it holds the body directly
  requestInfo_HP:
    body:
      q: "{{ queryInputs | rmPrefix() }}"
      scopes: subject.HP
  requestInfo_NCIT:
    body:
      q: "{{ queryInputs | rmPrefix() }}"
      scopes: subject.NCIT
  requestInfo_SNOMEDCT:
    body:
      q: "{{ queryInputs | rmPrefix() }}"
      scopes: subject.SNOMEDCT
  params_phenoIncreasedDisease:
    fields: >-
      object.MONDO,object.NCIT,object.SNOMEDCT,
      association.edge_attributes,source.edge_sources,
      subject.name,object.name
    size: 1000
    post_filter: >-
      subject.type:"biolink:PhenotypicFeature" AND
      association.predicate:associated_with_increased_likelihood_of AND
      object.type:"biolink:Disease"
```
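When sub-query info differs by input namespace, a batch of input CURIEs has to be split by prefix before any request is built. The sketch below shows one way this might look. The helper names and sample CURIEs are hypothetical, the prefix stripping only approximates what the `rmPrefix()` filter in the example appears to do, and the scopes are taken from example 5.

```python
from collections import defaultdict


def group_by_prefix(curies):
    """Split CURIEs like 'HP:0001250' into {prefix: [local IDs]}."""
    groups = defaultdict(list)
    for curie in curies:
        prefix, _, local_id = curie.partition(":")
        groups[prefix].append(local_id)
    return dict(groups)


def plan_sub_queries(curies, scopes_by_prefix):
    """Build one sub-query plan per input namespace present in the batch."""
    plans = []
    for prefix, local_ids in group_by_prefix(curies).items():
        if prefix not in scopes_by_prefix:
            continue  # the operation does not accept this namespace
        plans.append({"q": local_ids, "scopes": scopes_by_prefix[prefix]})
    return plans


# Scopes from example 5; the input CURIEs are arbitrary samples.
SCOPES = {"HP": "subject.HP", "NCIT": "subject.NCIT", "SNOMEDCT": "subject.SNOMEDCT"}
batch = ["HP:0001250", "SNOMEDCT:91175000", "HP:0002315"]

for plan in plan_sub_queries(batch, SCOPES):
    print(plan)
```

Because each plan carries its own scope, purely numeric HP and SNOMEDCT local IDs never land in the same request, which is the reason example 5 needs separate request bodies.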

rjawesome commented 10 months ago

I set up this proposal in the multiple-input-output branch using the smartapi-kg and api-respone-transform.js repositories