x-bte-refactoring: multiple input/output ID namespaces #748

Open colleenXu opened 11 months ago

colleenXu commented 11 months ago

The motivation

(This is a specific problem that involves large-scale x-bte refactoring. Let's discuss the "large issues" of x-bte refactoring one at a time.)

When writing x-bte operations, there have been many cases where the core difference is the input and/or output ID namespace and everything else in the "metatriple" stays the same (subject semantic type, predicate, qualifier-set, source, object semantic type). This causes "repetition" with only slightly differing operations and, in some cases, a combinatorial explosion in the number of operations to write or maintain.

This affects ALL kinds of APIs we write x-bte annotation for

* all core BioThings APIs (ex: MyDisease disease <-> pheno, MyGene pathway <-> gene, MyChem chembl-drug-mech, MyVariant clinvar disease namespaces)
* a big issue for some pending BioThings APIs:
  * semmeddb may have hundreds of added operations to cover ncbigene (some of which should have matching umls operations)
* a big issue for two Multiomics KPs (Wellness and EHR Risk), causing hundreds of operations to generate and maintain
* happens for some external APIs too
  * CTD (although not live right now because there are multiple other things going on in that issue)
  * RaMP (although not written yet because of other stuff in the issue)

(came out of many discussions, some documented in

colleenXu commented 11 months ago

Initial thoughts

This isn't as simple as listing all the ID namespaces:

Stuff not included in this proposal, but I'm thinking over

A. The following proposal handles both "multiple input" and "multiple output" namespaces, but I wonder if handling just 1 (only "multiple outputs"?) is easier

B. I wonder if the "multiple output namespaces" can be handled in post-processing, so only 1 sub-query is needed to get the info for all output namespaces

C. One feature that may be nice for both inputs and outputs is a flag to toggle "multiple namespace prioritization" behavior. This is an issue with BioThings TTD output namespaces, currently handled by using different request info (see example 4 below). But I don't know how possible this is... (it may be more possible if B is implemented...)

* prioritize order that namespaces are listed in operation. Example for outputs: for each response hit, look for which output namespace field is in the response "hit" (in the order of namespaces in the operation). Once a namespace field is found, stop.
* VS no priority: all namespaces should be queried (input) / looked for in the response (output)
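
To make the two behaviors in (C) concrete for outputs, here is a rough Python sketch. It is not BTE code: the `prioritize` flag, the helper names, and the placeholder hit are all made up for illustration. The `id_field` paths are the ones used in example 4 below.

```python
from typing import Any


def get_field(hit: dict[str, Any], dotted_path: str) -> Any:
    """Follow a dotted path such as 'object.mondo' through nested dicts."""
    current: Any = hit
    for key in dotted_path.split("."):
        if not isinstance(current, dict) or key not in current:
            return None
        current = current[key]
    return current


def extract_output_ids(hit, namespaces, prioritize):
    """Return (prefix, id) pairs found in one response hit.

    prioritize=True: stop at the first namespace (in operation order) present.
    prioritize=False: collect every namespace present in the hit.
    """
    found = []
    for namespace in namespaces:
        value = get_field(hit, namespace["id_field"])
        if value is None:
            continue
        found.append((namespace["prefix"], value))
        if prioritize:
            break
    return found


# Output namespaces, in the order they are listed in the operation.
namespaces = [
    {"prefix": "MONDO", "id_field": "object.mondo"},
    {"prefix": "ICD11", "id_field": "object.icd11"},
]
# Placeholder hit; the ID values are made up.
hit = {"object": {"mondo": "MONDO_PLACEHOLDER", "icd11": "ICD11_PLACEHOLDER"}}

prioritized = extract_output_ids(hit, namespaces, prioritize=True)
unprioritized = extract_output_ids(hit, namespaces, prioritize=False)
```

With the flag on, only the MONDO pair comes back for this hit; with it off, both pairs do.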

First proposal

  1. unnest operation contents. Right now every x-bte-kgs-operation object is a one-element array, where that element is the main contents. I don't see a need for an array here. (fairly minor change?)
  2. inputs: make this an object, with semantic (-> semanticType) field held separately from id (-> namespaces) info. Also include input_name info (-> inputs.namespaces.name_field)
  3. requestInfo: includes all sub-query-construction info (requestBody, requestBodyObject, parameters). Uses tags / structure to support when the "multiple input namespaces have different sub-query info" OR the "multiple output namespaces have different sub-query info". But not supporting both in the same operation
  4. outputs: same as inputs, except also including response field for the output ID (-> outputs.namespaces.id_field) and output_name (-> outputs.namespaces.name_field)
  5. response-mapping: only holds edge-attributes / trapi_sources handling. So it's now an optional field (sometimes we don't have that info)
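
One way to read point 3 is as a set of rules a parser could check. The following Python sketch is only my interpretation of those rules (not both flags at once, and one tagged entry per namespace when a flag is set); the function name and error messages are hypothetical.

```python
def validate_request_info(operation: dict) -> list[str]:
    """Check the requestInfo rules from point 3 and return a list of problems."""
    errors = []
    info = operation.get("requestInfo", {})
    by_input = bool(info.get("differsByInputNamespace"))
    by_output = bool(info.get("differsByOutputNamespace"))

    # The proposal does not support both kinds of tagging in one operation.
    if by_input and by_output:
        errors.append("requestInfo cannot differ by both input and output namespace")

    checks = (
        (by_input, "byInputNamespace", "inputs"),
        (by_output, "byOutputNamespace", "outputs"),
    )
    for flag, key, side in checks:
        if not flag:
            continue
        prefixes = {ns["prefix"] for ns in operation[side]["namespaces"]}
        tagged = set(info.get(key, {}))
        missing = sorted(prefixes - tagged)
        if missing:
            errors.append(f"{key} is missing entries for: {', '.join(missing)}")
    return errors


operation = {
    "inputs": {
        "semanticType": "Disease",
        "namespaces": [{"prefix": "OMIM"}, {"prefix": "ORPHANET"}],
    },
    "requestInfo": {
        "differsByInputNamespace": True,
        "differsByOutputNamespace": False,
        "byInputNamespace": {"OMIM": {}},  # ORPHANET deliberately left out
    },
    "outputs": {
        "semanticType": "PhenotypicFeature",
        "namespaces": [{"prefix": "HP"}],
    },
}
errors = validate_request_info(operation)
```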


1: multiple input namespaces that can be queried together (not different sub-query info)

* has different input `name_field`s
* doesn't have response-mapping
* can query the ID namespaces together because they have unique "patterns for local unique identifiers"
  * KEGG.PATHWAY:hsa00120 (looks like three lower-case letters, then numeric?)
  * WIKIPATHWAYS:WP2034 (looks like WP then numeric?)
  * BIOCARTA:raspathway (looks like all lowercase? bioregistry entry looks different)

(Based on MyGene's `PathwayHasGene2`, `PathwayHasGene3`, `PathwayHasGene4` operations. Not including `PathwayHasGene1` because it has a different source)

```
x-bte-kgs-operations:
  cpdb-PathwayHasGene:
    supportBatch: true
    useTemplating: true
    inputs:
      semanticType: Pathway
      namespaces:
        - prefix: "KEGG.PATHWAY"
          name_field:
        - prefix: WIKIPATHWAYS
          name_field:
        - prefix: BIOCARTA
          name_field:
    requestInfo:
      differsByInputNamespace: false
      differsByOutputNamespace: false
      requestBody:
        body:
          q: "{{ queryInputs }}"
          scopes:,,
      parameters:
        fields: entrezgene,,,
        species: human
        size: 1000
    outputs:
      semanticType: Gene
      namespaces:
        - prefix: NCBIGene
          id_field: entrezgene
    predicate: has_participant
    source: "infores:cpdb"
    ## NO RESPONSE MAPPING: no edge-attributes
```
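
A quick way to sanity-check the "can these namespaces be queried together?" question is to test sample IDs against each namespace's pattern. The sketch below is a heuristic only: the regexes are my guesses from the "looks like..." notes above (not taken from any registry), the OMIM/ORPHANET sample IDs are made-up numeric placeholders, and passing on a few samples does not prove the patterns never overlap.

```python
import re
from itertools import combinations


def ambiguous_pairs(patterns: dict[str, str], sample_ids: dict[str, str]) -> list[tuple[str, str]]:
    """Return namespace pairs where one sample ID also matches the other's pattern."""
    pairs = []
    for a, b in combinations(patterns, 2):
        a_looks_like_b = re.fullmatch(patterns[b], sample_ids[a]) is not None
        b_looks_like_a = re.fullmatch(patterns[a], sample_ids[b]) is not None
        if a_looks_like_b or b_looks_like_a:
            pairs.append((a, b))
    return pairs


# Guessed patterns for the three pathway namespaces in this example.
pathway_patterns = {
    "KEGG.PATHWAY": r"[a-z]{3}\d+",
    "WIKIPATHWAYS": r"WP\d+",
    "BIOCARTA": r"[a-z]+",
}
pathway_samples = {
    "KEGG.PATHWAY": "hsa00120",
    "WIKIPATHWAYS": "WP2034",
    "BIOCARTA": "raspathway",
}

# Contrast: two numeric-only namespaces (placeholder sample IDs).
disease_patterns = {"OMIM": r"\d+", "ORPHANET": r"\d+"}
disease_samples = {"OMIM": "100100", "ORPHANET": "558"}

pathway_conflicts = ambiguous_pairs(pathway_patterns, pathway_samples)
disease_conflicts = ambiguous_pairs(disease_patterns, disease_samples)
```

Here the pathway namespaces produce no conflicts, while the numeric-only pair is flagged, which is the situation example 2 deals with.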

2: multiple input namespaces that are queried separately (diff sub-query info)

* Must query separately because OMIM and ORPHANET IDs can be mistaken for each other: they have the same "pattern for local unique identifiers" (numeric)
* no input name field or output name field info
* still has response-mapping for edge-attributes

(Based on MyDisease's `disease-phenotype`, `disease-phenotype2`)

```
x-bte-kgs-operations:
  disease-phenotype:
    supportBatch: true
    useTemplating: true
    inputs:
      semanticType: Disease
      namespaces:
        - prefix: OMIM
        - prefix: ORPHANET
    requestInfo:
      differsByInputNamespace: true
      differsByOutputNamespace: false
      byInputNamespace:
        OMIM:
          requestBody:
            body:
              q: "{{ queryInputs }}"
              scopes: hpo.omim
          ## using $ref to make less repetitive
          parameters:
            "$ref": "#/components/x-bte-refs/disease-phenotype-parameters"
        ORPHANET:
          requestBody:
            body:
              q: "{{ queryInputs }}"
              scopes: hpo.orphanet
          parameters:
            "$ref": "#/components/x-bte-refs/disease-phenotype-parameters"
    outputs:
      semanticType: PhenotypicFeature
      namespaces:
        - prefix: HP
          id_field: hpo.phenotype_related_to_disease.hpo_id
    predicate: has_phenotype
    source: "infores:hpo-annotations"
    response_mapping:
      "$ref": "#/components/x-bte-response-mapping/disease-phenotype"
x-bte-refs:
  disease-phenotype-parameters:
    fields: >-
      hpo.phenotype_related_to_disease.hpo_id,
      hpo.phenotype_related_to_disease.pmid_refs,
      hpo.phenotype_related_to_disease.isbn_refs,
      hpo.phenotype_related_to_disease.website_refs,
      hpo.phenotype_related_to_disease.numeric_freq,
      hpo.phenotype_related_to_disease.hp_freq,
      hpo.phenotype_related_to_disease.freq_numerator,
      hpo.phenotype_related_to_disease.freq_denominator
x-bte-response-mapping:
  disease-phenotype:
    ref_pmid: hpo.phenotype_related_to_disease.pmid_refs
    ref_isbn: hpo.phenotype_related_to_disease.isbn_refs
    ref_url: hpo.phenotype_related_to_disease.website_refs
    "biolink:has_quotient": hpo.phenotype_related_to_disease.numeric_freq
    "biolink:frequency_qualifier": hpo.phenotype_related_to_disease.hp_freq
    "biolink:has_count": hpo.phenotype_related_to_disease.freq_numerator
    "biolink:has_total": hpo.phenotype_related_to_disease.freq_denominator
```

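For example 2, the query builder would have to split the incoming IDs by namespace and pick the matching `byInputNamespace` entry. Below is a small Python sketch of that step. It is an illustration only: the function name is hypothetical, the CURIEs are made-up placeholders, and I'm assuming the local ID (without prefix) is what goes into `q`.

```python
from collections import defaultdict


def build_sub_queries(curies: list[str], request_info: dict) -> list[dict]:
    """Group input CURIEs by prefix and pair each group with its request info."""
    grouped: dict[str, list[str]] = defaultdict(list)
    for curie in curies:
        prefix, _, local_id = curie.partition(":")
        grouped[prefix].append(local_id)

    sub_queries = []
    for prefix, local_ids in grouped.items():
        if request_info["differsByInputNamespace"]:
            # One tagged entry per input namespace.
            template = request_info["byInputNamespace"][prefix]
        else:
            # Shared request info for every input namespace.
            template = request_info
        scopes = template["requestBody"]["body"]["scopes"]
        sub_queries.append({"scopes": scopes, "q": local_ids})
    return sub_queries


# Trimmed-down requestInfo from the disease-phenotype example.
request_info = {
    "differsByInputNamespace": True,
    "differsByOutputNamespace": False,
    "byInputNamespace": {
        "OMIM": {"requestBody": {"body": {"scopes": "hpo.omim"}}},
        "ORPHANET": {"requestBody": {"body": {"scopes": "hpo.orphanet"}}},
    },
}

# Placeholder IDs, not real records.
sub_queries = build_sub_queries(
    ["OMIM:100100", "ORPHANET:558", "OMIM:100200"], request_info
)
```

Each namespace ends up in its own sub-query, so a numeric OMIM ID is never sent to the `hpo.orphanet` scope.
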
3: multiple output namespaces (not different sub-query info)

* Reverse of example 1
* `name_field` used for outputs
* doesn't have response-mapping

(Based on MyGene's `involvedInPathway2`, `involvedInPathway3`, `involvedInPathway4` operations. Not including `involvedInPathway1` because it has a different source)

```
x-bte-kgs-operations:
  cpdb-involvedInPathway:
    supportBatch: true
    useTemplating: true
    inputs:
      semanticType: Gene
      namespaces:
        - prefix: NCBIGene
    requestInfo:
      differsByInputNamespace: false
      differsByOutputNamespace: false
      requestBody:
        body:
          q: "{{ queryInputs }}"
          scopes: entrezgene
      parameters:
        fields: >-,,,,,
        species: human
        size: 1000
    outputs:
      semanticType: Pathway
      namespaces:
        - prefix: "KEGG.PATHWAY"
          id_field:
          name_field:
        - prefix: WIKIPATHWAYS
          id_field:
          name_field:
        - prefix: BIOCARTA
          id_field:
          name_field:
    predicate: participates_in
    source: "infores:cpdb"
    ## NO RESPONSE MAPPING
```

4: multiple input namespaces AND output namespaces, different sub-query info for outputs

* 2 input namespaces: "PUBCHEM.COMPOUND" and "TTD.DRUG". They have different "patterns for local unique identifiers":
  * PUBCHEM.COMPOUND:139600308 (numeric)
  * TTD.DRUG:DZJ3D5 (has letters and numbers, bioregistry entry looks different...)
* 2 output namespaces: MONDO and ICD11
* using `post_filter` so sub-query info won't differ by input namespace
* but output namespaces do have different sub-query info (parameters)
* doesn't have response-mapping

(Based on BioThings TTD operations: `pubchem_treats_mondo`, `pubchem_treats_icd11`, `ttd_drug_id_treats_mondo`, `ttd_drug_id_treats_icd11`)

```
x-bte-kgs-operations:
  chemical-treats-disease:
    supportBatch: true
    useTemplating: true
    inputs:
      semanticType: SmallMolecule
      namespaces:
        - prefix: "PUBCHEM.COMPOUND"
          name_field:
        - prefix: "TTD.DRUG"
          name_field:
    requestInfo:
      differsByInputNamespace: false
      differsByOutputNamespace: true
      byOutputNamespace:
        MONDO:
          requestBody:
            "$ref": "#/components/x-bte-refs/requestBody-chemTreatsDisease"
          parameters:
            fields: object.mondo,,
            post_filter: 'association.predicate:"biolink:treats"'
            size: 1000
        ICD11:
          requestBody:
            "$ref": "#/components/x-bte-refs/requestBody-chemTreatsDisease"
          parameters:
            fields: object.icd11,,
            post_filter: 'association.predicate:"biolink:treats" AND (NOT _exists_:object.mondo)'
            size: 1000
    outputs:
      semanticType: Disease
      namespaces:
        - prefix: MONDO
          id_field: object.mondo
          name_field:
        - prefix: ICD11
          id_field: object.icd11
          name_field:
    predicate: treats
    source: "infores:ttd"
    ## NO RESPONSE MAPPING
x-bte-refs:
  requestBody-chemTreatsDisease:
    body:
      q: "{{ queryInputs }}"
      scopes: subject.pubchem_compound,subject.ttd_drug_id
```
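
The two `post_filter` strings in this example are what give MONDO priority over ICD11 on the request side. The Python sketch below only simulates that filtering on in-memory hits to show the effect; it does not call the API, the hit values are placeholders, and the function name is hypothetical.

```python
def outputs_from_both_sub_queries(hits: list[dict]) -> list[tuple[str, str]]:
    """Simulate the MONDO and ICD11 sub-queries and collect (prefix, id) pairs."""
    results = []
    for hit in hits:
        obj = hit.get("object", {})
        treats = hit.get("association", {}).get("predicate") == "biolink:treats"

        # MONDO sub-query: keeps every "treats" hit, reads the ID from object.mondo.
        if treats and "mondo" in obj:
            results.append(("MONDO", obj["mondo"]))

        # ICD11 sub-query: additionally requires object.mondo to be absent
        # (the "NOT _exists_:object.mondo" clause), reads the ID from object.icd11.
        if treats and "mondo" not in obj and "icd11" in obj:
            results.append(("ICD11", obj["icd11"]))
    return results


# Placeholder hits; the ID values are made up.
hits = [
    {"association": {"predicate": "biolink:treats"},
     "object": {"mondo": "M1", "icd11": "I1"}},
    {"association": {"predicate": "biolink:treats"},
     "object": {"icd11": "I2"}},
    {"association": {"predicate": "biolink:other"},
     "object": {"mondo": "M3"}},
]
results = outputs_from_both_sub_queries(hits)
```

Each hit contributes at most one output ID: the first hit only yields its MONDO ID even though it also has an ICD11 field.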

5: multiple input namespaces AND output namespaces, different sub-query info for inputs

* 3 input ID namespaces: HP, NCIT, SNOMEDCT
  * must query separately because HP and SNOMEDCT IDs can be mistaken for each other (both are numeric) -> different request bodies
  * using `post_filter` for cleaner text
* 3 output ID namespaces: MONDO, NCIT, SNOMEDCT

Based on 6 current Multiomics EHR Risk operations (since 3 combinations don't actually exist in the data...)

* `PhenoHP_increased_DiseaseMONDO`
* `PhenoHP_increased_DiseaseNCIT`
* `PhenoHP_increased_DiseaseSNOMEDCT`
* `PhenoNCIT_increased_DiseaseMONDO`
* `PhenoSNOMEDCT_increased_DiseaseMONDO`
* `PhenoSNOMEDCT_increased_DiseaseSNOMEDCT`

```
x-bte-kgs-operations:
  pheno-increased-disease:
    supportBatch: true
    useTemplating: true
    inputs:
      semanticType: PhenotypicFeature
      namespaces:
        - prefix: HP
          name_field:
        - prefix: NCIT
          name_field:
        - prefix: SNOMEDCT
          name_field:
    requestInfo:
      differsByInputNamespace: true
      differsByOutputNamespace: false
      byInputNamespace:
        HP:
          requestBody:
            "$ref": "#/components/x-bte-refs/requestInfo_HP"
          parameters:
            "$ref": "#/components/x-bte-refs/params_phenoIncreasedDisease"
        NCIT:
          requestBody:
            "$ref": "#/components/x-bte-refs/requestInfo_NCIT"
          parameters:
            "$ref": "#/components/x-bte-refs/params_phenoIncreasedDisease"
        SNOMEDCT:
          requestBody:
            "$ref": "#/components/x-bte-refs/requestInfo_SNOMEDCT"
          parameters:
            "$ref": "#/components/x-bte-refs/params_phenoIncreasedDisease"
    outputs:
      semanticType: Disease
      namespaces:
        - prefix: MONDO
          id_field: object.MONDO
          name_field:
        - prefix: NCIT
          id_field: object.NCIT
          name_field:
        - prefix: SNOMEDCT
          id_field: object.SNOMEDCT
          name_field:
    predicate: has_real_world_evidence_of_association_with
    qualifiers:
      object_direction_qualifier: increased
      object_aspect_qualifier: likelihood
    response_mapping:
      "$ref": "#/components/x-bte-response-mapping/edge-info"
x-bte-response-mapping:
  edge-info:
    edge-attributes: association.edge_attributes
    trapi_sources: source.edge_sources
x-bte-refs:
  ## each entry is referenced under requestBody above, so it starts at body
  requestInfo_HP:
    body:
      q: "{{ queryInputs | rmPrefix() }}"
      scopes: subject.HP
  requestInfo_NCIT:
    body:
      q: "{{ queryInputs | rmPrefix() }}"
      scopes: subject.NCIT
  requestInfo_SNOMEDCT:
    body:
      q: "{{ queryInputs | rmPrefix() }}"
      scopes: subject.SNOMEDCT
  params_phenoIncreasedDisease:
    fields: >-
      object.MONDO,object.NCIT,object.SNOMEDCT,
      association.edge_attributes,source.edge_sources,,
    size: 1000
    post_filter: >-
      subject.type:"biolink:PhenotypicFeature" AND
      association.predicate:associated_with_increased_likelihood_of AND
      object.type:"biolink:Disease"
```
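
The request bodies in this example rely on `rmPrefix()` so that only the local ID is sent to each `subject.*` scope. As a rough illustration of why the per-namespace split matters, here is a Python sketch of what I assume that filter does (drop everything up to the first colon). It is a stand-in, not the actual templating code, and the IDs are placeholders.

```python
def rm_prefix(curies):
    """Strip the namespace prefix from one CURIE or from a list of CURIEs.

    Assumed behavior: everything up to and including the first ':' is removed;
    values without a ':' are returned unchanged.
    """
    if isinstance(curies, str):
        return curies.split(":", 1)[-1]
    return [curie.split(":", 1)[-1] for curie in curies]


# Placeholder IDs: once the prefix is gone, both look like plain numbers,
# so they have to go to different scopes (subject.HP vs subject.SNOMEDCT).
hp_local = rm_prefix("HP:0000001")
snomed_local = rm_prefix("SNOMEDCT:0000001")
batch = rm_prefix(["NCIT:C0000", "HP:0000002"])
```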

rjawesome commented 10 months ago

I set up this proposal in the multiple-input-output branch using the smartapi-kg and api-respone-transform.js repositories