There seem to be 3 different requirement sets at play that we want to tell apart and be aware of:
"writer-friendly x-bte annotation":
easy to write/teach/maintain, can write manually (without using code or UI)
shouldn't be completely like code
has clear expectations for format / allowed values / what everything is used for
flexible, expressive
not dependent on specific TRAPI / biolink-model stuff that's still in-flux
"internal BTE use": what BTE needs to keep track of all the info, construct sub-queries, edge management, etc. (vocab: BTE MetaEdge, MetaXEdge, bteEdge...)
x-bte annotation may be too "collapsed" from this POV, and BTE will need to expand 1 operation -> multiple internal representations
x-bte annotation may be too verbose/specific from this POV, and this'll need to collapse multiple operations -> 1 MetaEdge for its purpose
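To make the "collapse" direction concrete, here is a minimal sketch in Python. It assumes (this is only my assumption for illustration) that a MetaEdge is keyed on input semantic-type, predicate, and output semantic-type; what BTE actually keys on may differ. The dict field names and the `collapse_to_meta_edges` helper are hypothetical.

```python
from collections import defaultdict

# Hypothetical, simplified operations: real x-bte operations carry much more info.
operations = [
    {"input_type": "Gene", "input_namespace": "NCBIGene",
     "predicate": "related_to", "output_type": "Disease", "output_namespace": "MONDO"},
    {"input_type": "Gene", "input_namespace": "ENSEMBL",
     "predicate": "related_to", "output_type": "Disease", "output_namespace": "MONDO"},
    {"input_type": "SmallMolecule", "input_namespace": "CHEBI",
     "predicate": "treats", "output_type": "Disease", "output_namespace": "MONDO"},
]

def collapse_to_meta_edges(ops):
    """Group operations that share (input type, predicate, output type).

    ASSUMPTION: this key is a guess at what a MetaEdge cares about,
    not BTE's actual logic.
    """
    grouped = defaultdict(list)
    for op in ops:
        key = (op["input_type"], op["predicate"], op["output_type"])
        grouped[key].append(op)
    return dict(grouped)

meta_edges = collapse_to_meta_edges(operations)
```

Here the two Gene operations, which differ only by input ID namespace, end up under one key, while the SmallMolecule operation gets its own.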
Which leads to specific questions for group discussion, like:
How does "1 x-bte operation / unit of annotation" relate to similar concepts (MetaEdges?) in BTE and SmartAPI Registry MetaKG?
and how does x-bte refactoring relate to and potentially change this?
are BTE and SmartAPI Registry MetaKG using the same code? Does that make sense or should they use different code to process x-bte annotations?
And some ideas on how to "expand" an x-bte operation / unit of annotation
Currently, 1 x-bte operation represents...
* 1 API endpoint being used
* 1 unique combo of:
  * input semantic-type
  * input ID namespace
  * sub-query information
  * predicate
  * qualifier-set
  * source field value
  * output semantic-type
  * output ID namespace
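One way to picture this "unit" is as a record whose identity is the full combination of those fields. The sketch below is only illustrative: the class name, field names, and example values are hypothetical and are not the actual x-bte keys.

```python
from dataclasses import dataclass, replace
from typing import Tuple

@dataclass(frozen=True)
class XBteOperation:
    """One unit of annotation; field names are hypothetical, not real x-bte keys."""
    endpoint: str                                # the 1 API endpoint being used
    input_type: str                              # input semantic-type
    input_namespace: str                         # input ID namespace
    sub_query: str                               # sub-query info, simplified to a string
    predicate: str
    qualifier_set: Tuple[Tuple[str, str], ...]   # (qualifier type, value) pairs
    source_field_value: str
    output_type: str
    output_namespace: str

base = XBteOperation(
    endpoint="/query",
    input_type="Gene",
    input_namespace="NCBIGene",
    sub_query="q=example_field:{inputs[0]}",
    predicate="related_to",
    qualifier_set=(),
    source_field_value="example-source",
    output_type="Disease",
    output_namespace="MONDO",
)

# Changing any single field (here: the input ID namespace) makes a distinct operation.
variant = replace(base, input_namespace="ENSEMBL")
operations = {base, variant}
```

Because the dataclass is frozen, equality and hashing use every field, so `operations` holds two separate entries even though only one field differs.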
Jackson @tokebe and I have discussed how to make it easier to write x-bte annotation - and one of our ideas is to have 1 x-bte operation (one unit of annotation?) expand to include more info:
first-step proposal is #748
since there can be "combinatorial explosions" of current operations where the main difference comes from the input/output ID namespaces
Other sources of "combinatorial explosions" are:
unique qualifier-sets
unique source field values
note that none of these is as easy as "list out the possible values". The sub-query info, response-mapping info, and post-processing info can differ for each unique value/set...
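As a rough illustration of the ID-namespace case, a compact annotation that lists several input/output namespaces could be expanded into the current one-operation-per-combo form. This is a sketch under my own assumptions and not necessarily the shape proposed in #748; the keys `input_namespaces` / `output_namespaces` and the `expand_namespaces` helper are hypothetical.

```python
from itertools import product

# Hypothetical compact annotation: lists of namespaces instead of one operation per pair.
compact = {
    "input_type": "Gene",
    "input_namespaces": ["NCBIGene", "ENSEMBL", "HGNC"],
    "predicate": "related_to",
    "output_type": "Disease",
    "output_namespaces": ["MONDO", "DOID"],
}

def expand_namespaces(annotation):
    """Return one flat operation dict per (input namespace, output namespace) pair."""
    list_keys = ("input_namespaces", "output_namespaces")
    shared = {k: v for k, v in annotation.items() if k not in list_keys}
    return [
        {**shared, "input_namespace": in_ns, "output_namespace": out_ns}
        for in_ns, out_ns in product(
            annotation["input_namespaces"], annotation["output_namespaces"]
        )
    ]

expanded = expand_namespaces(compact)
print(len(expanded))  # 3 input namespaces x 2 output namespaces = 6
```

This only works this simply when everything else (sub-query, response-mapping, post-processing) is identical across the combos, which is exactly what isn't guaranteed for qualifier-sets and source field values.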
My qualifier-set thinking
There are theoretically many operations that would mainly differ by qualifier-set (and how that affects sub-query info like post_filter/filter, jmespath, JQ).
The guidance for [anatomical](https://github.com/biolink/biolink-model/blob/db44be0c49939229c28cbb71a715127941e0ce0b/biolink-model.yaml#L1515) / [species](https://github.com/biolink/biolink-model/blob/db44be0c49939229c28cbb71a715127941e0ce0b/biolink-model.yaml#L1532) / and [population](https://github.com/biolink/biolink-model/blob/db44be0c49939229c28cbb71a715127941e0ce0b/biolink-model.yaml#L1158) context qualifiers is currently unclear to me (are they edge-attributes or part of the qualifier-set?). If they turn out to be part of the qualifier-set and we want to support them, this has combinatorial explosion problems because the context qualifiers in our KPs have a lot of possible values.
* anatomical context:
  * multiomics apis (drug response): Guangrong has previously told me that some operations are affected, and include 10-20s of possible tissue/anatomical-context values
  * also in pending apis: ebi gene2pheno
* species context: affects lots of apis
  * core biothings: MyChem chembl.drug_mechanism and drugcentral.bioactivity info, MyGene panther, a little MyDisease disgenet
  * pending biothings: bindingdb, mgi gene2pheno
  * external: ctd, biolink/monarch
* population context:
  * multiomics apis based on clinical data: ehr risk, wellness (clinical trials too?)
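To show why a qualifier value isn't just a label, here is a sketch where one template is expanded per anatomical-context value and the value also changes the filter used on the response. Everything here is hypothetical: the template keys, the `post_filter` string syntax, the tissue labels, and the `expand_by_qualifier` helper are made up for illustration, and whether context qualifiers belong in the qualifier-set at all is still the open question above.

```python
# Hypothetical template: the qualifier value is also needed inside the filter.
template = {
    "predicate": "related_to",
    "qualifier_type": "biolink:anatomical_context_qualifier",
    "post_filter": "tissue == '{value}'",   # made-up filter syntax
}

# Made-up example values; real APIs may have tens of these per operation.
tissues = ["liver", "lung", "brain"]

def expand_by_qualifier(tmpl, values):
    """Build one operation per qualifier value, filling the value into the filter."""
    return [
        {
            "predicate": tmpl["predicate"],
            "qualifier_set": {tmpl["qualifier_type"]: value},
            "post_filter": tmpl["post_filter"].format(value=value),
        }
        for value in values
    ]

per_tissue = expand_by_qualifier(template, tissues)
```

If a second context qualifier (say species) also varied independently, the number of operations would be the product of the two value counts, which is where the explosion comes from.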
My source field thinking
There are theoretically some operations that would mainly differ by source (and how that affects sub-query info like post_filter/filter, jmespath, JQ...).
It would be nice if we could set the source info to field values that are post-processed by BTE...
I'm not sure of the scope of this issue though:
* core biothings apis: mygene, mydisease disgenet
* external apis: biolink/monarch
Also maybe complicated because some api hits will have multiple source values / fields?
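A sketch of what "set the source info to field values" could look like, including the multiple-values complication. This is not existing BTE behavior: the `extract_sources` helper, the dotted-path convention, and the example hit structure are all hypothetical.

```python
def extract_sources(hit, field_path):
    """Follow a dotted path through an API hit and return every source value found.

    Lists encountered along the way are walked element by element, so a hit
    with several records (or a record with several sources) yields several values.
    """
    values = [hit]
    for key in field_path.split("."):
        next_values = []
        for value in values:
            candidates = value if isinstance(value, list) else [value]
            for candidate in candidates:
                if isinstance(candidate, dict) and key in candidate:
                    next_values.append(candidate[key])
        values = next_values
    # Flatten one level so list-valued source fields become individual values.
    flat = []
    for value in values:
        flat.extend(value if isinstance(value, list) else [value])
    return flat

# Made-up hit: one record with a single source, one record with two sources.
hit = {
    "example_assoc": [
        {"source": "SOURCE_A"},
        {"source": ["SOURCE_B", "SOURCE_C"]},
    ]
}
sources = extract_sources(hit, "example_assoc.source")
```

Even with something like this, BTE would still need a rule for which source goes with which edge when one hit yields several values.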
(ref for this issue: previous discussion notes in https://github.com/biothings/biothings_explorer/issues/656)