andrewsu commented 1 year ago

Translator consortium target for implementation in Feb 2023. Exact TRAPI query templates will be created soon per Architecture meeting 2022-12-06...

EDIT Implementation dates:

Jan 25: Development
Feb 13: ITRB CI

colleenXu commented 1 year ago

Note:

it's not clear to me from this title if increases/decreases is a variable provided by the user or not...
I'm unsure right now how to do this "creatively"

Possibly relevant, from Slack:

Andy Crouse (Unsecret Agent, UI team) 1:13 PM This is the test set of genes and drugs for next creative mode query. Note there are some ‘destroyer of worlds’ genes and drugs (valproic acid, vitamin A, TP53, P450) So if the lights dim when testing that is why. But they should be good for pressure testing! I included some that are not classically druggable like non-coding and transcription factors. So hopefully this will cover the gamut. https://docs.google.com/document/d/1X3XbHhS_AIGkSyxSaYUuiCqVNh33Wz7f3EEyKZEctL8/edit?usp=sharing

andrewsu commented 1 year ago

The proposed TRAPI query templates are defined in these issues:

tokebe commented 1 year ago

#534 is now addressed in the add-qualifiers PRs, so integration testing is now possible.

colleenXu commented 1 year ago

Noting:

- The two "increases" starting-TRAPI are identical, except for which QNode has an ID at the start. With our current implementation, this means we'll pick the same templateGroup for both starting-TRAPI. I think that's okay.
  - However, the sub-query QEdge traversals will differ depending on the starting TRAPI (which QNode has an ID at the start). So the edge-attributes can differ due to the "reverse operation" issue, and maybe the number of nodes/edges/results...
- The same applies to the two "decreases" starting-TRAPI.
- Because we don't have qualifier-hierarchy traversal yet, we'll likely include the various qualifier-constraint-combos in our templates using qualifier-sets (OR) (FYI @tokebe)

tokebe commented 1 year ago

@colleenXu Presently, the qualifier-matching only supports one qualifier-set per templateGroup. Would you prefer I change the implementation to support multiple qualifier-sets in a templateGroup for OR matching within a single group (presently, this would be covered by making multiple templateGroups)?

Examples:

Presently:

{
    "name": "Some template group",
    "subject": ["ChemicalEntity"],
    "predicate": ["affects"],
    "object": ["Gene"],
    "qualifiers": {
      "some_qualifier_type": "some_qualifier_value"
    }
  }

Proposed:

{
    "name": "Some template group",
    "subject": ["ChemicalEntity"],
    "predicate": ["affects"],
    "object": ["Gene"],
    "qualifiers": [
      {
        "some_qualifier_type": "some_qualifier_value"
      },
      {
        "some_other_type": "this_would_be_OR"
      }
    ]
  }
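
To make the difference between the two formats concrete, here is a minimal Python sketch (not BTE's actual code, which is not shown in this thread). It assumes a templateGroup's `qualifiers` may be either a single object (the present format) or a list of objects (the proposed format), and that a query matches when its qualifiers are contained in at least one set. The function names are hypothetical.

```python
from typing import Dict, List, Optional, Union

QualifierSet = Dict[str, str]


def normalize_qualifier_sets(
    qualifiers: Optional[Union[QualifierSet, List[QualifierSet]]]
) -> List[QualifierSet]:
    """Return a list of qualifier-sets regardless of which format was used."""
    if qualifiers is None:
        return []
    if isinstance(qualifiers, dict):
        # Present format: one qualifier-set per templateGroup.
        return [qualifiers]
    # Proposed format: several qualifier-sets, matched with OR semantics.
    return list(qualifiers)


def matches_any(
    query_set: QualifierSet,
    group_qualifiers: Optional[Union[QualifierSet, List[QualifierSet]]],
) -> bool:
    """True if the query's qualifiers are contained in at least one set (assumed rule)."""
    return any(
        query_set.items() <= candidate.items()
        for candidate in normalize_qualifier_sets(group_qualifiers)
    )


single = {"some_qualifier_type": "some_qualifier_value"}
multiple = [single, {"some_other_type": "this_would_be_OR"}]
print(matches_any({"some_other_type": "this_would_be_OR"}, single))    # False
print(matches_any({"some_other_type": "this_would_be_OR"}, multiple))  # True
```

With the present format, the second case would need its own templateGroup; the list format folds both into one group.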

colleenXu commented 1 year ago

I think using 1 qualifier-set for matching to templateGroup is fine because that's what we see in this issue (and near-future).

In the third bullet point of my post above, I'm saying that for OUR individual templates, I'll need to use multiple qualifier-sets to query for "activity" vs "activity_or_abundance" because we don't do qualifier-hierarchy expansion yet.
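
As an illustration of what "multiple qualifier-sets" could look like inside one of our templates, here is a small Python sketch. It builds a TRAPI `qualifier_constraints` list with one `qualifier_set` per aspect value, relying on TRAPI treating multiple constraints on a QEdge as alternatives (OR). The helper name is hypothetical, and the exact aspect values each template needs are an assumption.

```python
from typing import Dict, List


def build_qualifier_constraints(aspects: List[str], direction: str) -> List[Dict]:
    """Build one qualifier_set per aspect so a sub-query matches any of them."""
    return [
        {
            "qualifier_set": [
                {
                    "qualifier_type_id": "biolink:object_aspect_qualifier",
                    "qualifier_value": aspect,
                },
                {
                    "qualifier_type_id": "biolink:object_direction_qualifier",
                    "qualifier_value": direction,
                },
            ]
        }
        for aspect in aspects
    ]


# Without qualifier-hierarchy expansion, the template has to list each aspect explicitly.
constraints = build_qualifier_constraints(
    ["activity", "activity_or_abundance"], "increased"
)
print(len(constraints))  # 2
```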

colleenXu commented 1 year ago

Knowledge resources related to this set of creative-mode questions (My analysis)

This is a work-in-progress, so edits will be done over time

Resources ready and BTE already uses

DGIdb
- 12 sets of chem - gene relationships (4 up, 4 down, 4 neither). However, the majority of the data (44557 records) has no relationship assigned (not_applicable)
text-mining
- has chem - gene (up/down)
- also has gene - gene (up/down)

Note that Multiomics APIs has some potentially relevant info but we have some concerns:

drug response: has gene-gene relationships, for negative and positive correlation. But not clear where this comes from or how to use this

Wellness: has chem-chem and chem-gene relationships, for correlation and related_to. But not clear where this comes from or how to use this

Want update and BTE already uses

Semmeddb

improve with a data update + generating operations that use the ncbigene fields
- update may be done soon? See internal lab Slack links. Will involve deprecated IDs/pipes/ncbigene handling/only having novelty=1 records
- When the update happens: we'll want to update the x-bte annotation ASAP-ish (for master + biolink3 branches). It'll help to have the post-API-deployment analysis of metatriples.....
- If I don't have that post-API-deployment analysis of metatriples...
  - if I know the semmeddb data file, I can quickly generate umls operations
  - generating ncbigene operations without the analysis of metatriples is possible but tricky?
    - check which subject/object semmeddb semantic-types have only the ncbigene field for ID. Kinda quick.
    - modify the final combos df to add rows when the metatriple has those subject/object semmeddb-semantic types

BindingDB

looking at the bindingdb website, we may be able to pick a more specific relationship for a chemical-gene pair
- can we get the assay description and find keywords in that text? For example, the assay description for this pair seems to say this chem is an agonist of this gene...
- can we use the existence of certain fields + the values of those fields? For example, if the IC50 field exists and its value is small, maybe this chem is an inhibitor of this gene.
  - The Ki field's existence and value can maybe be used as well. I'm not sure if other fields can be used.
  - the values are tricky because some are integers, some are floats, and some are numbers prefixed with a comparison operator, like ">40000"
info: see my old note here, listing fields here, bindingDB's info
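
A rough Python sketch of how the mixed value formats could be handled. The field name `ic50`, the record shape, and the 1000 nM threshold are all assumptions for illustration; none of them come from BindingDB's documentation or from BTE.

```python
import re
from typing import Optional, Tuple

# Optional comparison operator followed by an integer or float.
_VALUE = re.compile(r"^\s*([<>]=?)?\s*([0-9]*\.?[0-9]+(?:[eE][+-]?[0-9]+)?)\s*$")


def parse_affinity(raw) -> Optional[Tuple[str, float]]:
    """Return (operator, value), or None if the value cannot be parsed."""
    if raw is None:
        return None
    if isinstance(raw, (int, float)):
        return ("=", float(raw))
    match = _VALUE.match(str(raw))
    if match is None:
        return None
    return (match.group(1) or "=", float(match.group(2)))


def looks_like_inhibitor(record: dict, threshold_nm: float = 1000.0) -> bool:
    """Hypothetical rule: an IC50 at or below the threshold suggests inhibition."""
    parsed = parse_affinity(record.get("ic50"))  # "ic50" is an assumed field name
    if parsed is None:
        return False
    operator, value = parsed
    # A value like ">40000" only gives a lower bound, so it never qualifies.
    return operator in ("=", "<", "<=") and value <= threshold_nm


print(parse_affinity(">40000"))               # ('>', 40000.0)
print(looks_like_inhibitor({"ic50": 3.2}))    # True
print(looks_like_inhibitor({"ic50": ">40000"}))  # False
```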

Want new pending API

MyChem `chembl.drug_mechanisms` data, in subject-association-object format

We want 1 association per unique combo of subject/object/action_type.
- Right now in MyChem, all data is aggregated by unique chemical (chem-centric format). So when a chemical has multiple drug_mechanisms, we can't retrieve only the sections that have a specific action_type.
- MyChem currently has info for 5262 unique chemicals
We want the gene ID to be NCBIGene (or UniProtKB or ENSEMBL ENSG maybe).
- Right now, it's CHEMBL.TARGET and Translator's Node Normalizer doesn't have cross-mappings/name-retrieval for that ID namespace
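
To illustrate the reshaping being asked for, here is a minimal Python sketch that turns one chem-centric document into subject-association-object rows, keeping one row per unique subject/object/action_type combo. The document shape and the field names `_id` and `target_id` are assumptions for illustration, not MyChem's confirmed schema.

```python
from typing import Dict, List


def to_associations(chem_doc: Dict) -> List[Dict]:
    """Flatten a chem-centric doc into one row per (subject, object, action_type)."""
    mechanisms = chem_doc.get("chembl", {}).get("drug_mechanisms", [])
    if isinstance(mechanisms, dict):
        # Assumed: a single mechanism may arrive as an object instead of a list.
        mechanisms = [mechanisms]

    seen = set()
    associations = []
    for mechanism in mechanisms:
        key = (
            chem_doc["_id"],
            mechanism.get("target_id"),    # assumed field name
            mechanism.get("action_type"),
        )
        if None in key or key in seen:
            continue  # skip incomplete or duplicate combos
        seen.add(key)
        associations.append(
            {"subject": key[0], "object": key[1], "action_type": key[2]}
        )
    return associations


doc = {
    "_id": "CHEM:1",
    "chembl": {
        "drug_mechanisms": [
            {"target_id": "GENE:A", "action_type": "INHIBITOR"},
            {"target_id": "GENE:A", "action_type": "INHIBITOR"},
            {"target_id": "GENE:B", "action_type": "AGONIST"},
        ]
    },
}
print(to_associations(doc))
```

With rows shaped like this, a query could ask for only the associations with a specific `action_type`, which is what the chem-centric format prevents today.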

MyChem `drugcentral.bioactivity` data, in subject-association-object format

We want 1 association per unique combo of subject/object/action_type.
- Right now in MyChem, all data is aggregated by unique chemical (chem-centric format). So when a chemical has multiple bioactivity, we can't retrieve only the sections that have a specific action_type.
- MyChem currently has info for 2772 unique chemicals
We want the gene ID to be UniProtKB (or NCBIGene or ENSEMBL ENSG maybe).
- Right now, it's UniProtKB but 119 chemicals lack the UniProtKB record...and their bioactivity info looks odd (not acting on genes?)

Not promising

#### Bioplanet

* within a pathway, there can be "activation" and "inhibition" arrows. However....
  * I find these hard to understand (see [example](https://tripod.nih.gov/bioplanet/detail.jsp?pid=bioplanet_49&target=pathway))
  * the actions may be done by complexes or done on complexes (not individual genes or chemicals)
  * not clear how many pathways involve exogenous chemicals...
* I don't see an easy download/way of parsing these pathways to get "chemical-gene" associations (biopax?)

colleenXu commented 1 year ago

Results for the direct-increases.json template with Gene Inputs

Reasonable number of results

| Gene Name | Gene ID | Time to run (s) | Number of results | Notes |
| --- | --- | --- | --- | --- |
| MAPK8IP3 | NCBIGene:23162 | 11 | 4 | but only 1 looks correct; Text-Mining |
| P450 (CYP27B1?) | NCBIGene:1594 | 12 | 48 | semmeddb + text-mining |
| XIST | NCBIGene:7503 | 12 | 4 | semmeddb |
| ADNP | NCBIGene:23394 | 12 | 4 | semmeddb + text-mining |
| MEF2C | NCBIGene:4208 | 12 | 48 | text-mining |
| MALAT1 | NCBIGene:378938 | 11 | 3 | semmeddb |
| DYRK1A | NCBIGene:1859 | 11 | 14 | text-mining |
| PCSK1N | NCBIGene:27344 | 11 | 6 | semmeddb + text-mining |
| TTR | NCBIGene:7276 | 12 | 91 | semmeddb + text-mining |
| GNAS | NCBIGene:2778 | 11 | 9 | semmeddb |
| IAPP | NCBIGene:3375 | 12 | 51 | text-mining |
| SCG5 | NCBIGene:6447 | 11 | 1 | text-mining |
| B2M | NCBIGene:567 | 12 | 66 | text-mining |
| Cytochrome B | NCBIGene:4519 | 12 | 16 | text-mining |
| RBP4 | NCBIGene:5950 | 12 | 16 | text-mining |
| NGLY1 | NCBIGene:55768 | 11 | 2 | text-mining |
| RHOBTB2 | NCBIGene:23221 | 11 | 3 | text-mining |
| WDFY3 | NCBIGene:23001 | 12 | 5 | text-mining |
| KCNT1 | NCBIGene:57582 | 11 | 5 | **DGIDB** + text-mining |
| NALCN | NCBIGene:259232 | 11 | 1 | semmeddb |
| CACNA1A | NCBIGene:773 | 11 | 2 | text-mining |
| BICD2 | NCBIGene:23299 | 10 | 4 | semmeddb + text-mining |
| FHL1 (CFH) | NCBIGene:3075 | 11 | 4 | semmeddb + text-mining |
| SETBP1 | NCBIGene:26040 | 12 | 3 | text-mining |

Many results (exploding)

| Gene Name | Gene ID | Time to run (s) | Number of results | Notes |
| --- | --- | --- | --- | --- |
| TP53 | NCBIGene:7157 | 32 | 1744 | semmeddb + text-mining |
| MMP9 | NCBIGene:4318 | 20 | 753 | semmeddb + text-mining |
| GAPDH | NCBIGene:2597 | 15 | 220 | semmeddb + text-mining |
| Cytochrome oxidase | NCBIGene:4512 | 14 | 177 | semmeddb + text-mining |
| PPP3CA | NCBIGene:5530 | 14 | 182 | semmeddb + text-mining |
| ATF3 | NCBIGene:467 | 14 | 226 | semmeddb + text-mining |

No results

| Gene Name | Gene ID | Time to run (s) | Number of results | Notes |
| --- | --- | --- | --- | --- |
| TSIX | NCBIGene:9383 | 13 | 0 | |
| BORCS8-MEF2B | NCBIGene:4207 | 9 | 0 | |
| FAU | NCBIGene:2197 | 10 | 0 | |
| DHX30 | NCBIGene:22907 | 10 | 0 | |
| SAMD9L | NCBIGene:219285 | 10 | 0 | |
| Engase | NCBIGene:64772 | 10 | 0 | |

colleenXu commented 1 year ago

I found that 2 templateGroup items worked: one for Chem-increases-Gene and one for Chem-decreases-Gene (input ID can be on the Chem QNode or the Gene QNode). I wrote the templateGroups and the templates @andrewsu + I discussed on Monday and did a bit of testing (see next post). These are in this branch https://github.com/biothings/bte_trapi_query_graph_handler/pull/132

template ideas we decided to write up

- Chem (increases/decreases) gene (direct)
- Chem increases another gene, then that gene (upregs/downregs) Gene (upregs -> overall `increase`, downregs -> overall `decrease`)
  - I modified QEdges to be more specific: removing `increased activity_or_abundance` for eA and `increased`/`decreased` `activity_or_abundance` for eB. This removed semmeddb and some dgidb edges...
- Chem decreases another gene, then that gene (downregs/upregs) Gene (downregs -> overall `increase`, upregs -> overall `decrease`)
  - I modified QEdges to be more specific (removing `increased activity_or_abundance` for eA and `increased`/`decreased` `activity_or_abundance` for eB). This removed semmeddb and some dgidb edges...
- Chem interacts with another gene, then that gene (upreg/downreg) gene
  - I modified QEdges to be more specific: using `physically_interacts_with` for eA and `increased`/`decreased` `activity_or_abundance` for eB. This removed semmeddb edges and a major dgidb edge...
- Chem interacts with gene (direct)
  - I modified QEdges to be more specific: using `physically_interacts_with` for eA. This removed semmeddb edges and a major dgidb edge...
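
The direction logic of the two-hop templates above follows a simple sign rule: two edges with the same direction give an overall increase, and two edges with opposite directions give an overall decrease. A small Python sketch of that rule (the function names are hypothetical and this is not how the templates are stored):

```python
from typing import List, Tuple

DIRECTIONS = ("increased", "decreased")


def compose(e_a: str, e_b: str) -> str:
    """Overall effect of Chem -eA-> intermediate gene -eB-> Gene."""
    for direction in (e_a, e_b):
        if direction not in DIRECTIONS:
            raise ValueError(f"unknown direction: {direction}")
    # Same direction on both hops -> increase; opposite directions -> decrease.
    return "increased" if e_a == e_b else "decreased"


def two_hop_combos(overall: str) -> List[Tuple[str, str]]:
    """All (eA, eB) direction pairs that produce the requested overall direction."""
    return [(a, b) for a in DIRECTIONS for b in DIRECTIONS if compose(a, b) == overall]


print(two_hop_combos("increased"))
# [('increased', 'increased'), ('decreased', 'decreased')]
print(two_hop_combos("decreased"))
# [('increased', 'decreased'), ('decreased', 'increased')]
```

This is why each templateGroup (increases, decreases) ends up with two of the four possible two-hop direction pairs.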

colleenXu commented 1 year ago

How I tested

Check out fix_issue_532 branch for query-handler and dev branches for all other modules (including bte-trapi-workspace). For the "user"/"from UI/ARS" query...the main skeleton of the query is this. What varies is (1) which QNode also has an ids field with 1 user-submitted ID and (2) whether the object_direction_qualifier value is increased or decreased (I put X there).

skeleton

```
{
    "message": {
        "query_graph": {
            "nodes": {
                "gene": {
                    "categories": ["biolink:Gene"]
                },
                "chemical": {
                    "categories": ["biolink:ChemicalEntity"]
                }
            },
            "edges": {
                "t_edge": {
                    "object": "gene",
                    "subject": "chemical",
                    "predicates": ["biolink:affects"],
                    "knowledge_type": "inferred",
                    "qualifier_constraints": [
                        {
                            "qualifier_set": [
                                {
                                    "qualifier_type_id": "biolink:object_aspect_qualifier",
                                    "qualifier_value": "activity_or_abundance"
                                },
                                {
                                    "qualifier_type_id": "biolink:object_direction_qualifier",
                                    "qualifier_value": "X"
                                }
                            ]
                        }
                    ]
                }
            }
        }
    }
}
```

So here's an example of ChemX (sumatriptan) increases genes

```
{
    "message": {
        "query_graph": {
            "nodes": {
                "gene": {
                    "categories": ["biolink:Gene"]
                },
                "chemical": {
                    "categories": ["biolink:ChemicalEntity"],
                    "ids": ["PUBCHEM.COMPOUND:5358"]
                }
            },
            "edges": {
                "t_edge": {
                    "object": "gene",
                    "subject": "chemical",
                    "predicates": ["biolink:affects"],
                    "knowledge_type": "inferred",
                    "qualifier_constraints": [
                        {
                            "qualifier_set": [
                                {
                                    "qualifier_type_id": "biolink:object_aspect_qualifier",
                                    "qualifier_value": "activity_or_abundance"
                                },
                                {
                                    "qualifier_type_id": "biolink:object_direction_qualifier",
                                    "qualifier_value": "increased"
                                }
                            ]
                        }
                    ]
                }
            }
        }
    }
}
```
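
For anyone who wants to generate these test queries instead of hand-editing the JSON, here is a small Python sketch. It fills in the two things that vary (which QNode gets the ID, and the direction qualifier). The function name is hypothetical; the query structure is copied from the skeleton in this post.

```python
import json
from typing import Optional


def build_creative_query(
    direction: str,
    chemical_id: Optional[str] = None,
    gene_id: Optional[str] = None,
) -> dict:
    """Fill in the skeleton: exactly one QNode gets an ID, plus the direction."""
    if direction not in ("increased", "decreased"):
        raise ValueError("direction must be 'increased' or 'decreased'")
    if (chemical_id is None) == (gene_id is None):
        raise ValueError("provide exactly one of chemical_id or gene_id")

    gene = {"categories": ["biolink:Gene"]}
    chemical = {"categories": ["biolink:ChemicalEntity"]}
    if gene_id is not None:
        gene["ids"] = [gene_id]
    else:
        chemical["ids"] = [chemical_id]

    edge = {
        "object": "gene",
        "subject": "chemical",
        "predicates": ["biolink:affects"],
        "knowledge_type": "inferred",
        "qualifier_constraints": [
            {
                "qualifier_set": [
                    {
                        "qualifier_type_id": "biolink:object_aspect_qualifier",
                        "qualifier_value": "activity_or_abundance",
                    },
                    {
                        "qualifier_type_id": "biolink:object_direction_qualifier",
                        "qualifier_value": direction,
                    },
                ]
            }
        ],
    }
    return {
        "message": {
            "query_graph": {
                "nodes": {"gene": gene, "chemical": chemical},
                "edges": {"t_edge": edge},
            }
        }
    }


# Same query as the sumatriptan example above.
query = build_creative_query("increased", chemical_id="PUBCHEM.COMPOUND:5358")
print(json.dumps(query, indent=4))
```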

Discussed

EDIT: discussed below + during Wednesday 1/11 meeting. Decisions added to each point

@tokebe

After testing https://github.com/biothings/bte_trapi_query_graph_handler/pull/132 (it's branched off dev, and I tested it with all my other branches as dev)....overall it looks close to done. However, I have the following questions / issues:

1. Looking at the logs, I'm not sure if some results are "pruned / removed" incorrectly. A confounding / related issue is that I don't see "same results from multiple templates merging" console logs and the "result merging" TRAPI-logs are buggy. -> JC to look into this (aka issues with creative-mode)
   - Using the chem sumatriptan PUBCHEM.COMPOUND:5358 is a good example (runs fairly fast, results from multiple templates and merging happens). See its entries in the Chem tables of the next post.
2. BTE isn't doing "exact qualifier-set" matching to templateGroups. Is that alright? -> CX asked Translator group (Translator private Slack link). This behavior is fine right now
   - For example, if I send a query with an "inferred" "activity_or_abundance" QEdge (no increased/decreased direction qualifier), all 10 templates would run...

query with no direction qualifier so all (EDIT: 9) templates will be loaded

```
{
    "message": {
        "query_graph": {
            "nodes": {
                "gene": {
                    "categories": ["biolink:Gene"]
                },
                "chemical": {
                    "categories": ["biolink:ChemicalEntity"],
                    "ids": ["PUBCHEM.COMPOUND:5358"]
                }
            },
            "edges": {
                "t_edge": {
                    "object": "gene",
                    "subject": "chemical",
                    "predicates": ["biolink:affects"],
                    "knowledge_type": "inferred",
                    "qualifier_constraints": [
                        {
                            "qualifier_set": [
                                {
                                    "qualifier_type_id": "biolink:object_aspect_qualifier",
                                    "qualifier_value": "activity_or_abundance"
                                }
                            ]
                        }
                    ]
                }
            }
        }
    }
}
```

3. I noticed something odd with the "limiting execution" code: the records accumulating for a QEdge seemed a lot higher than the stated limit. The console log said `bte:call-apis:query QEdge eA obtained 59903 records, exceeding maximum of 30000. Skipping remaining 1 (0 planned/1 paged) queries for this edge. Your query may be too general? +0ms`. -> It checks after each individual sub-query. Sometimes the excess is a lot! JC will adjust code to truncate and remove the records over the maximum (he'll decide how to implement).
   - this log happened during the 4th template for Chem decreases GAPDH/NCBIGene:2597. It then was taking a very long time for the ID resolution (so I killed the execution manually).
   - Something similar happens for increases GAPDH but there are between 30k-40k records there (still takes too long for ID resolution).
   - See GAPDH's entries in the Gene tables of the next post.
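
To show why the record count can overshoot, and what truncating would change, here is a simplified Python sketch of the behavior described in this point. It is not the call-apis code: sub-queries are modeled as plain callables, and the names are hypothetical.

```python
from typing import Callable, List, Tuple


def collect_records(
    sub_queries: List[Callable[[], list]],
    max_records: int = 30000,
    truncate: bool = True,
) -> Tuple[list, int]:
    """Run sub-queries in order, checking the limit only after each one finishes."""
    records: list = []
    skipped = 0
    for index, run in enumerate(sub_queries):
        records.extend(run())
        if len(records) > max_records:
            # The check happens after the whole sub-query returned, so a single
            # large response can push the total far past the limit.
            skipped = len(sub_queries) - index - 1
            if truncate:
                del records[max_records:]  # drop the excess records
            break
    return records, skipped


queries = [
    lambda: list(range(20)),
    lambda: list(range(50)),  # pushes the total to 70, past the limit of 30
    lambda: list(range(5)),   # never runs
]
kept, skipped = collect_records(queries, max_records=30)
print(len(kept), skipped)  # 30 1
overshoot, _ = collect_records(queries, max_records=30, truncate=False)
print(len(overshoot))      # 70
```

Without truncation, every downstream step (such as ID resolution) has to process the full overshoot, which matches the slowdown seen with GAPDH.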

Starting query for Chem decreases GeneY (GAPDH)

```
{
    "message": {
        "query_graph": {
            "nodes": {
                "gene": {
                    "categories": ["biolink:Gene"],
                    "ids": ["NCBIGene:2597"]
                },
                "chemical": {
                    "categories": ["biolink:ChemicalEntity"]
                }
            },
            "edges": {
                "t_edge": {
                    "object": "gene",
                    "subject": "chemical",
                    "predicates": ["biolink:affects"],
                    "knowledge_type": "inferred",
                    "qualifier_constraints": [
                        {
                            "qualifier_set": [
                                {
                                    "qualifier_type_id": "biolink:object_aspect_qualifier",
                                    "qualifier_value": "activity_or_abundance"
                                },
                                {
                                    "qualifier_type_id": "biolink:object_direction_qualifier",
                                    "qualifier_value": "decreased"
                                }
                            ]
                        }
                    ]
                }
            }
        }
    }
}
```

4. Still having issues with the logs summarizing the execution of each QEdge: "eA03 execution: 0 queries (0 success/0 fail) and (0) cached qEdges return (80) records". I mentioned that here (lab's internal Slack) -> JC to look into this (aka issues with creative-mode)
5. I think we can consider drastically dropping the max number of results to return (maybe 200? 400?). -> we discussed and agreed to drop max number to 500 for all creative-mode (same as ARAX)

colleenXu commented 1 year ago

Testing

[reverted; this records the testing done 1/10]

Increases

| Chem Name | Chem ID | Final results | Template breakdown and notes |
| --- | --- | --- | --- |
| metformin | PUBCHEM.COMPOUND:4091 | 1000 | > 1000 on first template |
| palmitic acid | PUBCHEM.COMPOUND:12358543 | 0 | hmmm...example of no results |
| sumatriptan | PUBCHEM.COMPOUND:5358 | 137 | 37, 4, 2, 85, 32 (160 sum). Results dropping or merging? |

| Gene Name | Gene ID | Final results | Template breakdown and notes |
| --- | --- | --- | --- |
| MAPK8IP3 | NCBIGene:23162 | 4 | 4 from first template, none from the rest. Compare to `decreases` entry. |
| P450 (CYP27B1?) | NCBIGene:1594 | 832 | 48, 0, 0, 770, 18 (836 sum). Results dropping or merging? |
| MEF2C | NCBIGene:4208 | 1000 | 48, 1, 0, 6453 (stop). |
| GAPDH | NCBIGene:2597 | presume 1000 | 220, 88, 112, then get stuck on 4th template's "physically_interacts_with" QEdge execution. Example of explosion. |

Decreases

| Chem Name | Chem ID | Final results | Template breakdown and notes |
| --- | --- | --- | --- |
| metformin | PUBCHEM.COMPOUND:4091 | 1000 | > 1000 on first template |
| palmitic acid | PUBCHEM.COMPOUND:12358543 | 0 | hmmm...example of no results |
| sumatriptan | PUBCHEM.COMPOUND:5358 | 177 | 56, 19, 3, 132, 32 (242 sum). Results dropping or merging? |

| Gene Name | Gene ID | Final results | Template breakdown and notes |
| --- | --- | --- | --- |
| MAPK8IP3 | NCBIGene:23162 | 761 | 7, 0, 0, 754, 0 (sum 761, so no dropping/merging) |
| P450 (CYP27B1?) | NCBIGene:1594 | 1000 | 31, 0, 0, 6694 (stop) |
| MEF2C | NCBIGene:4208 | 1000 | 34, 0, 0, 6923 (stop) |
| GAPDH | NCBIGene:2597 | presume 1000 | 290, 120, 24, then get stuck on 4th template's "physically_interacts_with" QEdge execution. Example of explosion. |

tokebe commented 1 year ago

1. I'll have to take a closer look at issues with results merging/pruning.
2. I had previously stated that, currently, templates are matched purely by the query qualifier set being a subset of the templateGroup, rather than an exact match (i.e. both sets are equivalent, templateGroup has no additional qualifiers). I can change this to expecting an exact match instead?
3. The record cutoff is checked after each query. If a query returns an absurd number of records, it's possible to exceed the record cutoff by an absurd amount. Should we instead delete records above the cutoff? I'd like @andrewsu's opinion on this as well.
4. Looking into it.
5. I don't have a problem with this -- @andrewsu?
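
The difference between the two matching rules in point 2 can be sketched in a few lines of Python. This is an illustration with hypothetical names and simplified qualifier dictionaries, not the query-handler code.

```python
from typing import Dict


def template_group_matches(
    query_qualifiers: Dict[str, str],
    group_qualifiers: Dict[str, str],
    exact: bool = False,
) -> bool:
    """Subset matching (current behavior as described) vs. exact matching."""
    if exact:
        return query_qualifiers == group_qualifiers
    # Subset: every qualifier in the query must appear in the templateGroup.
    return query_qualifiers.items() <= group_qualifiers.items()


increases_group = {
    "object_aspect_qualifier": "activity_or_abundance",
    "object_direction_qualifier": "increased",
}
decreases_group = {
    "object_aspect_qualifier": "activity_or_abundance",
    "object_direction_qualifier": "decreased",
}
# A query with no direction qualifier.
aspect_only = {"object_aspect_qualifier": "activity_or_abundance"}

for group in (increases_group, decreases_group):
    print(
        template_group_matches(aspect_only, group),
        template_group_matches(aspect_only, group, exact=True),
    )
# True False
# True False
```

Under subset matching, the direction-less query selects both templateGroups, which is why all their templates run; exact matching would select neither.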

andrewsu commented 1 year ago

2. @colleenXu can you post the question to one of the translator slack channels? I can see arguments both ways, and I think this is something that should be standardized across ARAs.
3. Originally my impression was that if we have the results beyond our threshold, might as well keep them. But I hadn't thought of the fact that other downstream things need to happen (like ID resolution). If those downstream things are essentially preventing BTE from returning the partial results, then I vote for deleting excess records. If not (and the issue is only the oddity of exceeding our threshold by a significant margin), then I vote for keeping the excess records. EDIT: decision made during 2023-01-11 meeting to delete excess records.
5. I'd suggest 500 to match what ARAX returns.

tokebe commented 1 year ago

@colleenXu I've fixed the results merging count and logging. As a result, I can confirm that no results are being dropped. Running sumatriptan, I get 433 results added across the templates, and 244 results in the final response. Logging now shows 189 results that were combined. 433 - 189 = 244, everything checks out. I was actually running in circles for a while because I thought I had to account for the number of results merged into, but this is simply a count of results from the 244 that were combined from multiple templates.
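
The accounting above (results added minus results combined equals final results) can be illustrated with a short Python sketch. It assumes two results count as the same when they share a chemical-gene pair; that key and the function name are assumptions for illustration, not the actual merge code.

```python
from typing import Dict, List, Tuple


def merge_template_results(
    results_by_template: List[List[dict]],
) -> Tuple[Dict[tuple, dict], int, int]:
    """Merge results across templates and count how many were combined."""
    merged: Dict[tuple, dict] = {}
    added = 0      # every result produced by any template
    combined = 0   # results folded into an already-existing result
    for template_index, results in enumerate(results_by_template):
        for result in results:
            added += 1
            key = (result["chemical"], result["gene"])  # assumed identity key
            if key in merged:
                merged[key]["templates"].append(template_index)
                combined += 1
            else:
                merged[key] = {**result, "templates": [template_index]}
    return merged, added, combined


template_a = [{"chemical": "C1", "gene": f"G{i}"} for i in range(5)]
template_b = [{"chemical": "C1", "gene": f"G{i}"} for i in range(3, 7)]
final, added, combined = merge_template_results([template_a, template_b])
print(added, combined, len(final))  # 9 2 7
```

Under this bookkeeping, `added - combined == len(final)` always holds, which is the same check as 433 - 189 = 244.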

I'm not sure what differs between our local setups to cause the difference in numbers (perhaps I haven't updated specs recently?), but all results are accounted for, and no code is dropping results.

I've also fixed issue 4 -- a piece of code in call-apis was still using pre-qedge-refactor accessors, breaking the counts.

Additionally, I've pushed changes so records are truncated to the cutoff, and changed the creative limit to 500.

All set to proceed with testing again 👍

colleenXu commented 1 year ago

@tokebe

Questions I have after my second round of testing:

Followup on Point 1 of this post: When I run a decreased query for sumatriptan (PUBCHEM.COMPOUND:5358), I notice that there are 234 final results but one of the templates reports getting 247 results (in the logs). I'm then confused...if we're not truncating results, how can a template get more results than there are in the final count?

TRAPI query for `decreased` sumatriptan + screenshot of logs

{
    "message": {
        "query_graph": {
            "nodes": {
                "gene": {
                    "categories": ["biolink:Gene"]
                },
                "chemical": {
                    "categories": ["biolink:ChemicalEntity"],
                    "ids": ["PUBCHEM.COMPOUND:5358"]
                }    
            },
            "edges": {
                "t_edge": {
                    "object": "gene",
                    "subject": "chemical",
                    "predicates": ["biolink:affects"],
                    "knowledge_type": "inferred",
                    "qualifier_constraints": [
                        {
                            "qualifier_set": [
                                {
                                    "qualifier_type_id": "biolink:object_aspect_qualifier",
                                    "qualifier_value": "activity_or_abundance"
                                },
                                {
                                    "qualifier_type_id": "biolink:object_direction_qualifier",
                                    "qualifier_value": "decreased"
                                }
                            ]
                        }
                    ]
                }
            }
        }
    }
}

I circled in orange the numbers mentioned above. Screen Shot 2023-01-19 at 3 50 51 PM

Another followup on Point 1: I'm still confused by the "merge" logs (TRAPI and console) 😖 since I don't understand how, when merging the results from two templates, the "merged result" and "final result" numbers can be larger than the result count for one of those templates. The chem estrogen's queries are good examples (see the collapsed section below for the TRAPI query / info).
- related question: I notice only 1 "merge" log at the end of creative-mode execution, even when we execute > 2 templates. However, it kinda looks like we merge previous results with new results after running each template - so it'd make more sense to put logs there (how many were merged / how many results there are so far). But maybe I'm incorrect?
- related question: it looks like we merge results before we truncate to 500. Just checking if my understanding is correct...

TRAPI query for `increased` estrogen + screenshot of logs

For `increases`, the estrogen query logs say 271 + 5692 results have 3409 results "merged" to "1001 final results".

```
{
    "message": {
        "query_graph": {
            "nodes": {
                "gene": {
                    "categories": ["biolink:Gene"]
                },
                "chemical": {
                    "categories": ["biolink:ChemicalEntity"],
                    "ids": ["PUBCHEM.COMPOUND:5991"]
                }
            },
            "edges": {
                "t_edge": {
                    "object": "gene",
                    "subject": "chemical",
                    "predicates": ["biolink:affects"],
                    "knowledge_type": "inferred",
                    "qualifier_constraints": [
                        {
                            "qualifier_set": [
                                {
                                    "qualifier_type_id": "biolink:object_aspect_qualifier",
                                    "qualifier_value": "activity_or_abundance"
                                },
                                {
                                    "qualifier_type_id": "biolink:object_direction_qualifier",
                                    "qualifier_value": "increased"
                                }
                            ]
                        }
                    ]
                }
            }
        }
    }
}
```

I circled in orange the numbers mentioned above.

![Screen Shot 2023-01-19 at 11 24 06 PM](https://user-images.githubusercontent.com/43731687/213640157-e97114a2-f0c4-4eca-9a59-ce5b7931d21d.png)

Followup on Point 3 of this post: my local BTE instance is still getting stuck on GAPDH queries (see notes in the "second round of testing" collapsed section, gene tables). Maybe we'll address this with optimizations later...but it makes me wonder if BTE will crash during some testing of these creative-modes, and if 30k records is still kinda high...

(Points 4-5 from the post have been addressed by the recent changes (yay), and point 2 is still pending / doesn't seem to be an issue)

colleenXu commented 1 year ago

Second round of testing

[EDITED 1/23-1/26 after is_set: true added to templates and fixes added to logging. Rearranged list to match original Translator issue curie lists]

I tested more chemicals and genes this time, including all chemicals listed in the Translator posts

Increased

starting with chem

| Chem Name | Chem ID | Final results | Template breakdown and notes | Time |
| --- | --- | --- | --- | --- |
| Amphetamine | PUBCHEM.COMPOUND:3007 | 500 | 306, 292 (merge 58, add 234), then stopped execution. 540 total, and truncated | 33 s |
| Dextroamphetamine | PUBCHEM.COMPOUND:5826 | 230 | 36, 104 (merge 8, add 96), 60 (merge 4, add 56), 54 (merge 36, add 18), 31 (merge 7, add 24) | 28 s |
| (+/-)-Methylphenidate hydrochloride (curie given) | PUBCHEM.COMPOUND:44246724 | 0 | no results | 14 s |
| Methylphenidate (better curie) | PUBCHEM.COMPOUND:4158 | 314 | 131, 139 (merge 29, add 110), 59 (merge 11, add 48), 40 (merge 20, add 20), 8 (merge 3, add 5) | 35 s |
| metformin | PUBCHEM.COMPOUND:4091 | 500 | only 1st template (1184 results) | 27 s |
| Atorvastatin | PUBCHEM.COMPOUND:60823 | 500 | only 1st template (564 results) | 17 s |
| Valproic acid glucuronide (curie given) | PUBCHEM.COMPOUND:88111 | 0 | no results | 14 s |
| Valproic acid (better curie) | PUBCHEM.COMPOUND:3121 | 500 | only 1st template (2553 results) | 1 min 7 s |
| Vitamin A / retinol | PUBCHEM.COMPOUND:445354 | 500 | 487, 359 (merge 86, add 273), then stopped execution. 760 total, and truncated | 32 s |
| Vitamin C (ascorbic acid) | PUBCHEM.COMPOUND:54670067 | 500 | only 1st template (850 results) | 22 s |
| Vitamin D (Cholecalciferol) | PUBCHEM.COMPOUND:5280795 | 500 | only 1st template (537 results) | 16 s |
| Maltodextrin (curie given) | PUBCHEM.COMPOUND:79025 | 3 | 1 from 4th template, 2 from 5th template. None merged. | 17 s |
| Glucose (better curie) | PUBCHEM.COMPOUND:24749 | 54 | only 1st template. But none scored... | 15 s |
| Magnesium ion (curie given) | PUBCHEM.COMPOUND:888 | 46 | only 1st template | 14 s |
| Magnesium (atom, better curie) | PUBCHEM.COMPOUND:5462224 | 500 | 211, 236 (merge 9, add 227), 212 (merge 52, add 160), then stopped execution. 598 total, and truncated | 23 s |
| DHEA (Dehydroepiandrosterone) | PUBCHEM.COMPOUND:5881 | 500 | 438, 121 (merge 46, add 75), then stopped execution. 513 total, and truncated | 25 s |
| Testosterone | PUBCHEM.COMPOUND:6013 | 500 | only 1st template (1069 results) | 30 s |
| Ethinylestradiol (curie given) | PUBCHEM.COMPOUND:5991 | 500 | 271, 2390 (merge 107, add 2283), then stopped execution. 2554 total, and truncated | 1 min 19 s |
| Estrogens (another curie) | UMLS:C0014939 | 500 | only 1st template (3531 results) | 2 min 2 s |
| Somatostatin acetate (curie given) | PUBCHEM.COMPOUND:16129681 | 184 | only 1st template | 20 s |
| Somatostatin (better curie) | PUBCHEM.COMPOUND:101826531 | 218 | 183, 16 (merge 10, add 6), 0, 42 (merge 26, add 16), 21 (merge 8, add 13) | 24 s |
| Amitriptyline | PUBCHEM.COMPOUND:2160 | 500 | 172, 216 (merge 31, add 185), 456 (merge 84, add 372), then stopped execution. 729 total, and truncated | 37 s |
| Gabapentin | PUBCHEM.COMPOUND:3446 | 186 | 70, 3 (merge 1, add 2), 107 (merge 15, add 92), 18 (merge 7, add 11), 13 (merge 2, add 11) | 28 s |
| Propranolol | PUBCHEM.COMPOUND:4946 | 500 | 327, 220 (merge 57, add 163), 277 (merge 101, add 176), then stopped execution. 666 total, and truncated | 34 s |
| sumatriptan | PUBCHEM.COMPOUND:5358 | 185 | 38, 4 (merge 1, add 3), 11 (merge 2, add 9), 121 (merge 7, add 114), 37 (merge 16, add 21) | 31 s |
| d4-Palmitic acid (curie given) | PUBCHEM.COMPOUND:12358543 | 0 | no results | 14 s |
| palmitic acid (better curie) | PUBCHEM.COMPOUND:135369651 | 500 | only 1st template (905 results) | 18 s |
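
The "merge X, add Y ... then stopped execution ... truncated" notes in the table follow a pattern that can be sketched in Python: templates run in order, new results are merged into the running set, execution stops once the running total reaches the limit, and the final list is cut to 500. This is a simplified model of the observed behavior with hypothetical names, not the creative-mode code itself.

```python
from typing import Callable, Hashable, List, Tuple


def run_until_limit(
    templates: List[Callable[[], List[Hashable]]],
    limit: int = 500,
) -> Tuple[List[Hashable], List[Tuple[int, int, int]]]:
    """Return (final results, per-template log of (returned, merged, added))."""
    final: List[Hashable] = []
    seen = set()
    log = []
    for run in templates:
        results = run()
        merged = added = 0
        for key in results:
            if key in seen:
                merged += 1  # already found by an earlier template
            else:
                seen.add(key)
                final.append(key)
                added += 1
        log.append((len(results), merged, added))
        if len(final) >= limit:
            break  # remaining templates are not executed
    return final[:limit], log


# Numbers taken from the Amphetamine row: 306, then 292 (merge 58, add 234).
templates = [
    lambda: list(range(306)),
    lambda: list(range(248, 540)),  # 58 overlap with the first template
    lambda: list(range(1000, 1100)),  # never runs: limit already reached
]
results, log = run_until_limit(templates)
print(len(results), log)  # 500 [(306, 0, 306), (292, 58, 234)]
```

Merging happens before truncation here, so 540 merged results are cut to 500, matching the row.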

starting with gene

| Gene Name | Gene ID | Final results | Template breakdown and notes | Time |
| --- | --- | --- | --- | --- |
| MAPK8IP3 | NCBIGene:23162 | 4 | only 1st template, none scored. Compare to `decreases` response below | 19 s |
| TP53 | NCBIGene:7157 | 500 | only 1st template (1762 results) | 31 s |
| CYP27B1 (a P450) | NCBIGene:1594 | 500 | 48, 0, 0, 770 (merge 1, add 769), then stopped execution. 817 total, and truncated | 26 s |
| XIST | NCBIGene:7503 | 6 | only 1st template | 13 s |
| TSIX | NCBIGene:9383 | 0 | no results | 11 s |
| ADNP | NCBIGene:23394 | 297 | from 1st template (4) and 4th template (293), no merging. None scored | 22 s |
| BORCS8-MEF2B | NCBIGene:4207 | 0 | no results | 12 s |
| MMP9 | NCBIGene:4318 | 500 | only 1st template (760 results) | 20 s |
| MEF2C | NCBIGene:4208 | 500 | 48, 1, 0, 6431 (merge 7, add 6424), then stopped execution. 6473 total, and truncated | 1 min 48 s |
| MALAT1 | NCBIGene:378938 | 4 | only 1st template | 12 s |
| DYRK1A | NCBIGene:1859 | 500 | 14, 0, 0, 7982 (merge 1, add 7981), then stopped execution. 7995 total, and truncated | 2 min 1 s |
| PCSK1N | NCBIGene:27344 | 12 | from 1st template (6) and 4th template (6), no merging | 17 s |
| TTR | NCBIGene:7276 | 500 | 92, 30 (merge 1, add 29), 0, 9259 (merge 31, add 9228), then stopped execution and truncated. | 2 min 54 s |

Skipping testing GAPDH (NCBIGene:2597). Previously it ran 3 templates in ~ 1 min. Then the last hop of 4th template gets stuck at its end (ID resolution/record intersecting). I stopped after running > 14 min

Decreased

starting with chem

| Chem Name | Chem ID | Final results | Template breakdown and notes | Time |
| --- | --- | --- | --- | --- |
| Amphetamine | PUBCHEM.COMPOUND:3007 | 500 | 207, 330 (merge 34, add 296), then stopped execution. 503 total, and truncated | 25 s |
| Dextroamphetamine | PUBCHEM.COMPOUND:5826 | 237 | 21, 131, 40 (merge 11, add 29), 67 (merge 34, add 33), 31 (merge 8, add 23) | 30 s |
| (+/-)-Methylphenidate hydrochloride (curie given) | PUBCHEM.COMPOUND:44246724 | 0 | no results | 15 s |
| Methylphenidate (better curie) | PUBCHEM.COMPOUND:4158 | 335 | 113, 182 (merge 23, add 159), 42 (merge 9, add 33), 54 (merge 25, add 29), 8 (merge 7, add 1) | 34 s |
| metformin | PUBCHEM.COMPOUND:4091 | 500 | only 1st template (1549 results) | 32 s |
| Atorvastatin | PUBCHEM.COMPOUND:60823 | 500 | only 1st template (788 results) | 18 s |
| Valproic acid glucuronide (curie given) | PUBCHEM.COMPOUND:88111 | 0 | no results | 14 s |
| Valproic acid (better curie) | PUBCHEM.COMPOUND:3121 | 500 | only 1st template (2034 results) | 54 s |
| Vitamin A / retinol | PUBCHEM.COMPOUND:445354 | 500 | 353, 408 (merge 89, add 319), then stopped execution. 672 total, and truncated | 27 s |
| Vitamin C (ascorbic acid) | PUBCHEM.COMPOUND:54670067 | 500 | only 1st template (826 results) | 22 s |
| Vitamin D (Cholecalciferol) | PUBCHEM.COMPOUND:5280795 | 500 | 428, 115 (merge 36, add 79), then stopped execution. 507 total, and truncated | 23 s |
| Maltodextrin (curie given) | PUBCHEM.COMPOUND:79025 | 2 | only from 5th template. None scored | 16 s |
| Glucose (better curie) | PUBCHEM.COMPOUND:24749 | 27 | only 1st template. None scored | 15 s |
| Magnesium ion (curie given) | PUBCHEM.COMPOUND:888 | 17 | only 1st template | 15 s |
| Magnesium (atom, better curie) | PUBCHEM.COMPOUND:5462224 | 497 | 138, 266 (merge 6, add 260), 145 (merge 46, add 99), 0, 0 | 30 s |
| DHEA (Dehydroepiandrosterone) | PUBCHEM.COMPOUND:5881 | 500 | 316, 146 (merge 45, add 101), 122 (merge 54, add 68), 62 (merge 36, add 26), then stopped execution. 511 total, and truncated | 36 s |
| Testosterone | PUBCHEM.COMPOUND:6013 | 500 | only 1st template (846 results) | 24 s |
| Ethinylestradiol (curie given) | PUBCHEM.COMPOUND:5991 | 500 | 217, 2353 (merge 76, add 2277), then stopped execution. 2494 total, and truncated | 1 min 11 s |
| Estrogens (another curie) | UMLS:C0014939 | 500 | only 1st template (3294 results) | 1 min 22 s |
| Somatostatin acetate (curie given) | PUBCHEM.COMPOUND:16129681 | 39 | only 1st template | 16 s |
| Somatostatin (better curie) | PUBCHEM.COMPOUND:101826531 | 276 | 231, 62 (merge 43, add 19), 0, 95 (merge 70, add 25), 21 (merge 20, add 1) | 28 s |
| Amitriptyline | PUBCHEM.COMPOUND:2160 | 500 | 268, 253 (merge 54, add 199), 335 (merge 96, add 239), then stopped execution. 706 total, and truncated | 34 s |
| Gabapentin | PUBCHEM.COMPOUND:3446 | 224 | 131, 3, 60 (merge 11, add 49), 47 (merge 13, add 34), 13 (merge 6, add 7) | 27 s |
| Propranolol | PUBCHEM.COMPOUND:4946 | 500 | only 1st template (618 results) | 18 s |
| sumatriptan | PUBCHEM.COMPOUND:5358 | 234 | 73, 18 (merge 8, add 10), 4 (merge 2, add 2), 166 (merge 27, add 139), 37 (merge 27, add 10) | 31 s |
| d4-Palmitic acid (curie given) | PUBCHEM.COMPOUND:12358543 | 0 | no results | 14 s |
| palmitic acid (better curie) | PUBCHEM.COMPOUND:135369651 | 500 | only 1st template (693 results) | 20 s |

starting with gene

| Gene Name | Gene ID | Final results | Template breakdown and notes | Time |
|-------------|----------|--------------|-------------------------------|------|
| MAPK8IP3 | NCBIGene:23162 | 500 | 7, 0, 0, 756, then stopped execution. 763 total and truncated | 24 s |
| TP53 | NCBIGene:7157 | 500 | only 1st template (1232 results) | 27 s |
| CYP27B1 (a P450) | NCBIGene:1594 | 500 | 31, 0, 0, 6421 (merge 3, add 6418), then stopped execution. 6449 total, and truncated | 1 min 43 s |
| XIST | NCBIGene:7503 | 7 | only 1st template | 13 s |
| TSIX | NCBIGene:9383 | 0 | no results | 12 s |
| ADNP | NCBIGene:23394 | 364 | from 1st template (3) and 4th template (361), no merging. None scored | 22 s |
| BORCS8-MEF2B | NCBIGene:4207 | 0 | no results | 12 s |
| MMP9 | NCBIGene:4318 | 500 | only 1st template (1404 results) | 27 s |
| MEF2C | NCBIGene:4208 | 500 | 34, 0, 0, 6841 (merge 3, add 6838), then stopped execution. 6872 total, and truncated | 1 min 43 s |
| MALAT1 | NCBIGene:378938 | 9 | only 1st template | 13 s |
| DYRK1A | NCBIGene:1859 | 500 | 49, 0, 0, 4023 (merge 14, add 4009), then stopped execution. 4058 total, and truncated | 1 min 3 s |
| PCSK1N | NCBIGene:27344 | 0 | no results | 1 min 31 s |
| TTR | NCBIGene:7276 | 500 | 84, 7, 16, 4003 (merge 19, add 3984), then stopped execution. 4091 total, and truncated | 1 min 8 s |

Skipping testing GAPDH (NCBIGene:2597). Previously it ran 3 templates in ~ 1 min. Then the last hop of the 4th template gets stuck at its end (ID resolution/record intersecting). I stopped after running 15 min.

tokebe commented 1 year ago

@colleenXu Regarding Point 1, I think your confusion comes from assuming that the merge log is supposed to be per-template. It isn't. Records are merged per-template, but the only merge log is a summary at the end of all merged results across all templates. Please pull and run again, and you'll see the log has been updated to show the number of results merged, the number of results they were merged into, and the actual decrease in result count. If you add up the results for each template and then subtract the actual result count decrease, the math checks out (I spent a considerable amount of time verifying this last round...).
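As a worked check of that arithmetic, here is a minimal Python sketch using two rows from the "Decreased / starting with chem" table above. It is not BTE code: `final_count` is a hypothetical helper, and a template listed without a merge note is assumed to have merged 0 results.

```python
def final_count(steps):
    """steps: (results_from_template, results_merged_away) per template, in run order."""
    produced = sum(results for results, _ in steps)  # results summed over all templates
    decrease = sum(merged for _, merged in steps)    # total result count decrease from merging
    return produced - decrease


# Per-template figures copied from the table rows; (131, 0) etc. assume "no merge note" means 0.
dextroamphetamine = [(21, 0), (131, 0), (40, 11), (67, 34), (31, 8)]
methylphenidate = [(113, 0), (182, 23), (42, 9), (54, 25), (8, 7)]

print(final_count(dextroamphetamine))  # 237, matching the "Final results" column
print(final_count(methylphenidate))    # 335, matching the "Final results" column
```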

We could log per-template as well, but I don't particularly see the need to do so?

Running GAPDH on my local takes 19.9 minutes, which I agree is a little too much. Some of my optimizations may help with this, but I do think there's a case to be made for further decreasing the max records allowed.

tokebe commented 1 year ago

After a meeting with @colleenXu the problem was confirmed. I've investigated the issue and found the reason:

Multiple results from the same template can end up being merged if that template is a multi-hop and the results connect the same subject and object via different intermediate nodes. IIRC, this was intended behavior to keep results relatively well-organized. Such results would not be merged in non-creative execution (which is another question worth asking somewhere else).

As a side effect, a merge step can report more results merged than one might expect: the maximum number merged in a step is not the smaller of the current result set and the current template's result count, but the sum of the two.
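To make the counting concrete, here is a rough Python sketch of merging keyed on the (subject, object) pair. It only illustrates the effect described here and is not the actual BTE merging code; the result shape (`subject`, `object`, `via`) and the helper `merge_by_endpoints` are assumptions made for the example.

```python
from collections import defaultdict


def merge_by_endpoints(existing, incoming):
    """Group results by (subject, object); count inputs that took part in a merge."""
    groups = defaultdict(list)
    for result in existing + incoming:
        groups[(result["subject"], result["object"])].append(result)
    # Every input result that shares its key with at least one other result was "merged".
    merged_inputs = sum(len(group) for group in groups.values() if len(group) > 1)
    return dict(groups), merged_inputs


existing = [{"subject": "chemA", "object": "geneX", "via": None}]
incoming = [  # one multi-hop template, two paths between the same endpoints
    {"subject": "chemA", "object": "geneX", "via": "mid1"},
    {"subject": "chemA", "object": "geneX", "via": "mid2"},
    {"subject": "chemB", "object": "geneX", "via": "mid1"},
]

groups, merged_inputs = merge_by_endpoints(existing, incoming)
print(len(groups), merged_inputs)  # 2 3
```

Here 3 inputs are merged even though the smaller of the two sets has only 1 result; the upper bound is `len(existing) + len(incoming)`.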

@colleenXu I'm working on a fix to change the logging behavior to explicitly point this out when it occurs, and will push that change to this branch (and main) when it's done.

tokebe commented 1 year ago

@colleenXu I've pushed multiple creative mode logging fixes and improvements and the math appears to check out now. Please run a couple tests and let me know if the logging seems better.

colleenXu commented 1 year ago

@tokebe Err...I'm not sure if you missed my update 3 yesterday (the internal Slack thread here).

I think the issue was that I didn't set the is_set: true parameter for intermediate QNodes in the templates. I pushed a commit here and tested, and then the "merging" logs looked reasonable...
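For reference, a minimal sketch of what setting `is_set` on an intermediate QNode could look like, written as a Python dict that mirrors a TRAPI query graph. The node and edge keys, categories, and predicates are placeholders rather than the actual template contents, and `intermediates_missing_is_set` is a hypothetical check.

```python
# Illustrative two-hop template: chemical -> intermediate gene -> query gene.
template = {
    "nodes": {
        "n_chem": {"categories": ["biolink:ChemicalEntity"]},
        "n_mid": {"categories": ["biolink:Gene"], "is_set": True},  # intermediate QNode
        "n_gene": {"categories": ["biolink:Gene"]},
    },
    "edges": {
        "e0": {"subject": "n_chem", "object": "n_mid", "predicates": ["biolink:affects"]},
        "e1": {"subject": "n_mid", "object": "n_gene", "predicates": ["biolink:affects"]},
    },
}


def intermediates_missing_is_set(qgraph, endpoints):
    """Return keys of non-endpoint QNodes that do not have is_set set to True."""
    return [
        key
        for key, node in qgraph["nodes"].items()
        if key not in endpoints and node.get("is_set") is not True
    ]


print(intermediates_missing_is_set(template, {"n_chem", "n_gene"}))  # []
```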

colleenXu commented 1 year ago

Feedback:

APIs summary log at the end is broken when only the first template is run?

   2023-01-25T03:02:54.891Z INFO:    [Template-1]: Execution Summary: (906) nodes / (942) edges / (905) results; (3/36) queries returned results from (2) unique APIs
   2023-01-25T03:02:54.891Z INFO:    [Template-1]: APIs: BioThings SEMMEDDB API, Text Mining Targeted Association API
   2023-01-25T03:02:54.895Z INFO:    (0) results from Template-1 were merged with other results from the template. (0) results were merged with existing results from previous templates. Current result count is 905 (+905)
   2023-01-25T03:02:54.895Z INFO:    Addition of 905 results from Template 1 exceeds creative result maximum of 500 (reaching 905 merged). Response will be truncated to top-scoring 500 results. Skipping remaining 4 templates.
   2023-01-25T03:02:54.895Z INFO:    Final result count (before truncation): 905
   2023-01-25T03:02:54.897Z INFO:    Execution Summary: (501) nodes / (537) edges / (500) results; (0/36) queries returned results from (0) unique APIs
   2023-01-25T03:02:54.897Z INFO:    APIs:
   2023-01-25T03:02:54.897Z INFO:    Scoring Summary: (273) scored / (227) unscored

Otherwise, new logs look good!
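The stop-and-truncate behavior these log lines describe can be sketched roughly as follows. This is a simplified Python illustration, not the BTE implementation: `run_templates` is a hypothetical function, results are reduced to `(id, score)` pairs, duplicates keep the first score seen, and unscored results (`None`) are assumed to sort last.

```python
MAX_RESULTS = 500  # the "creative result maximum" reported in the log


def run_templates(template_results, max_results=MAX_RESULTS):
    """template_results: one list of (result_id, score) pairs per template, in run order."""
    merged = {}
    templates_run = 0
    for results in template_results:
        templates_run += 1
        for result_id, score in results:
            merged.setdefault(result_id, score)  # a duplicate id counts as a merge
        if len(merged) > max_results:
            break  # skip the remaining templates
    # Scored results first (highest score first), unscored results last.
    ranked = sorted(merged.items(), key=lambda item: (item[1] is None, -(item[1] or 0)))
    return ranked[:max_results], templates_run, len(merged)


# First template alone returns 905 results, as in the log above.
first = [(f"r{i}", float(i)) for i in range(905)]
kept, templates_run, before_truncation = run_templates([first, [("extra", 1.0)]])
print(len(kept), templates_run, before_truncation)  # 500 1 905
```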

tokebe commented 1 year ago

I didn’t miss your update. Regardless of is_set behavior, the behavior without is_set looked wrong and needed fixing. I confirmed what was wrong with log clarity for those cases and fixed them.

Fix for the API end summary incoming.

tokebe commented 1 year ago

Pushed the fix; yet another fun case of a change somehow not making it into a commit while silently remaining on my local, making me think I'm losing my mind lol.

colleenXu commented 1 year ago

Sorry for the late reply. I reran a bunch of queries and I think things look good! I like the new logs.

Perhaps we're ready to make a request for ITRB CI?

colleenXu commented 1 year ago

Note that templates were changed, replacing physically_interacts_with predicate with interacts_with (more general). Allows us to use dgidb for those templates (and mychem after its edits https://github.com/biothings/pending.api/issues/101#issuecomment-1418656362)

https://github.com/biothings/bte_trapi_query_graph_handler/pull/135

tokebe commented 1 year ago

Deployed to prod 🚀

colleenXu commented 1 year ago

Noting here just in case:

Old template ideas that weren't implemented (intended effect is "downregulates"):

Chem downregulates (non-human) gene, then Gene is ortholog of human gene
- not sure that this would work, especially when I can't specify human vs non-human genes right now
Chem upregulates a Gene, then that Gene does a BP that negatively-regulates another BP done by another Gene

biothings / biothings_explorer

Create creative mode templates for "What chemicals [qualified] affect (increases/decreases) a given protein/gene?" #532

