biothings / biothings_explorer

TRAPI service for BioThings Explorer
https://explorer.biothings.io
Apache License 2.0

Create creative mode templates for "What chemicals [qualified] affect (increases/decreases) a given protein/gene?" #532

Closed andrewsu closed 1 year ago

andrewsu commented 1 year ago

Translator consortium target for implementation in Feb 2023. Exact TRAPI query templates will be created soon per Architecture meeting 2022-12-06...

EDIT Implementation dates:

colleenXu commented 1 year ago

Note:

Possibly relevant, from Slack:

Andy Crouse (Unsecret Agent, UI team) 1:13 PM This is the test set of genes and drugs for the next creative mode query. Note there are some ‘destroyer of worlds’ genes and drugs (valproic acid, vitamin A, TP53, P450), so if the lights dim when testing, that is why. But they should be good for pressure testing! I included some that are not classically druggable, like non-coding genes and transcription factors, so hopefully this will cover the gamut. https://docs.google.com/document/d/1X3XbHhS_AIGkSyxSaYUuiCqVNh33Wz7f3EEyKZEctL8/edit?usp=sharing

andrewsu commented 1 year ago

The proposed TRAPI query templates are defined in these issues:

tokebe commented 1 year ago

#534 is now addressed in the add-qualifiers PRs, so integration testing is now possible.

colleenXu commented 1 year ago

Noting:

tokebe commented 1 year ago

@colleenXu Presently, the qualifier-matching only supports one qualifier-set per templateGroup. Would you prefer I change the implementation to support multiple qualifier-sets in a templateGroup for OR matching within a single group (presently, this would be covered by making multiple templateGroups)?

Examples:

Presently:

{
    "name": "Some template group",
    "subject": ["ChemicalEntity"],
    "predicate": ["affects"],
    "object": ["Gene"],
    "qualifiers": {
        "some_qualifier_type": "some_qualifier_value"
    }
}

Proposed:

{
    "name": "Some template group",
    "subject": ["ChemicalEntity"],
    "predicate": ["affects"],
    "object": ["Gene"],
    "qualifiers": [
        {
            "some_qualifier_type": "some_qualifier_value"
        },
        {
            "some_other_type": "this_would_be_OR"
        }
    ]
}
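
To make the difference between the two shapes concrete, here is a minimal Python sketch (BTE itself is not written in Python). The function names `as_qualifier_sets` and `matches_any_set` are hypothetical, and the per-set test (every query qualifier must appear in the templateGroup's set) is an assumption, not a description of the actual implementation. It only shows how a single object and a list of objects could be normalized so that the list form behaves as OR.

```python
from typing import Dict, List, Mapping, Sequence, Union

QualifierSet = Mapping[str, str]
GroupQualifiers = Union[QualifierSet, Sequence[QualifierSet]]


def as_qualifier_sets(group_qualifiers: GroupQualifiers) -> List[Dict[str, str]]:
    """Normalize the present form (one object) and the proposed form (a list) to a list."""
    if isinstance(group_qualifiers, Mapping):
        return [dict(group_qualifiers)]
    return [dict(qualifier_set) for qualifier_set in group_qualifiers]


def matches_any_set(query_qualifiers: QualifierSet, group_qualifiers: GroupQualifiers) -> bool:
    """OR across qualifier-sets: a single matching set is enough.

    Assumed per-set test: every (type, value) pair of the query is in the set.
    """
    query_items = dict(query_qualifiers).items()
    return any(
        query_items <= qualifier_set.items()
        for qualifier_set in as_qualifier_sets(group_qualifiers)
    )


present = {"some_qualifier_type": "some_qualifier_value"}
proposed = [
    {"some_qualifier_type": "some_qualifier_value"},
    {"some_other_type": "this_would_be_OR"},
]
query = {"some_other_type": "this_would_be_OR"}

print(matches_any_set(query, present))   # the single set does not contain the query pair
print(matches_any_set(query, proposed))  # the second set in the list does
```

With the present form, the same effect needs a second templateGroup that carries the other qualifier-set.
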
colleenXu commented 1 year ago

I think using 1 qualifier-set for matching to templateGroup is fine because that's what we see in this issue (and near-future).

In the third bullet point of my post above, I'm saying that for OUR individual templates, I'll need to use multiple qualifier-sets to query for "activity" vs "activity_or_abundance" because we don't do qualifier-hierarchy expansion yet.

colleenXu commented 1 year ago

Knowledge resources related to this set of creative-mode questions (My analysis)

This is a work-in-progress, so edits will be done over time

Resources ready and BTE already uses

Note that the Multiomics APIs have some potentially relevant info, but we have some concerns:

  • drug response: has gene-gene relationships, for negative and positive correlation. But not clear where this comes from or how to use this
  • Wellness: has chem-chem and chem-gene relationships, for correlation and related_to. But not clear where this comes from or how to use this

Want update and BTE already uses

Semmeddb

BindingDB

Want new pending API

MyChem chembl.drug_mechanisms data, in subject-association-object format

MyChem drugcentral.bioactivity data, in subject-association-object format

Not promising

Bioplanet

  • within a pathway, there can be "activation" and "inhibition" arrows. However....
    • I find these hard to understand (see [example](https://tripod.nih.gov/bioplanet/detail.jsp?pid=bioplanet_49&target=pathway))
    • the actions may be done by complexes or done on complexes (not individual genes or chemicals)
    • not clear how many pathways involve exogenous chemicals...
    • I don't see an easy download/way of parsing these pathways to get "chemical-gene" associations (biopax?)

colleenXu commented 1 year ago

Results for the direct-increases.json template with Gene Inputs

Reasonable number of results

| Gene Name | Gene ID | Time to run (s) | Number of results | Notes |
| --- | --- | --- | --- | --- |
| MAPK8IP3 | NCBIGene:23162 | 11 | 4 | but only 1 looks correct; Text-Mining |
| P450 (CYP27B1?) | NCBIGene:1594 | 12 | 48 | semmeddb + text-mining |
| XIST | NCBIGene:7503 | 12 | 4 | semmeddb |
| ADNP | NCBIGene:23394 | 12 | 4 | semmeddb + text-mining |
| MEF2C | NCBIGene:4208 | 12 | 48 | text-mining |
| MALAT1 | NCBIGene:378938 | 11 | 3 | semmeddb |
| DYRK1A | NCBIGene:1859 | 11 | 14 | text-mining |
| PCSK1N | NCBIGene:27344 | 11 | 6 | semmeddb + text-mining |
| TTR | NCBIGene:7276 | 12 | 91 | semmeddb + text-mining |
| GNAS | NCBIGene:2778 | 11 | 9 | semmeddb |
| IAPP | NCBIGene:3375 | 12 | 51 | text-mining |
| SCG5 | NCBIGene:6447 | 11 | 1 | text-mining |
| B2M | NCBIGene:567 | 12 | 66 | text-mining |
| Cytochrome B | NCBIGene:4519 | 12 | 16 | text-mining |
| RBP4 | NCBIGene:5950 | 12 | 16 | text-mining |
| NGLY1 | NCBIGene:55768 | 11 | 2 | text-mining |
| RHOBTB2 | NCBIGene:23221 | 11 | 3 | text-mining |
| WDFY3 | NCBIGene:23001 | 12 | 5 | text-mining |
| KCNT1 | NCBIGene:57582 | 11 | 5 | **DGIDB** + text-mining |
| NALCN | NCBIGene:259232 | 11 | 1 | semmeddb |
| CACNA1A | NCBIGene:773 | 11 | 2 | text-mining |
| BICD2 | NCBIGene:23299 | 10 | 4 | semmeddb + text-mining |
| FHL1 (CFH) | NCBIGene:3075 | 11 | 4 | semmeddb + text-mining |
| SETBP1 | NCBIGene:26040 | 12 | 3 | text-mining |

Many results (exploding)

| Gene Name | Gene ID | Time to run (s) | Number of results | Notes |
| --- | --- | --- | --- | --- |
| TP53 | NCBIGene:7157 | 32 | 1744 | semmeddb + text-mining |
| MMP9 | NCBIGene:4318 | 20 | 753 | semmeddb + text-mining |
| GAPDH | NCBIGene:2597 | 15 | 220 | semmeddb + text-mining |
| Cytochrome oxidase | NCBIGene:4512 | 14 | 177 | semmeddb + text-mining |
| PPP3CA | NCBIGene:5530 | 14 | 182 | semmeddb + text-mining |
| ATF3 | NCBIGene:467 | 14 | 226 | semmeddb + text-mining |

No results

| Gene Name | Gene ID | Time to run (s) | Number of results | Notes |
| --- | --- | --- | --- | --- |
| TSIX | NCBIGene:9383 | 13 | 0 | |
| BORCS8-MEF2B | NCBIGene:4207 | 9 | 0 | |
| FAU | NCBIGene:2197 | 10 | 0 | |
| DHX30 | NCBIGene:22907 | 10 | 0 | |
| SAMD9L | NCBIGene:219285 | 10 | 0 | |
| Engase | NCBIGene:64772 | 10 | 0 | |

colleenXu commented 1 year ago

I found that 2 templateGroup items worked: one for Chem-increases-Gene and one for Chem-decreases-Gene (input ID can be on the Chem QNode or the Gene QNode). I wrote the templateGroups and the templates @andrewsu + I discussed on Monday and did a bit of testing (see next post). These are in this branch https://github.com/biothings/bte_trapi_query_graph_handler/pull/132

Template ideas we decided to write up:

  • Chem (increases/decreases) gene (direct)
  • Chem increases another gene, then that gene (upregs/downregs) Gene (upregs -> overall `increase`, downregs -> overall `decrease`)
    • I modified QEdges to be more specific: removing `increased activity_or_abundance` for eA and `increased`/`decreased` `activity_or_abundance` for eB. This removed semmeddb and some dgidb edges...
  • Chem decreases another gene, then that gene (downregs/upregs) Gene (downregs -> overall `increase`, upregs -> overall `decrease`)
    • I modified QEdges to be more specific (removing `increased activity_or_abundance` for eA and `increased`/`decreased` `activity_or_abundance` for eB). This removed semmeddb and some dgidb edges...
  • Chem interacts with another gene, then that gene (upreg/downreg) gene
    • I modified QEdges to be more specific: using `physically_interacts_with` for eA and `increased`/`decreased` `activity_or_abundance` for eB. This removed semmeddb edges and a major dgidb edge...
  • Chem interacts with gene (direct)
    • I modified QEdges to be more specific: using `physically_interacts_with` for eA. This removed semmeddb edges and a major dgidb edge...

colleenXu commented 1 year ago

How I tested

Check out fix_issue_532 branch for query-handler and dev branches for all other modules (including bte-trapi-workspace). For the "user"/"from UI/ARS" query...the main skeleton of the query is this. What varies is (1) which QNode also has an ids field with 1 user-submitted ID and (2) whether the object_direction_qualifier value is increased or decreased (I put X there).

skeleton

```
{
    "message": {
        "query_graph": {
            "nodes": {
                "gene": {
                    "categories": ["biolink:Gene"]
                },
                "chemical": {
                    "categories": ["biolink:ChemicalEntity"]
                }
            },
            "edges": {
                "t_edge": {
                    "object": "gene",
                    "subject": "chemical",
                    "predicates": ["biolink:affects"],
                    "knowledge_type": "inferred",
                    "qualifier_constraints": [
                        {
                            "qualifier_set": [
                                {
                                    "qualifier_type_id": "biolink:object_aspect_qualifier",
                                    "qualifier_value": "activity_or_abundance"
                                },
                                {
                                    "qualifier_type_id": "biolink:object_direction_qualifier",
                                    "qualifier_value": "X"
                                }
                            ]
                        }
                    ]
                }
            }
        }
    }
}
```

So here's an example of ChemX (sumatriptan) increases genes

```
{
    "message": {
        "query_graph": {
            "nodes": {
                "gene": {
                    "categories": ["biolink:Gene"]
                },
                "chemical": {
                    "categories": ["biolink:ChemicalEntity"],
                    "ids": ["PUBCHEM.COMPOUND:5358"]
                }
            },
            "edges": {
                "t_edge": {
                    "object": "gene",
                    "subject": "chemical",
                    "predicates": ["biolink:affects"],
                    "knowledge_type": "inferred",
                    "qualifier_constraints": [
                        {
                            "qualifier_set": [
                                {
                                    "qualifier_type_id": "biolink:object_aspect_qualifier",
                                    "qualifier_value": "activity_or_abundance"
                                },
                                {
                                    "qualifier_type_id": "biolink:object_direction_qualifier",
                                    "qualifier_value": "increased"
                                }
                            ]
                        }
                    ]
                }
            }
        }
    }
}
```
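
Since only two things vary between test queries (which QNode carries the single ID, and the direction qualifier value), the queries could be generated rather than hand-edited. The following is a rough Python sketch of that idea; the helper name `build_creative_query` and its parameters are hypothetical and are not part of BTE.

```python
import json
from typing import Optional


def build_creative_query(
    direction: str,
    chemical_id: Optional[str] = None,
    gene_id: Optional[str] = None,
) -> dict:
    """Fill in the skeleton: one input ID on one QNode, plus the direction value."""
    if direction not in ("increased", "decreased"):
        raise ValueError("direction must be 'increased' or 'decreased'")
    if (chemical_id is None) == (gene_id is None):
        raise ValueError("provide exactly one of chemical_id or gene_id")

    nodes = {
        "gene": {"categories": ["biolink:Gene"]},
        "chemical": {"categories": ["biolink:ChemicalEntity"]},
    }
    if chemical_id is not None:
        nodes["chemical"]["ids"] = [chemical_id]
    else:
        nodes["gene"]["ids"] = [gene_id]

    qualifier_set = [
        {
            "qualifier_type_id": "biolink:object_aspect_qualifier",
            "qualifier_value": "activity_or_abundance",
        },
        {
            "qualifier_type_id": "biolink:object_direction_qualifier",
            "qualifier_value": direction,
        },
    ]
    edge = {
        "object": "gene",
        "subject": "chemical",
        "predicates": ["biolink:affects"],
        "knowledge_type": "inferred",
        "qualifier_constraints": [{"qualifier_set": qualifier_set}],
    }
    return {"message": {"query_graph": {"nodes": nodes, "edges": {"t_edge": edge}}}}


# Same query as the sumatriptan example above.
query = build_creative_query("increased", chemical_id="PUBCHEM.COMPOUND:5358")
print(json.dumps(query, indent=4))
```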

Discussed

EDIT: discussed below + during Wednesday 1/11 meeting. Decisions added to each point

@tokebe

After testing https://github.com/biothings/bte_trapi_query_graph_handler/pull/132 (it's branched off dev, and I tested it with all my other branches as dev)....overall it looks close to done. However, I have the following questions / issues:

  1. Looking at the logs, I'm not sure if some results are "pruned / removed" incorrectly. A confounding / related issue is that I don't see "same results from multiple templates merging" console logs and the "result merging" TRAPI-logs are buggy. -> JC to look into this (aka issues with creative-mode)
    • Using the chem sumatriptan PUBCHEM.COMPOUND:5358 is a good example (runs fairly fast, results from multiple templates and merging happens). See its entries in the Chem tables of the next post.
  2. BTE isn't doing "exact qualifier-set" matching to templateGroups. Is that alright? -> CX asked Translator group (Translator private Slack link). This behavior is fine right now
    • For example, if I send a query with an "inferred" "activity_or_abundance" QEdge (no increased/decreased direction qualifier), all 10 templates would run...
query with no direction qualifier so all (EDIT: 9) templates will be loaded

```
{
    "message": {
        "query_graph": {
            "nodes": {
                "gene": {
                    "categories": ["biolink:Gene"]
                },
                "chemical": {
                    "categories": ["biolink:ChemicalEntity"],
                    "ids": ["PUBCHEM.COMPOUND:5358"]
                }
            },
            "edges": {
                "t_edge": {
                    "object": "gene",
                    "subject": "chemical",
                    "predicates": ["biolink:affects"],
                    "knowledge_type": "inferred",
                    "qualifier_constraints": [
                        {
                            "qualifier_set": [
                                {
                                    "qualifier_type_id": "biolink:object_aspect_qualifier",
                                    "qualifier_value": "activity_or_abundance"
                                }
                            ]
                        }
                    ]
                }
            }
        }
    }
}
```
  3. I noticed something odd with the "limiting execution" code - where the records accumulating for a QEdge seemed a lot higher than the stated limit...The console log said bte:call-apis:query QEdge eA obtained 59903 records, exceeding maximum of 30000. Skipping remaining 1 (0 planned/1 paged) queries for this edge. Your query may be too general? +0ms. -> It checks after each individual sub-query. Sometimes the excess is a lot! JC will adjust code to truncate and remove the records over the maximum (he'll decide how to implement).
    • this log happened during the 4th template for Chem decreases GAPDH/NCBIGene:2597. It then was taking a very long time for the ID resolution (so I killed the execution manually).
    • Something similar happens for increases GAPDH but there are between 30k-40k records there (still takes too long for ID resolution).
    • See GAPDH's entries in the Gene tables of the next post.
Starting query for Chem decreases GeneY (GAPDH)

```
{
    "message": {
        "query_graph": {
            "nodes": {
                "gene": {
                    "categories": ["biolink:Gene"],
                    "ids": ["NCBIGene:2597"]
                },
                "chemical": {
                    "categories": ["biolink:ChemicalEntity"]
                }
            },
            "edges": {
                "t_edge": {
                    "object": "gene",
                    "subject": "chemical",
                    "predicates": ["biolink:affects"],
                    "knowledge_type": "inferred",
                    "qualifier_constraints": [
                        {
                            "qualifier_set": [
                                {
                                    "qualifier_type_id": "biolink:object_aspect_qualifier",
                                    "qualifier_value": "activity_or_abundance"
                                },
                                {
                                    "qualifier_type_id": "biolink:object_direction_qualifier",
                                    "qualifier_value": "increased"
                                }
                            ]
                        }
                    ]
                }
            }
        }
    }
}
```
  4. Still having issues with the logs summarizing the execution of each QEdge: "eA03 execution: 0 queries (0 success/0 fail) and (0) cached qEdges return (80) records". I mentioned that here (lab's internal Slack) -> JC to look into this (aka issues with creative-mode)
  5. I think we can consider drastically dropping the max number of results to return (maybe 200? 400?). -> we discussed and agreed to drop max number to 500 for all creative-mode (same as ARAX)
colleenXu commented 1 year ago

Testing

[reverted; this records the testing done 1/10]

Increases

| Chem Name | Chem ID | Final results | Template breakdown and notes |
| --- | --- | --- | --- |
| metformin | PUBCHEM.COMPOUND:4091 | 1000 | > 1000 on first template |
| palmitic acid | PUBCHEM.COMPOUND:12358543 | 0 | hmmm...example of no results |
| sumatriptan | PUBCHEM.COMPOUND:5358 | 137 | 37, 4, 2, 85, 32 (160 sum). Results dropping or merging? |

| Gene Name | Gene ID | Final results | Template breakdown and notes |
| --- | --- | --- | --- |
| MAPK8IP3 | NCBIGene:23162 | 4 | 4 from first template, none from the rest. Compare to decreases entry. |
| P450 (CYP27B1?) | NCBIGene:1594 | 832 | 48, 0, 0, 770, 18 (836 sum). Results dropping or merging? |
| MEF2C | NCBIGene:4208 | 1000 | 48, 1, 0, 6453 (stop). |
| GAPDH | NCBIGene:2597 | presume 1000 | 220, 88, 112, then get stuck on 4th template's "physically_interacts_with" QEdge execution. Example of explosion. |

Decreases

| Chem Name | Chem ID | Final results | Template breakdown and notes |
| --- | --- | --- | --- |
| metformin | PUBCHEM.COMPOUND:4091 | 1000 | > 1000 on first template |
| palmitic acid | PUBCHEM.COMPOUND:12358543 | 0 | hmmm...example of no results |
| sumatriptan | PUBCHEM.COMPOUND:5358 | 177 | 56, 19, 3, 132, 32 (242 sum). Results dropping or merging? |

| Gene Name | Gene ID | Final results | Template breakdown and notes |
| --- | --- | --- | --- |
| MAPK8IP3 | NCBIGene:23162 | 761 | 7, 0, 0, 754, 0 (sum 761, so no dropping/merging) |
| P450 (CYP27B1?) | NCBIGene:1594 | 1000 | 31, 0, 0, 6694 (stop) |
| MEF2C | NCBIGene:4208 | 1000 | 34, 0, 0, 6923 (stop) |
| GAPDH | NCBIGene:2597 | presume 1000 | 290, 120, 24, then get stuck on 4th template's "physically_interacts_with" QEdge execution. Example of explosion. |

tokebe commented 1 year ago
  1. I'll have to take a closer look at issues with results merging/pruning.
  2. I had previously stated that, currently, templates are matched purely by the query qualifier set being a subset of the templateGroup, rather than an exact match (i.e. both sets are equivalent, templateGroup has no additional qualifiers). I can change this to expecting an exact match instead?
  3. the record cutoff is checked after each query. If a query returns an absurd number of records, it's possible to exceed the record cutoff by an absurd amount. Should we instead delete records above the cutoff? I'd like @andrewsu's opinion on this as well.
  4. Looking into it.
  5. I don't have a problem with this -- @andrewsu?
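
For point 2, the two matching rules can be contrasted with a small Python sketch. The function names `subset_match` and `exact_match` are hypothetical, and the two template-group qualifier sets are illustrative reconstructions from the queries in this thread, not copies of the actual template files.

```python
from typing import Dict, List, Set, Tuple

Qualifier = Dict[str, str]


def to_pairs(qualifier_set: List[Qualifier]) -> Set[Tuple[str, str]]:
    """Reduce a TRAPI-style qualifier_set to a set of (type, value) pairs."""
    return {(q["qualifier_type_id"], q["qualifier_value"]) for q in qualifier_set}


def subset_match(query: List[Qualifier], group: List[Qualifier]) -> bool:
    """Current behavior as described: query qualifiers are a subset of the group's."""
    return to_pairs(query) <= to_pairs(group)


def exact_match(query: List[Qualifier], group: List[Qualifier]) -> bool:
    """Alternative: both sets are equivalent, the group has no extra qualifiers."""
    return to_pairs(query) == to_pairs(group)


aspect = {
    "qualifier_type_id": "biolink:object_aspect_qualifier",
    "qualifier_value": "activity_or_abundance",
}


def direction(value: str) -> Qualifier:
    return {"qualifier_type_id": "biolink:object_direction_qualifier", "qualifier_value": value}


increases_group = [aspect, direction("increased")]
decreases_group = [aspect, direction("decreased")]
aspect_only_query = [aspect]

for name, group in (("increases", increases_group), ("decreases", decreases_group)):
    print(name, subset_match(aspect_only_query, group), exact_match(aspect_only_query, group))
```

Under subset matching, a query with only the aspect qualifier matches both groups, which is why all of their templates load; under exact matching it would match neither.
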
andrewsu commented 1 year ago
  2. @colleenXu can you post the question to one of the translator slack channels? I can see arguments both ways, and I think this is something that should be standardized across ARAs.
  3. Originally my impression was that if we have the results beyond our threshold, might as well keep them. But I hadn't thought of the fact that other downstream things need to happen (like ID resolution). If those downstream things are essentially preventing BTE from returning the partial results, then I vote for deleting excess records. If not (and the issue is only the oddity of exceeding our threshold by a significant margin), then I vote for keeping the excess records. EDIT: decision made during 2023-01-11 meeting to delete excess records.
  5. I'd suggest 500 to match what ARAX returns.

tokebe commented 1 year ago

@colleenXu I've fixed the results merging count and logging. As a result, I can confirm that no results are being dropped. Running sumatriptan, I get 433 results added across the templates, and 244 results in the final response. Logging now shows 189 results that were combined. 433 - 189 = 244, everything checks out. I was actually running in circles for a while because I thought I had to account for the number of results merged into, but this is simply a count of results from the 244 that were combined from multiple templates.
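
The accounting described here can be written down as a one-line check. This is only a Python sketch of the arithmetic in this comment; `check_merge_accounting` is a hypothetical name, not a function in BTE.

```python
def check_merge_accounting(added_across_templates: int, combined: int, final_results: int) -> bool:
    """True when results added minus results combined equals the final result count."""
    return added_across_templates - combined == final_results


# Numbers from the sumatriptan run quoted above: 433 - 189 = 244.
print(check_merge_accounting(433, 189, 244))
```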

I'm not sure what's different between our locals that causes the difference in numbers (perhaps I haven't updated specs recently?), but all results are accounted for, and no code is dropping results.

I've also fixed issue 4 -- a piece of code in call-apis was still using pre-qedge-refactor accessors, breaking the counts.

Additionally, I've pushed changes so records are truncated to the cutoff, and changed the creative limit to 500.
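
As a rough illustration of the truncation behavior (not the actual call-apis code), the Python sketch below checks the cutoff after each sub-query and then drops the excess. `collect_records` and its arguments are hypothetical; the batch sizes are chosen to reproduce the 59903-versus-30000 case from the console log quoted earlier in the thread.

```python
from typing import List, Tuple, TypeVar

T = TypeVar("T")


def collect_records(sub_query_batches: List[List[T]], max_records: int) -> Tuple[List[T], int]:
    """Accumulate records per sub-query, stop once over the cutoff, truncate to it.

    Returns the kept records and the number of sub-queries that were skipped.
    """
    records: List[T] = []
    for index, batch in enumerate(sub_query_batches):
        records.extend(batch)
        # The cutoff is only checked after a whole sub-query has returned,
        # so one large batch can overshoot by a wide margin.
        if len(records) > max_records:
            skipped = len(sub_query_batches) - index - 1
            del records[max_records:]  # delete everything above the cutoff
            return records, skipped
    return records, 0


batches = [["a"] * 20000, ["b"] * 39903, ["c"] * 5]  # 59903 records after the 2nd sub-query
kept, skipped = collect_records(batches, max_records=30000)
print(len(kept), skipped)
```

Truncating here keeps downstream steps such as ID resolution from having to process the overshoot.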

All set to proceed with testing again 👍

colleenXu commented 1 year ago

@tokebe

Questions I have after my second round of testing:

{
    "message": {
        "query_graph": {
            "nodes": {
                "gene": {
                    "categories": ["biolink:Gene"]
                },
                "chemical": {
                    "categories": ["biolink:ChemicalEntity"],
                    "ids": ["PUBCHEM.COMPOUND:5358"]
                }    
            },
            "edges": {
                "t_edge": {
                    "object": "gene",
                    "subject": "chemical",
                    "predicates": ["biolink:affects"],
                    "knowledge_type": "inferred",
                    "qualifier_constraints": [
                        {
                            "qualifier_set": [
                                {
                                    "qualifier_type_id": "biolink:object_aspect_qualifier",
                                    "qualifier_value": "activity_or_abundance"
                                },
                                {
                                    "qualifier_type_id": "biolink:object_direction_qualifier",
                                    "qualifier_value": "decreased"
                                }
                            ]
                        }
                    ]
                }
            }
        }
    }
}

I circled in orange the numbers mentioned above. (Screenshot: Screen Shot 2023-01-19 at 3 50 51 PM)

TRAPI query for `increased` estrogen + screenshot of logs

For `increases`, the estrogen query logs say 271 + 5692 results have 3409 results "merged" to "1001 final results".

```
{
    "message": {
        "query_graph": {
            "nodes": {
                "gene": {
                    "categories": ["biolink:Gene"]
                },
                "chemical": {
                    "categories": ["biolink:ChemicalEntity"],
                    "ids": ["PUBCHEM.COMPOUND:5991"]
                }
            },
            "edges": {
                "t_edge": {
                    "object": "gene",
                    "subject": "chemical",
                    "predicates": ["biolink:affects"],
                    "knowledge_type": "inferred",
                    "qualifier_constraints": [
                        {
                            "qualifier_set": [
                                {
                                    "qualifier_type_id": "biolink:object_aspect_qualifier",
                                    "qualifier_value": "activity_or_abundance"
                                },
                                {
                                    "qualifier_type_id": "biolink:object_direction_qualifier",
                                    "qualifier_value": "increased"
                                }
                            ]
                        }
                    ]
                }
            }
        }
    }
}
```

I circled in orange the numbers mentioned above.

![Screen Shot 2023-01-19 at 11 24 06 PM](https://user-images.githubusercontent.com/43731687/213640157-e97114a2-f0c4-4eca-9a59-ce5b7931d21d.png)

(Points 4-5 from the post have been addressed by the recent changes (yay), and point 2 is still pending / doesn't seem to be an issue)

colleenXu commented 1 year ago

Second round of testing

[EDITED 1/23-1/26 after is_set: true added to templates and fixes added to logging. Rearranged list to match original Translator issue curie lists]

I tested more chemicals and genes this time, including all chemicals listed in the Translator posts

Increased

starting with chem

| Chem Name | Chem ID | Final results | Template breakdown and notes | Time |
| --- | --- | --- | --- | --- |
| Amphetamine | PUBCHEM.COMPOUND:3007 | 500 | 306, 292 (merge 58, add 234), then stopped execution. 540 total and truncated | 33 s |
| Dextroamphetamine | PUBCHEM.COMPOUND:5826 | 230 | 36, 104 (merge 8, add 96), 60 (merge 4, add 56), 54 (merge 36, add 18), 31 (merge 7, add 24) | 28 s |
| (+/-)-Methylphenidate hydrochloride (curie given) | PUBCHEM.COMPOUND:44246724 | 0 | no results | 14 s |
| Methylphenidate (better curie) | PUBCHEM.COMPOUND:4158 | 314 | 131, 139 (merge 29, add 110), 59 (merge 11, add 48), 40 (merge 20, add 20), 8 (merge 3, add 5) | 35 s |
| metformin | PUBCHEM.COMPOUND:4091 | 500 | only 1st template (1184 results) | 27 s |
| Atorvastatin | PUBCHEM.COMPOUND:60823 | 500 | only 1st template (564 results) | 17 s |
| Valproic acid glucuronide (curie given) | PUBCHEM.COMPOUND:88111 | 0 | no results | 14 s |
| Valproic acid (better curie) | PUBCHEM.COMPOUND:3121 | 500 | only 1st template (2553 results) | 1 min 7 s |
| Vitamin A / retinol | PUBCHEM.COMPOUND:445354 | 500 | 487, 359 (merge 86, add 273), then stopped execution. 760 total, and truncated | 32 s |
| Vitamin C (ascorbic acid) | PUBCHEM.COMPOUND:54670067 | 500 | only 1st template (850 results) | 22 s |
| Vitamin D (Cholecalciferol) | PUBCHEM.COMPOUND:5280795 | 500 | only 1st template (537 results) | 16 s |
| Maltodextrin (curie given) | PUBCHEM.COMPOUND:79025 | 3 | 1 from 4th template, 2 from 5th template. None merged. | 17 s |
| Glucose (better curie) | PUBCHEM.COMPOUND:24749 | 54 | only 1st template. But none scored... | 15 s |
| Magnesium ion (curie given) | PUBCHEM.COMPOUND:888 | 46 | only 1st template | 14 s |
| Magnesium (atom, better curie) | PUBCHEM.COMPOUND:5462224 | 500 | 211, 236 (merge 9, add 227), 212 (merge 52, add 160), then stopped execution. 598 total and truncated | 23 s |
| DHEA (Dehydroepiandrosterone) | PUBCHEM.COMPOUND:5881 | 500 | 438, 121 (merge 46, add 75), then stopped execution. 513 total, and truncated | 25 s |
| Testosterone | PUBCHEM.COMPOUND:6013 | 500 | only 1st template (1069 results) | 30 s |
| Ethinylestradiol (curie given) | PUBCHEM.COMPOUND:5991 | 500 | 271, 2390 (merge 107, add 2283), then stopped execution. 2554 total, and truncated | 1 min 19 s |
| Estrogens (another curie) | UMLS:C0014939 | 500 | only 1st template (3531 results) | 2 min 2 s |
| Somatostatin acetate (curie given) | PUBCHEM.COMPOUND:16129681 | 184 | only 1st template | 20 s |
| Somatostatin (better curie) | PUBCHEM.COMPOUND:101826531 | 218 | 183, 16 (merge 10, add 6), 0, 42 (merge 26, add 16), 21 (merge 8, add 13) | 24 s |
| Amitriptyline | PUBCHEM.COMPOUND:2160 | 500 | 172, 216 (merge 31, add 185), 456 (merge 84, add 372), then stopped execution. 729 total, and truncated | 37 s |
| Gabapentin | PUBCHEM.COMPOUND:3446 | 186 | 70, 3 (merge 1, add 2), 107 (merge 15, add 92), 18 (merge 7, add 11), 13 (merge 2, add 11) | 28 s |
| Propranolol | PUBCHEM.COMPOUND:4946 | 500 | 327, 220 (merge 57, add 163), 277 (merge 101, add 176), then stopped execution. 666 total, and truncated | 34 s |
| sumatriptan | PUBCHEM.COMPOUND:5358 | 185 | 38, 4 (merge 1, add 3), 11 (merge 2, add 9), 121 (merge 7, add 114), 37 (merge 16, add 21) | 31 s |
| d4-Palmitic acid (curie given) | PUBCHEM.COMPOUND:12358543 | 0 | no results | 14 s |
| palmitic acid (better curie) | PUBCHEM.COMPOUND:135369651 | 500 | only 1st template (905 results) | 18 s |

starting with gene

| Gene Name | Gene ID | Final results | Template breakdown and notes | Time |
| --- | --- | --- | --- | --- |
| MAPK8IP3 | NCBIGene:23162 | 4 | only 1st template, none scored. Compare to `decreases` response below | 19 s |
| TP53 | NCBIGene:7157 | 500 | only 1st template (1762 results) | 31 s |
| CYP27B1 (a P450) | NCBIGene:1594 | 500 | 48, 0, 0, 770 (merge 1, add 769), then stopped execution. 817 total, and truncated | 26 s |
| XIST | NCBIGene:7503 | 6 | only 1st template | 13 s |
| TSIX | NCBIGene:9383 | 0 | no results | 11 s |
| ADNP | NCBIGene:23394 | 297 | from 1st template (4) and 4th template (293), no merging. None scored | 22 s |
| BORCS8-MEF2B | NCBIGene:4207 | 0 | no results | 12 s |
| MMP9 | NCBIGene:4318 | 500 | only 1st template (760 results) | 20 s |
| MEF2C | NCBIGene:4208 | 500 | 48, 1, 0, 6431 (merge 7, add 6424), then stopped execution. 6473 total, and truncated | 1 min 48 s |
| MALAT1 | NCBIGene:378938 | 4 | only 1st template | 12 s |
| DYRK1A | NCBIGene:1859 | 500 | 14, 0, 0, 7982 (merge 1, add 7981), then stopped execution. 7995 total, and truncated | 2 min 1 s |
| PCSK1N | NCBIGene:27344 | 12 | from 1st template (6) and 4th template (6), no merging | 17 s |
| TTR | NCBIGene:7276 | 500 | 92, 30 (merge 1, add 29), 0, 9259 (merge 31, add 9228), then stopped execution and truncated | 2 min 54 s |

Skipping testing GAPDH (NCBIGene:2597). Previously it ran 3 templates in ~ 1 min. Then the last hop of 4th template gets stuck at its end (ID resolution/record intersecting). I stopped after running > 14 min

Decreased

starting with chem

| Chem Name | Chem ID | Final results | Template breakdown and notes | Time |
| --- | --- | --- | --- | --- |
| Amphetamine | PUBCHEM.COMPOUND:3007 | 500 | 207, 330 (merge 34, add 296), then stopped execution. 503 total and truncated | 25 s |
| Dextroamphetamine | PUBCHEM.COMPOUND:5826 | 237 | 21, 131, 40 (merge 11, add 29), 67 (merge 34, add 33), 31 (merge 8, add 23) | 30 s |
| (+/-)-Methylphenidate hydrochloride (curie given) | PUBCHEM.COMPOUND:44246724 | 0 | no results | 15 s |
| Methylphenidate (better curie) | PUBCHEM.COMPOUND:4158 | 335 | 113, 182 (merge 23, add 159), 42 (merge 9, add 33), 54 (merge 25, add 29), 8 (merge 7, add 1) | 34 s |
| metformin | PUBCHEM.COMPOUND:4091 | 500 | only 1st template (1549 results) | 32 s |
| Atorvastatin | PUBCHEM.COMPOUND:60823 | 500 | only 1st template (788 results) | 18 s |
| Valproic acid glucuronide (curie given) | PUBCHEM.COMPOUND:88111 | 0 | no results | 14 s |
| Valproic acid (better curie) | PUBCHEM.COMPOUND:3121 | 500 | only 1st template (2034 results) | 54 s |
| Vitamin A / retinol | PUBCHEM.COMPOUND:445354 | 500 | 353, 408 (merge 89, add 319), then stopped execution. 672 total, and truncated | 27 s |
| Vitamin C (ascorbic acid) | PUBCHEM.COMPOUND:54670067 | 500 | only 1st template (826 results) | 22 s |
| Vitamin D (Cholecalciferol) | PUBCHEM.COMPOUND:5280795 | 500 | 428, 115 (merge 36, add 79), then stopped execution. 507 total, and truncated | 23 s |
| Maltodextrin (curie given) | PUBCHEM.COMPOUND:79025 | 2 | only from 5th template. None scored | 16 s |
| Glucose (better curie) | PUBCHEM.COMPOUND:24749 | 27 | only 1st template. None scored | 15 s |
| Magnesium ion (curie given) | PUBCHEM.COMPOUND:888 | 17 | only 1st template | 15 s |
| Magnesium (atom, better curie) | PUBCHEM.COMPOUND:5462224 | 497 | 138, 266 (merge 6, add 260), 145 (merge 46, add 99), 0, 0 | 30 s |
| DHEA (Dehydroepiandrosterone) | PUBCHEM.COMPOUND:5881 | 500 | 316, 146 (merge 45, add 101), 122 (merge 54, add 68), 62 (merge 36, add 26), then stopped execution. 511 total, and truncated | 36 s |
| Testosterone | PUBCHEM.COMPOUND:6013 | 500 | only 1st template (846 results) | 24 s |
| Ethinylestradiol (curie given) | PUBCHEM.COMPOUND:5991 | 500 | 217, 2353 (merge 76, add 2277), then stopped execution. 2494 total, and truncated | 1 min 11 s |
| Estrogens (another curie) | UMLS:C0014939 | 500 | only 1st template (3294 results) | 1 min 22 s |
| Somatostatin acetate (curie given) | PUBCHEM.COMPOUND:16129681 | 39 | only 1st template | 16 s |
| Somatostatin (better curie) | PUBCHEM.COMPOUND:101826531 | 276 | 231, 62 (merge 43, add 19), 0, 95 (merge 70, add 25), 21 (merge 20, add 1) | 28 s |
| Amitriptyline | PUBCHEM.COMPOUND:2160 | 500 | 268, 253 (merge 54, add 199), 335 (merge 96, add 239), then stopped execution. 706 total, and truncated | 34 s |
| Gabapentin | PUBCHEM.COMPOUND:3446 | 224 | 131, 3, 60 (merge 11, add 49), 47 (merge 13, add 34), 13 (merge 6, add 7) | 27 s |
| Propranolol | PUBCHEM.COMPOUND:4946 | 500 | only 1st template (618 results) | 18 s |
| sumatriptan | PUBCHEM.COMPOUND:5358 | 234 | 73, 18 (merge 8, add 10), 4 (merge 2, add 2), 166 (merge 27, add 139), 37 (merge 27, add 10) | 31 s |
| d4-Palmitic acid (curie given) | PUBCHEM.COMPOUND:12358543 | 0 | no results | 14 s |
| palmitic acid (better curie) | PUBCHEM.COMPOUND:135369651 | 500 | only 1st template (693 results) | 20 s |

starting with gene

| Gene Name | Gene ID | Final results | Template breakdown and notes | Time |
| --- | --- | --- | --- | --- |
| MAPK8IP3 | NCBIGene:23162 | 500 | 7, 0, 0, 756, then stopped execution. 763 total and truncated | 24 s |
| TP53 | NCBIGene:7157 | 500 | only 1st template (1232 results) | 27 s |
| CYP27B1 (a P450) | NCBIGene:1594 | 500 | 31, 0, 0, 6421 (merge 3, add 6418), then stopped execution. 6449 total, and truncated | 1 min 43 s |
| XIST | NCBIGene:7503 | 7 | only 1st template | 13 s |
| TSIX | NCBIGene:9383 | 0 | no results | 12 s |
| ADNP | NCBIGene:23394 | 364 | from 1st template (3) and 4th template (361), no merging. None scored | 22 s |
| BORCS8-MEF2B | NCBIGene:4207 | 0 | no results | 12 s |
| MMP9 | NCBIGene:4318 | 500 | only 1st template (1404 results) | 27 s |
| MEF2C | NCBIGene:4208 | 500 | 34, 0, 0, 6841 (merge 3, add 6424), then stopped execution. 6872 total, and truncated | 1 min 43 s |
| MALAT1 | NCBIGene:378938 | 9 | only 1st template | 13 s |
| DYRK1A | NCBIGene:1859 | 500 | 49, 0, 0, 4023 (merge 14, add 4009), then stopped execution. 4058 total, and truncated | 1 min 3 s |
| PCSK1N | NCBIGene:27344 | 0 | no results | 1 min 31 s |
| TTR | NCBIGene:7276 | 500 | 84, 7, 16, 4003 (merge 19, add 3984), then stopped execution. 4091 total, and truncated | 1 min 8 s |

Skipping testing GAPDH (NCBIGene:2597). Previously it ran 3 templates in ~ 1 min. Then the last hop of 4th template gets stuck at its end (ID resolution/record intersecting). I stopped after running 15 min
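
The "Template breakdown" notation in these tables (results returned per template, with "merge N, add M") can be replayed with a short Python sketch to check that each row adds up. The function `replay_templates` is hypothetical, and the stopping rule (stop once the running total exceeds the creative maximum of 500, then truncate) is inferred from the rows and logs in this thread rather than taken from the BTE source.

```python
from typing import List, Tuple

CREATIVE_RESULT_MAX = 500  # limit agreed on earlier in this thread


def replay_templates(
    steps: List[Tuple[int, int]], limit: int = CREATIVE_RESULT_MAX
) -> Tuple[int, int, int]:
    """Replay (returned, merged) pairs per template.

    Returns (final results, total before truncation, templates run).
    """
    total = 0
    templates_run = 0
    for returned, merged in steps:
        total += returned - merged  # "add" = returned minus merged
        templates_run += 1
        if total > limit:
            break  # remaining templates are skipped
    return min(total, limit), total, templates_run


# Dextroamphetamine (increased): 36, 104 (merge 8), 60 (merge 4), 54 (merge 36), 31 (merge 7)
print(replay_templates([(36, 0), (104, 8), (60, 4), (54, 36), (31, 7)]))

# Amphetamine (increased): 306, 292 (merge 58), then stopped
print(replay_templates([(306, 0), (292, 58)]))
```

The first row sums to 230 with all five templates run; the second reaches 540 after two templates and is truncated to 500.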

tokebe commented 1 year ago

@colleenXu Regarding Point 1, I think your confusion comes from assuming that the merge log is supposed to be per-template. It isn't. Records are merged per-template, but the only merge log is a summary at the end of all merged results across all templates. Please pull and run again, and you'll see the log has been updated to show the number of results merged, the number they were merged into, and the actual result count decrease. If you add up the results for each template, and then subtract the actual result count decrease, the math checks out (I spent a considerable amount of time verifying this last round...).

We could log per-template as well, but I don't particularly see the need to do so?

Running GAPDH on my local takes 19.9 minutes, which I agree is a little too much. Some of my optimizations may help with this, but I do think there's a case to be made for further decreasing the max records allowed.

tokebe commented 1 year ago

After a meeting with @colleenXu the problem was confirmed. I've investigated the issue and found the reason:

Multiple results from the same template can end up being merged if that template is a multi-hop and the results connect the same subject and object via different intermediate nodes. IIRC, this was an intended behavior to keep results relatively well-organized. Such results would not be merged in non-creative execution (which is another question worth asking somewhere else).

As a side effect of this, merging can show more results merged than what one might expect: instead of the maximum number merged in a step being equal to the smallest of either the current result set or the current template, it is actually the sum of those two.
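
A small Python sketch of this within-template merging, assuming results are keyed only by their subject and object. The function `merge_by_endpoints` and the placeholder gene labels are hypothetical; this is an illustration of the idea, not the query-handler code.

```python
from collections import defaultdict
from typing import Dict, Iterable, Set, Tuple

Path = Tuple[str, str, str]  # (subject, intermediate node, object)


def merge_by_endpoints(paths: Iterable[Path]) -> Dict[Tuple[str, str], Set[str]]:
    """Group two-hop results that share subject and object, whatever the intermediate."""
    merged: Dict[Tuple[str, str], Set[str]] = defaultdict(set)
    for subject, intermediate, obj in paths:
        merged[(subject, obj)].add(intermediate)
    return dict(merged)


# Three results from one multi-hop template; the gene labels are placeholders.
paths = [
    ("PUBCHEM.COMPOUND:5358", "gene:A", "gene:Y"),
    ("PUBCHEM.COMPOUND:5358", "gene:B", "gene:Y"),
    ("PUBCHEM.COMPOUND:5358", "gene:A", "gene:Z"),
]
merged = merge_by_endpoints(paths)
print(len(paths), "results from the template ->", len(merged), "after merging")
```

Because the first two paths collapse into one result before any comparison with earlier templates, a single template can contribute merges on its own, which is how the merged count in a step can exceed what either set alone would suggest.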

@colleenXu I'm working on a fix to change the logging behavior to explicitly point this out when it occurs, and will push that change to this branch (and main) when it's done.

tokebe commented 1 year ago

@colleenXu I've pushed multiple creative mode logging fixes and improvements and the math appears to check out now. Please run a couple tests and let me know if the logging seems better.

colleenXu commented 1 year ago

@tokebe Err...I'm not sure if you missed my update 3 yesterday (the internal Slack thread here).

I think the issue was that I didn't set the is_set: true parameter for intermediate QNodes in the templates. I pushed a commit here and tested, and then the "merging" logs looked reasonable...

colleenXu commented 1 year ago

Feedback:

   2023-01-25T03:02:54.891Z INFO:    [Template-1]: Execution Summary: (906) nodes / (942) edges / (905) results; (3/36) queries returned results from (2) unique APIs
   2023-01-25T03:02:54.891Z INFO:    [Template-1]: APIs: BioThings SEMMEDDB API, Text Mining Targeted Association API
   2023-01-25T03:02:54.895Z INFO:    (0) results from Template-1 were merged with other results from the template. (0) results were merged with existing results from previous templates. Current result count is 905 (+905)
   2023-01-25T03:02:54.895Z INFO:    Addition of 905 results from Template 1 exceeds creative result maximum of 500 (reaching 905 merged). Response will be truncated to top-scoring 500 results. Skipping remaining 4 templates.
   2023-01-25T03:02:54.895Z INFO:    Final result count (before truncation): 905
   2023-01-25T03:02:54.897Z INFO:    Execution Summary: (501) nodes / (537) edges / (500) results; (0/36) queries returned results from (0) unique APIs
   2023-01-25T03:02:54.897Z INFO:    APIs:
   2023-01-25T03:02:54.897Z INFO:    Scoring Summary: (273) scored / (227) unscored

Otherwise, new logs look good!

tokebe commented 1 year ago

I didn’t miss your update. Regardless of is_set behavior, the behavior without is_set looked wrong and needed fixing. I confirmed what was wrong with log clarity for those cases and fixed them.

Fix for the API end summary incoming.

tokebe commented 1 year ago

Pushed the fix; yet another fun case of a change somehow not making it into a commit while silently remaining on my local, making me think I'm losing my mind lol.

colleenXu commented 1 year ago

Sorry for the late reply. I reran a bunch of queries and I think things look good! I like the new logs.

Perhaps we're ready to make a request for ITRB CI?

colleenXu commented 1 year ago

Note that templates were changed, replacing physically_interacts_with predicate with interacts_with (more general). Allows us to use dgidb for those templates (and mychem after its edits https://github.com/biothings/pending.api/issues/101#issuecomment-1418656362)

https://github.com/biothings/bte_trapi_query_graph_handler/pull/135

tokebe commented 1 year ago

Deployed to prod 🚀

colleenXu commented 1 year ago

Noting here just in case:

Old template ideas that weren't implemented (intended effect is "downregulates"):