Generic results for MVP2 queries

cartmanbeck commented 1 year ago

I have run a few queries and the highest-rated results are very generic concepts that will be hard for a user to follow up on. It would be great to find a way to filter out these kinds of results.

One example: "What chemical upregulates FREM1 (Human)?"
PK: fa3435ec-fb9b-49a8-93f5-81a4561699c2
https://ui.transltr.io/results?l=FREM1&t=1&q=fa3435ec-fb9b-49a8-93f5-81a4561699c2
Problem results: "Promoter regions, genetic", "Consensus sequence"
Note: The 3rd result, Chloride Ion, is excellent. :)

dnsmith124 commented 1 year ago

While the functionality that is being requested is a great idea (add ability to filter out overly generic results), the UI has no way that I'm aware of to determine the genericness of any particular result programmatically.

Is there an attribute that conveys this in the data currently?

andrewsu commented 1 year ago

Just to add some info from https://arax.ci.transltr.io/?r=fa3435ec-fb9b-49a8-93f5-81a4561699c2

Both Aragorn and ARAX return "Promoter Regions, Genetic" as a result based on SemMedDB ingested through RTX-KG2. (I'm pretty sure it doesn't come back from the BioThings SemMedDB API because we use the novelty filter, which I think does a reasonable job of removing very generic concepts.)

I also think something weird is going on with score normalization. ARAX reports a score of 1.0, and aragorn reports 0.409. UI normalizes/combines that into a score of 100, which would not be my expectation personally...

ShervinAbd92 commented 1 year ago

@andrewsu I am trying to look into the score normalization for your PK. Normally I can curl an individual ARA's response and look into its results and scores, but for this PK, the provided ARAX and Aragorn PKs are unknown. Was this submitted on the CI environment?

andrewsu commented 1 year ago

I was able to retrieve them using https://arax.ci.transltr.io/api/arax/v1.4/response/a9eb462e-e7b8-440b-ae91-083ee71c7b24 (ARAX) and https://arax.ci.transltr.io/api/arax/v1.4/response/119f95b8-de9f-4809-9cac-ea9cb52f5946 (aragorn).

edeutsch commented 1 year ago

Side note: when you request PKs from ARAX, it fetches them from the ARSes (it tries all of them!) and then also puts the document through some validation (although that is still turned off) and some basic stats calculation, i.e. this:

  "validation_result": {
    "message": "Validation disabled. too many dependency failures",
    "n_edges": 8,
    "n_nodes": 5,
    "provenance_summary": {
      "n_sources": 0,
      "predicate_counts": {
        "biolink:affects": 1,
        "biolink:occurs_together_in_literature_with": 4,
        "biolink:regulates": 3
      },
      "provenance_counts": {
        "biolink:affects --> no provenance": [
          "biolink:affects",
          "-",
          "no provenance",
          1
        ],
        "biolink:occurs_together_in_literature_with --> no provenance": [
          "biolink:occurs_together_in_literature_with",
          "-",
          "no provenance",
          4
        ],
        "biolink:regulates --> no provenance": [
          "biolink:regulates",
          "-",
          "no provenance",
          3
        ]
      }
    },
    "size": "213 kB",
    "status": "PASS",
    "validation_messages": {
      "errors": [],
      "information": [
        "Validation has been temporarily disabled due to problems with dependencies. Will return again soon."
      ],
      "warnings": []
    }
  }

is injected into the TRAPI by the ARAX system. You would not see that if you requested the PK directly from the ARS.
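For anyone who wants to reproduce the predicate_counts part of that summary on a response fetched directly from the ARS, here is a minimal sketch. It is not the ARAX code; it only assumes the standard TRAPI layout in which knowledge_graph.edges is a mapping of edge id to an edge object with a "predicate" field. The function name is hypothetical.

```python
from collections import Counter
from typing import Any


def predicate_counts(message: dict[str, Any]) -> dict[str, int]:
    """Count knowledge-graph edges per predicate in a TRAPI message (illustrative only)."""
    # knowledge_graph or edges may be missing or null in an empty response.
    edges = (message.get("knowledge_graph") or {}).get("edges") or {}
    return dict(Counter(edge.get("predicate", "unknown") for edge in edges.values()))
```

Summing the returned counts should give the same number as n_edges.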

ShervinAbd92 commented 1 year ago

@andrewsu I am trying to look into the result structure that the ARS takes in to generate the normalized score. The links that you sent here don't seem to be TRAPI 1.4.0 compliant (for example, score is not under analyses), but I do see the normalized score being calculated, so I am not sure how that happened. Also, if we look for Promoter Regions in both the ARAX and Aragorn results, I can't find it in Aragorn's results!
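For context on the two layouts: TRAPI 1.4 puts the score on each entry of a result's analyses list, while earlier TRAPI versions put score directly on the result. A minimal sketch of a reader that tolerates both is below. The function name is hypothetical, and this is not how the ARS itself reads scores.

```python
from typing import Any, Optional


def result_score(result: dict[str, Any]) -> Optional[float]:
    """Return a score for one TRAPI result, whichever layout it uses (illustrative only)."""
    # TRAPI 1.4 layout: result["analyses"][i]["score"]
    analysis_scores = [
        analysis["score"]
        for analysis in result.get("analyses") or []
        if analysis.get("score") is not None
    ]
    if analysis_scores:
        # Taking the maximum across analyses is an assumption made for this sketch.
        return max(analysis_scores)
    # Pre-1.4 layout: result["score"]; returns None when no score is present at all.
    return result.get("score")
```

Running this over message.results of each ARA response makes it easy to see which layout a given response actually uses.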

sierra-moxon commented 1 year ago

From TAQA: should we remove generic categories? For drugs, the "role" terms (aka upper classes) are really important, but there is a point where we can go too high (e.g. "Drug", "Small Molecule"). Enrichment might get at that.

MarkDWilliams commented 1 year ago

@ShervinAbd92 and @andrewsu I've taken a look at the results from the ARS end. The current ARS score normalization (which is/was intended to be a temporary measure and will be replaced) only looks at the results from a given ARA, for a given query. So, the difference in scores between ARAX vs Aragorn isn't taken into account. In the ARS, the two results have normalized scores of 100 and 99.6 because they are the highest and 4th highest of 500.

I agree that this scoring is not great. The g() and f() score functions should replace the current ARS normalized scores when they are ready, but ultimately (in my opinion) this issue should be solved by eliminating overly generic nodes rather than by changing how results containing them are ranked. It's not that we think it is untrue that genetic promoter regions regulate FREM1; it's that, true or not, that assertion isn't helpful or interesting.
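To illustrate why a raw 1.0 from one ARA and a raw 0.409 from another can both end up near 100, here is a minimal sketch of a rank-only normalization applied separately to each ARA's result list. This is an assumed percentile formula for illustration, not the actual ARS code, and it is not meant to reproduce the exact 99.6 figure above.

```python
from bisect import bisect_right


def rank_normalize(raw_scores: list[float]) -> list[float]:
    """Map each raw score to a 0-100 percentile using only its rank within this list."""
    n = len(raw_scores)
    if n == 0:
        return []
    ordered = sorted(raw_scores)
    # Percentile = share of results in this list scoring at or below the given score.
    return [100.0 * bisect_right(ordered, score) / n for score in raw_scores]


# Hypothetical raw scores from two ARAs for the same query.
ara_a = rank_normalize([1.0, 0.9, 0.8])
ara_b = rank_normalize([0.409, 0.3, 0.1])
```

Because each list is normalized on its own, the top result of each ARA lands at 100 regardless of how far apart the raw scores were.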

sierra-moxon commented 1 year ago

@MarkDWilliams - from your last analysis, this looks like a post-September item, is that right?

MarkDWilliams commented 7 months ago

Apologies for the very late reply. We currently have a blocklist implemented for overly generic results at the ARS level, and we can add "Promoter regions, genetic" and "Consensus sequence" to it.
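As a rough picture of what such a filter does, here is a minimal sketch. The result shape (a dict with a "name" key) and the name-based matching are assumptions made for illustration; the real ARS blocklist may well be keyed on identifiers rather than names.

```python
# Hypothetical blocklist entries, stored lowercase for case-insensitive matching.
BLOCKLIST = {"promoter regions, genetic", "consensus sequence"}


def drop_generic(results: list[dict], blocklist: set[str] = BLOCKLIST) -> list[dict]:
    """Return only the results whose name is not on the blocklist (illustrative only)."""
    return [
        result
        for result in results
        if result.get("name", "").strip().lower() not in blocklist
    ]


results = [
    {"name": "Promoter Regions, Genetic"},
    {"name": "Consensus sequence"},
    {"name": "Chloride Ion"},
]
kept = drop_generic(results)
```

With the three results from the original report, only Chloride Ion survives the filter.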

ShervinAbd92 commented 7 months ago

I have also attempted to send in the query, and even Aragorn and ARAX are returning zero results. But in any case, as Mark mentioned above, we have added those two to our blocklist.

NCATSTranslator / Feedback

Generic results for MVP2 queries #191