biothings / biothings_explorer

TRAPI service for BioThings Explorer
https://api.bte.ncats.io
Apache License 2.0

tune the use of AEOLUS indications from mychem.info #727

Closed andrewsu closed 3 weeks ago

andrewsu commented 9 months ago

AEOLUS is a standardized version of the US Food and Drug Administration (FDA) Adverse Event Reporting System (FAERS) data. According to https://www.fda.gov/drugs/surveillance/questions-and-answers-fdas-adverse-event-reporting-system-faers:

The FDA Adverse Event Reporting System (FAERS) is a database that contains adverse event reports, medication error reports and product quality complaints resulting in adverse events that were submitted to FDA.

So essentially it's a community-contributed database that has lots of good stuff, but it also has lots of junk. For example, here is the record for Escitalopram, a medication used to manage and treat major depressive and generalized anxiety disorders: https://mychem.info/v1/chem/WSEQXVZVJXJVFP-FQEVSTJZSA-N?fields=aeolus. Among the listed "indications" are

  "indications": [
    {
      "count": 765,
      "id": "36918942",
      "meddra_code": "10012378",
      "name": "Depression"
    },
    {
      "count": 219,
      "id": "36918858",
      "meddra_code": "10002855",
      "name": "Anxiety"
    },
    {
      "count": 106,
      "id": "42890454",
      "meddra_code": "10070592",
      "name": "Product used for unknown indication"
    },
    {
      "count": 71,
      "id": "36918945",
      "meddra_code": "10057840",
      "name": "Major depression"
    },
    {
      "count": 33,
      "id": "36918855",
      "meddra_code": "10018075",
      "name": "Generalised anxiety disorder"
    },
    ...
  ]

These generally look good, but lower down, we see this:

    {
      "count": 1,
      "id": "35205038",
      "meddra_code": "10013968",
      "name": "Dyspnoea"
    },
    {
      "count": 1,
      "id": "35306119",
      "meddra_code": "10036476",
      "name": "Prader-Willi syndrome"
    },
    {
      "count": 1,
      "id": "35406391",
      "meddra_code": "10043882",
      "name": "Tinnitus"
    },
    {
      "count": 1,
      "id": "35707962",
      "meddra_code": "10069049",
      "name": "Gastrointestinal viral infection"
    },
    {
      "count": 1,
      "id": "35708108",
      "meddra_code": "10021518",
      "name": "Impaired gastric emptying"
    }

These are probably extreme off-label uses at best, and data errors at worst.

Given that we have indications from multiple other sources through mychem.info (like ChEMBL and DrugCentral), we could probably remove these edges from the SmartAPI annotations without much loss in content to BTE. Alternatively, we could figure out an appropriate threshold on the count field (using a similar strategy to what we did in https://github.com/NCATSTranslator/Feedback/issues/100). Eventually, this should also be assigned a relatively weak knowledge_level (https://github.com/biothings/biothings_explorer/issues/715) so our scoring can account for it appropriately...
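A count threshold of this kind could be sketched as follows. This is only an illustration, assuming the `indications` list has the shape shown in the Escitalopram record; the function name `filter_indications` and the cutoff passed in the example are placeholders, not something BTE or mychem.info defines.

```python
from typing import Any


def filter_indications(
    indications: list[dict[str, Any]], min_count: int
) -> list[dict[str, Any]]:
    """Keep only indications reported strictly more than `min_count` times."""
    return [ind for ind in indications if ind.get("count", 0) > min_count]


# A few entries copied (abridged) from the Escitalopram record above.
escitalopram = [
    {"count": 765, "meddra_code": "10012378", "name": "Depression"},
    {"count": 33, "meddra_code": "10018075", "name": "Generalised anxiety disorder"},
    {"count": 1, "meddra_code": "10043882", "name": "Tinnitus"},
]

# The cutoff here is a placeholder value chosen for the example.
kept = filter_indications(escitalopram, min_count=20)
print([ind["name"] for ind in kept])
```

With this cutoff the single-report Tinnitus entry is dropped while the well-supported indications survive.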

mbrush commented 8 months ago

Thanks for posting this Andrew - a closer look at AEOLUS has been on my list for a while.

From a quick review of their Nature Scientific Data paper, and looking at example records of AEOLUS data in mychem - I concluded that the 'indications' AEOLUS reports are based on FAERS self-reporting data, and reflect what the patient reporting the adverse event said they took the drug for, when reporting the adverse events they experienced. @andrewsu do you agree with this assessment?

If true, I would agree that AEOLUS is not the best source of 'treats' statements - given the existence of other more reliable sources you mention for this type of knowledge.

That said, it could be an interesting source of potential novel off-label usages of drugs - in cases where we see many patients self-reporting taking a drug for a particular non-indicated disease - so it may be worth keeping in Translator.

The key will be to clearly advertise the dubious nature of these claims, to ensure end users and reasoning/scoring tools are appropriately cautious when using this information. As you suggest, knowledge level/agent type tags will play a big role here - as may other 'at-a-glance' EPC properties we have proposed such as 'evidence type'. I think these types of statements would fall into the observation knowledge level bucket.

Finally, note that we have previously documented the AEOLUS use case as an example of how knowledge level and other EPC / AAG properties would work together to represent this information under the refactored approach to modeling treats relationships. It is worth a look at the proposal in the screenshot below (and source document here) to see how we might ultimately choose to handle a source like AEOLUS.

image

andrewsu commented 8 months ago

super @mbrush, I think we are on the same page. And yes, we will definitely follow whatever is specified in the EPC modeling document you linked. Perhaps a suggestion on that... The Ranibizumab - treats - AMD example is helpful (1955 reports in AEOLUS), but just so people don't get tempted to over-trust AEOLUS, it might be useful to also add a poor AEOLUS "prediction" to that doc as well. Many examples to choose from in https://mychem.info/v1/query?q=ranibizumab&fields=aeolus.indications: Ranibizumab - treats - Thrombosis (9 reports) or Ranibizumab - treats - Type 2 diabetes mellitus (1 report) and Ranibizumab - treats - Phlebotomy (1 report)...

And now that we are out of code freeze, I do think we should implement a (hopefully) quick-to-implement stop-gap measure on CI/TEST. @colleenXu can you adjust the aeolus query to include a filter like this? https://mychem.info/v1/query?q=ranibizumab&fields=aeolus.indications&jmespath=aeolus.indications|[?count>`20`]
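Because the jmespath expression contains characters such as `|`, `?`, `>` and backticks, it needs percent-encoding when placed in a URL. Here is a minimal sketch of building that query string with the Python standard library; the helper name `build_query_url` is hypothetical, and nothing is sent over the network.

```python
from urllib.parse import parse_qs, urlencode, urlsplit

BASE_URL = "https://mychem.info/v1/query"


def build_query_url(drug: str, min_count: int) -> str:
    """Build a MyChem query URL that filters aeolus.indications by count."""
    params = {
        "q": drug,
        "fields": "aeolus.indications",
        # JMESPath number literals are written between backticks.
        "jmespath": f"aeolus.indications|[?count>`{min_count}`]",
    }
    return f"{BASE_URL}?{urlencode(params)}"


url = build_query_url("ranibizumab", 20)
print(url)

# Decoding the query string recovers the original expression.
print(parse_qs(urlsplit(url).query)["jmespath"][0])
```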

colleenXu commented 8 months ago

@andrewsu to confirm, you'd like the limit to be > 20?

andrewsu commented 8 months ago

yes, absent evidence to more confidently set that threshold, I think 20 will considerably improve the precision while not substantially degrading recall...

colleenXu commented 8 months ago

@andrewsu

I'm having trouble figuring out the reverse-operation "aeolus MEDDRA disease ID -(treated_by)-> chem". This matters because it's what BTE actually uses in creative-mode "treats", since creative-mode's starting ID is the disease.


@newgene Here are the details. Can you help?

(But I'm not sure if we can solve this. This is similar to a prior discussion on list_filter. Then, we decided that it wasn't really viable: one could do list_filter + JQ OR batch-query starting IDs, but not both)

This is the intended behavior

I want to take a query like this, and only keep the hits (the aeolus field?) when the nested object in aeolus.indication meets the criteria: (1) meddra_code is one of the 3 listed (but it can be up to 1000 IDs in a batch), and (2) the count > 20.

```
curl --location 'https://mychem.info/v1/query?size=1000&fields=aeolus.indications%2Caeolus.unii' \
--header 'Content-Type: application/json' \
--data '{
    "q": ["10018304", "10058990", "10038867"],
    "scopes": "aeolus.indications.meddra_code"
}'
```

For example, this hit for `10018304` (chemical is unii:F0P408N6V4) doesn't meet the criteria because the specific nested object with `10018304` has a count less than 20. So I'd like to remove this hit completely from the response (or at least the entire aeolus field for this hit).

```
{
    "query": "10018304",
    "_id": "F0P408N6V4",
    "_score": 7.2257814,
    "aeolus": {
        "_license": "http://bit.ly/2DIxWwF",
        "indications": [
            { "count": 19893, "id": "43053715", "meddra_code": "10035226", "name": "Plasma cell myeloma" },
            ...
            { "count": 1, "id": "35606985", "meddra_code": "10018304", "name": "Glaucoma" },
            ...
        ],
        "unii": "F0P408N6V4"
    }
},
```
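The intended behavior can be expressed as a client-side check. This sketch is illustrative only: `hit_meets_criteria` is a hypothetical helper, and the two hits are abridged from records quoted in this thread.

```python
from typing import Any


def hit_meets_criteria(hit: dict[str, Any], min_count: int = 20) -> bool:
    """True if the indication matching the queried MedDRA code has count > min_count."""
    queried_code = hit["query"]
    indications = hit.get("aeolus", {}).get("indications", [])
    return any(
        ind["meddra_code"] == queried_code and ind["count"] > min_count
        for ind in indications
    )


# Abridged hits: only the indications relevant to the example are kept.
hits = [
    {"query": "10018304", "_id": "F0P408N6V4", "aeolus": {"indications": [
        {"count": 19893, "meddra_code": "10035226", "name": "Plasma cell myeloma"},
        {"count": 1, "meddra_code": "10018304", "name": "Glaucoma"},
    ]}},
    {"query": "10018304", "_id": "WSNODXPBBALQOF-VEJSHDCNSA-N", "aeolus": {"indications": [
        {"count": 157, "meddra_code": "10018304", "name": "Glaucoma"},
    ]}},
]

kept_ids = [h["_id"] for h in hits if hit_meets_criteria(h)]
print(kept_ids)
```

The F0P408N6V4 hit is rejected even though it has a high-count indication, because that indication is not the one that matched the queried code.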

What I tried, and how I know it isn't doing what I intend

First, I tried setting jmespath to ```aeolus.indications|[?count>`20`]```

So the query would be:

```
curl --location --globoff 'https://mychem.info/v1/query?size=1000&fields=aeolus.indications%2Caeolus.unii&jmespath=aeolus.indications%7C[%3Fcount%3E%6020%60]' \
--header 'Content-Type: application/json' \
--data '{
    "q": ["10018304", "10058990", "10038867"],
    "scopes": "aeolus.indications.meddra_code"
}'
```

But the example unii:F0P408N6V4 is still in the hits, even though its nested object that matched `10018304` is missing (it was filtered out because its count was less than 20).

click to see the unii:F0P408N6V4 hit

```
{
    "query": "10018304",
    "_id": "F0P408N6V4",
    "_score": 7.2257814,
    "aeolus": {
        "_license": "http://bit.ly/2DIxWwF",
        "indications": [
            { "count": 19893, "id": "43053715", "meddra_code": "10035226", "name": "Plasma cell myeloma" },
            { "count": 2306, "id": "35104397", "meddra_code": "10028533", "name": "Myelodysplastic syndrome" },
            { "count": 1123, "id": "35104667", "meddra_code": "10028228", "name": "Multiple myeloma" },
            { "count": 425, "id": "35104364", "meddra_code": "10008958", "name": "Chronic lymphocytic leukaemia" },
            { "count": 364, "id": "35104461", "meddra_code": "10025310", "name": "Lymphoma" },
            { "count": 348, "id": "42890454", "meddra_code": "10070592", "name": "Product used for unknown indication" },
            { "count": 201, "id": "35104394", "meddra_code": "10068532", "name": "5q minus syndrome" },
            { "count": 201, "id": "35104532", "meddra_code": "10061275", "name": "Mantle cell lymphoma" },
            { "count": 196, "id": "35104351", "meddra_code": "10000880", "name": "Acute myeloid leukaemia" },
            { "count": 186, "id": "36009859", "meddra_code": "10002022", "name": "Amyloidosis" },
            { "count": 146, "id": "35104490", "meddra_code": "10012818", "name": "Diffuse large B-cell lymphoma" },
            { "count": 142, "id": "35104252", "meddra_code": "10028537", "name": "Myelofibrosis" },
            { "count": 138, "id": "35104643", "meddra_code": "10029547", "name": "Non-Hodgkin's lymphoma" },
            { "count": 130, "id": "35104465", "meddra_code": "10003899", "name": "B-cell lymphoma" },
            { "count": 86, "id": "35124300", "meddra_code": "10068361", "name": "MDS" },
            { "count": 58, "id": "35125677", "meddra_code": "10028233", "name": "Multiple myeloma without mention of remission" },
            { "count": 56, "id": "43053717", "meddra_code": "10073133", "name": "Plasma cell myeloma recurrent" },
            { "count": 47, "id": "35104405", "meddra_code": "10020206", "name": "Hodgkin's disease" },
            { "count": 45, "id": "37522153", "meddra_code": "10057097", "name": "Drug use for unknown indication" },
            { "count": 38, "id": "43053713", "meddra_code": "10035222", "name": "Plasma cell leukaemia" },
            { "count": 34, "id": "35125678", "meddra_code": "10028566", "name": "Myeloma" },
            { "count": 33, "id": "35104669", "meddra_code": "10035484", "name": "Plasmacytoma" },
            { "count": 29, "id": "35124041", "meddra_code": "10009310", "name": "CLL" },
            { "count": 27, "id": "36617702", "meddra_code": "10060862", "name": "Prostate cancer" },
            { "count": 27, "id": "42888924", "meddra_code": "10060880", "name": "Monoclonal gammopathy" },
            { "count": 26, "id": "35104567", "meddra_code": "10047801", "name": "Waldenstrom's macroglobulinaemia" },
            { "count": 25, "id": "35104382", "meddra_code": "10025270", "name": "Lymphocytic leukaemia" },
            { "count": 23, "id": "35123953", "meddra_code": "10000886", "name": "Acute myeloid leukemia" }
        ],
        "unii": "F0P408N6V4"
    }
},
```

Trying the following didn't work either:

* ```aeolus|[?indications.count>`20`]``` : then all the hits had `aeolus: null` which is incorrect since I know some hits met the criteria (like unii:1O6WQ6T7G3 for `10018304`)
* ```.|[?aeolus.indications.count>`20`]``` : then it seemed like the jmespath statement did nothing (no nested objects filtered out)

colleenXu commented 8 months ago

Updates:

@andrewsu

I've implemented jmespath: aeolus.indications|[?count>`20`] for the aeolus-treats operation (chemical X -(treats)-> disease).

However, the reverse operation may be more important (as I said in the previous post). And while I'm making some progress (see below), I'm still not able to implement the count constraint for the reverse operation.

Query for testing: Escitalopram

Based on Andrew's [first post on this issue](https://github.com/biothings/biothings_explorer/issues/727#issue-1898800712)

```
{
    "message": {
        "query_graph": {
            "nodes": {
                "n0": {
                    "ids":["UNII:4O4S742ANY"],
                    "categories":["biolink:SmallMolecule"]
                },
                "n1": {
                    "categories":["biolink:Disease"]
                }
            },
            "edges": {
                "e1": {
                    "subject": "n0",
                    "object": "n1",
                    "predicates": ["biolink:treats"]
                }
            }
        }
    }
}
```

Got 110 results before, should now get 29. The low-count hits like Tinnitus (meddra code 10043882) should no longer be in the result set.

Query for testing: Ranibizumab

Based on Andrew's [post above](https://github.com/biothings/biothings_explorer/issues/727#issuecomment-1776153609)

```
{
    "message": {
        "query_graph": {
            "nodes": {
                "n0": {
                    "ids":["UNII:ZL1R02VT79"],
                    "categories":["biolink:SmallMolecule"]
                },
                "n1": {
                    "categories":["biolink:Disease"]
                }
            },
            "edges": {
                "e1": {
                    "subject": "n0",
                    "object": "n1",
                    "predicates": ["biolink:treats"]
                }
            }
        }
    }
}
```

Got 120 results before, should now get 41. The low-count hits like thrombosis (meddra code 10043607) should no longer be in the result set.


@newgene

I still need your help, but I think I've made some progress:

click to see what I have

Setting jmespath to ```aeolus.indications|[?(count>`20`) && (meddra_code=='10018304' ||meddra_code=='10038867')]``` (using https://github.com/biothings/biothings.api/commit/31898fac7cd86b5c05520622885a3c0852f2494c as reference)

The MyChem query is:

```
curl --location --globoff 'https://mychem.info/v1/query?size=1000&fields=aeolus.indications%2Caeolus.unii&jmespath=aeolus.indications%7C[%3F(count%3E%6020%60)%20%26%26%20(meddra_code%3D%3D%2710018304%27%20%7C%7Cmeddra_code%3D%3D%2710038867%27)]' \
--header 'Content-Type: application/json' \
--data '{
    "q": ["10018304", "10038867"],
    "scopes": "aeolus.indications.meddra_code"
}'
```

Then the response looks like this for hits that fulfill the criteria:

```
{
    "query": "10018304",
    "_id": "WSNODXPBBALQOF-VEJSHDCNSA-N",
    "_score": 7.2257814,
    "aeolus": {
        "_license": "http://bit.ly/2DIxWwF",
        "indications": [
            { "count": 157, "id": "35606985", "meddra_code": "10018304", "name": "Glaucoma" }
        ],
        "unii": "1O6WQ6T7G3"
    }
},
{
    "query": "10038867",
    "_id": "1RXS4UE564",
    "_score": 8.809106,
    "aeolus": {
        "_license": "http://bit.ly/2DIxWwF",
        "indications": [
            { "count": 26, "id": "35607414", "meddra_code": "10038867", "name": "Retinal haemorrhage" }
        ],
        "unii": "1RXS4UE564"
    }
},
```

And like this for elements that don't fit the criteria (including the same F0P408N6V4 chemical I had in the last post):

```
{
    "query": "10018304",
    "_id": "F0P408N6V4",
    "_score": 7.2257814,
    "aeolus": {
        "_license": "http://bit.ly/2DIxWwF",
        "indications": [],
        "unii": "F0P408N6V4"
    }
},
{
    "query": "10038867",
    "_id": "2S9ZZM9Q9V",
    "_score": 9.657343,
    "aeolus": {
        "_license": "http://bit.ly/2DIxWwF",
        "indications": [],
        "unii": "2S9ZZM9Q9V"
    }
},
```

Notes for myself on generating queries like this with x-bte/BTE

* I think doing this as non-batch is easier:
  * To add the input IDs: `{{ queryInputs }}` can be used in parameters (think external apis like [biolink/monarch](https://github.com/NCATS-Tangerine/translator-api-registry/blob/ac076b85b21415f1fda4fdcdbd6aa1c487e27e81/biolink/openapi.yml#L947))
  * May involve some `wrap`, playing around with quotation marks and escaping `\` to get the single-quotes
* I'm less sure about being able to generate the batch-queries properly...even though batch-queries are theoretically possible (my example uses 2 meddra_code values)
  * how many unique values can this BioThings feature handle?
  * can I figure out how to get the multiple IDs formatted correctly? (`wrap` to generate a string, setting the delimiter to `||`...)
  * batch-size-limit: caused by the url-character limit
    * and [this](https://github.com/biothings/bte_trapi_query_graph_handler/blob/2447a5a3bd4aa2b09b0ff503751b5447f6aee216/src/config.ts#L1)'ll be set for the whole-api, unless we implement something for individual operations (which may be a bit complicated by the deployment situation?)
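To make the batch-formatting question concrete, here is a rough sketch of generating the jmespath expression for several MedDRA codes and checking how long the encoded value gets. It is plain Python for illustration, not x-bte templating; `build_jmespath` is a hypothetical helper, and the spacing around `||` is normalized compared with the hand-written expression above.

```python
from urllib.parse import quote, unquote


def build_jmespath(meddra_codes: list[str], min_count: int = 20) -> str:
    """Join one equality test per MedDRA code with ' || ' inside the filter."""
    code_clause = " || ".join(f"meddra_code=='{code}'" for code in meddra_codes)
    return f"aeolus.indications|[?(count>`{min_count}`) && ({code_clause})]"


expr = build_jmespath(["10018304", "10038867"])
encoded = quote(expr, safe="")  # percent-encode everything that is not unreserved

print(expr)
print(len(encoded), "characters once URL-encoded")

# Each extra ID adds a fixed number of encoded characters, which is why the
# URL length limit caps the batch size.
longer = quote(build_jmespath(["10018304", "10038867", "10058990"]), safe="")
print(len(longer) - len(encoded), "additional characters for one more ID")
```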

newgene commented 8 months ago

@colleenXu jmespath does not add or remove hits; it only transforms hits according to some criteria. If you want to change which hits are returned, you should modify your query. In your case above, you can include aeolus.indications.count:>20 in your query; then every hit should contain at least one count>20 item in its indications array. This should serve the purpose if I understand correctly.
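The distinction between transforming and removing hits can be illustrated with a small emulation. This is a sketch in plain Python, not the BioThings implementation; the hit IDs `A` and `B` are made up, and `apply_indication_filter` is a hypothetical stand-in for the per-hit transform.

```python
import copy


def apply_indication_filter(hit: dict, min_count: int = 20) -> dict:
    """Emulate a per-hit transform: filter the nested array, never drop the hit."""
    transformed = copy.deepcopy(hit)
    aeolus = transformed.get("aeolus", {})
    aeolus["indications"] = [
        ind for ind in aeolus.get("indications", []) if ind["count"] > min_count
    ]
    return transformed


hits = [
    {"_id": "A", "aeolus": {"indications": [{"count": 1, "meddra_code": "10018304"}]}},
    {"_id": "B", "aeolus": {"indications": [{"count": 157, "meddra_code": "10018304"}]}},
]

transformed_hits = [apply_indication_filter(h) for h in hits]

# Same number of hits in and out; hit "A" just ends up with an empty array.
print(len(hits), len(transformed_hits))
print(transformed_hits[0]["aeolus"]["indications"])
```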

colleenXu commented 8 months ago

@newgene I tried adding this two ways: using a "no-scopes" query and post_filter. Neither seemed to work: the responses were basically the same as before.

The responses are basically the same as above

"no-scopes" query and response

```
curl --location --globoff 'https://mychem.info/v1/query?size=1000&fields=aeolus.indications%2Caeolus.unii&jmespath=aeolus.indications%7C[%3F(count%3E%6020%60)%20%26%26%20(meddra_code%3D%3D%2710018304%27%20%7C%7Cmeddra_code%3D%3D%2710038867%27)]' \
--header 'Content-Type: application/json' \
--data '{
    "q": [
        "aeolus.indications.meddra_code:10018304 AND aeolus.indications.count:>20",
        "aeolus.indications.meddra_code:10038867 AND aeolus.indications.count:>20"
    ],
    "scopes": []
}'
```

Response still has the hits that don't meet the criteria:

```
{
    "query": "aeolus.indications.meddra_code:10018304 AND aeolus.indications.count:>20",
    "_id": "F0P408N6V4",
    "_score": 8.225781,
    "aeolus": {
        "_license": "http://bit.ly/2DIxWwF",
        "indications": [],
        "unii": "F0P408N6V4"
    }
},
{
    "query": "aeolus.indications.meddra_code:10018304 AND aeolus.indications.count:>20",
    "_id": "2S9ZZM9Q9V",
    "_score": 7.137364,
    "aeolus": {
        "_license": "http://bit.ly/2DIxWwF",
        "indications": [],
        "unii": "2S9ZZM9Q9V"
    }
},
```

post-filter

Added post_filter parameter, set to `aeolus.indications.count:>20`

```
curl --location --globoff 'https://mychem.info/v1/query?size=1000&fields=aeolus.indications%2Caeolus.unii&post_filter=aeolus.indications.count%3A%3E20&jmespath=aeolus.indications%7C[%3F(count%3E%6020%60)%20%26%26%20(meddra_code%3D%3D%2710018304%27%20%7C%7Cmeddra_code%3D%3D%2710038867%27)]' \
--header 'Content-Type: application/json' \
--data '{
    "q": ["10018304", "10038867"],
    "scopes": "aeolus.indications.meddra_code"
}'
```

Response still has the hits that don't meet the criteria:

```
{
    "query": "10018304",
    "_id": "F0P408N6V4",
    "_score": 7.2257814,
    "aeolus": {
        "_license": "http://bit.ly/2DIxWwF",
        "indications": [],
        "unii": "F0P408N6V4"
    }
},
{
    "query": "10018304",
    "_id": "2S9ZZM9Q9V",
    "_score": 6.137364,
    "aeolus": {
        "_license": "http://bit.ly/2DIxWwF",
        "indications": [],
        "unii": "2S9ZZM9Q9V"
    }
},
```

newgene commented 8 months ago

@colleenXu you have additional filter criteria in jmespath, as in jmespath=aeolus.indications|[?(count>`20`) && (meddra_code=='10018304' ||meddra_code=='10038867')], so if indications comes back empty, it's due to these criteria, not the count:>20 condition, which you have already filtered on in the query.

colleenXu commented 8 months ago

@newgene

Okay....but I still can't figure out: if the hit's aeolus.indications is empty, how to remove the aeolus.unii field or remove the hit...

(ref: this earlier post)

colleenXu commented 8 months ago

(CC @newgene)

This is the info from our conversation:

We tried setting the q field to be identical to the jmespath info, but it seemed to result in the same behavior as the previous tries.


So the jmespath parameter is: ```aeolus.indications|[?(count>`20`) && (meddra_code==`10018304`||meddra_code==`10038867`)] ```

And we set the request body to something very similar:

```
{
    "q": [
        "aeolus.indications.count:>20 AND (aeolus.indications.meddra_code:10018304 OR aeolus.indications.meddra_code:10038867)"
    ],
    "scopes": []
}
```

so the full query was:

```
curl --location --globoff 'https://mychem.info/v1/query?size=1000&fields=aeolus.indications%2Caeolus.unii&jmespath=aeolus.indications%7C[%3F(count%3E%6020%60)%20%26%26%20(meddra_code%3D%3D%6010018304%60%7C%7Cmeddra_code%3D%3D%6010038867%60)]%20' \
--header 'Content-Type: application/json' \
--data '{
    "q": [
        "aeolus.indications.count:>20 AND (aeolus.indications.meddra_code:10018304 OR aeolus.indications.meddra_code:10038867)"
    ],
    "scopes": []
}'
```

And the responses have the same issue:

```
{
    "query": "aeolus.indications.count:>20 AND (aeolus.indications.meddra_code:10018304 OR aeolus.indications.meddra_code:10038867)",
    "_id": "F0P408N6V4",
    "_score": 8.225781,
    "aeolus": {
        "_license": "http://bit.ly/2DIxWwF",
        "indications": [],
        "unii": "F0P408N6V4"
    }
},
```
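One plausible reading of why F0P408N6V4 keeps matching is that the two query conditions are each satisfied by a different element of the indications array, whereas the jmespath filter requires a single element to satisfy both. That is an assumption inferred from the responses in this thread (and from the remark that hits should contain "at least one count>20 item"), not something confirmed against the mychem.info index mapping. The sketch below contrasts the two semantics using abridged data from the F0P408N6V4 record; both function names are hypothetical.

```python
def matches_document_level(indications: list[dict], code: str, min_count: int = 20) -> bool:
    """Assumed query semantics: each condition may be met by a different element."""
    has_code = any(ind["meddra_code"] == code for ind in indications)
    has_high_count = any(ind["count"] > min_count for ind in indications)
    return has_code and has_high_count


def matches_same_element(indications: list[dict], code: str, min_count: int = 20) -> bool:
    """Filter semantics: one element must satisfy both conditions."""
    return any(
        ind["meddra_code"] == code and ind["count"] > min_count
        for ind in indications
    )


# Abridged from the F0P408N6V4 record quoted earlier in the thread.
f0p408n6v4 = [
    {"count": 19893, "meddra_code": "10035226", "name": "Plasma cell myeloma"},
    {"count": 1, "meddra_code": "10018304", "name": "Glaucoma"},
]

print(matches_document_level(f0p408n6v4, "10018304"))  # hit is returned by the query
print(matches_same_element(f0p408n6v4, "10018304"))    # but the filter leaves nothing
```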

colleenXu commented 2 months ago

The MyChem-query-level limit (aeolus.indications.count > 20) is now implemented in the reverse direction too in Dev/CI!

Adding the new parameter jmespath_exclude_empty: true removed the hits that didn't match both criteria (count > 20 AND meddra field's value matches the input ID) - so BTE can parse the API response without issues. Commits:

Thanks to @newgene @DylanWelzel for the BioThings SDK/MyChem update
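The observable effect of `jmespath_exclude_empty: true` can be emulated client-side as follows. This sketch does not reproduce how the BioThings SDK implements the parameter; `transform_and_exclude_empty` is a hypothetical helper, and the hits are abridged from records quoted earlier in the thread.

```python
def transform_and_exclude_empty(
    hits: list[dict], queried_codes: set[str], min_count: int = 20
) -> list[dict]:
    """Filter each hit's indications, then drop hits left with an empty array."""
    results = []
    for hit in hits:
        kept = [
            ind
            for ind in hit["aeolus"]["indications"]
            if ind["count"] > min_count and ind["meddra_code"] in queried_codes
        ]
        if not kept:
            continue  # this is the step a plain per-hit transform cannot do
        results.append({**hit, "aeolus": {**hit["aeolus"], "indications": kept}})
    return results


hits = [
    {"_id": "F0P408N6V4", "aeolus": {"unii": "F0P408N6V4", "indications": [
        {"count": 19893, "meddra_code": "10035226", "name": "Plasma cell myeloma"},
        {"count": 1, "meddra_code": "10018304", "name": "Glaucoma"},
    ]}},
    {"_id": "1RXS4UE564", "aeolus": {"unii": "1RXS4UE564", "indications": [
        {"count": 26, "meddra_code": "10038867", "name": "Retinal haemorrhage"},
    ]}},
]

remaining = transform_and_exclude_empty(hits, {"10018304", "10038867"})
print([h["_id"] for h in remaining])
```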


So the current situation in Dev/CI:

colleenXu commented 2 months ago

@tokebe @andrewsu

I know we've been discussing the aeolus edge-attribute format (flattening arrays into ints) in the edge-attribute constraint issue (part 1 here, and decision here). But I think it'd make sense to add it to this issue and track its deployment here.

What do you think?

colleenXu commented 2 months ago

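And a note - because the hard-coded limit of > 20 is for individual records, BTE won't return an edge for the following theoretical edge case:

  • individual record counts are <20
  • but BTE/NodeNorm would have merged records together and after the flattening/summation, the edge's count would have been > 20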

I asked Andrew, and he said that this is fine for now.

colleenXu commented 2 months ago

Addressed by this commit directly to main: https://github.com/biothings/bte_trapi_query_graph_handler/commit/b0fc94d762ad17d277bc6ddfa635ba60cc3e28aa

I've confirmed that the flattening/summation works as-intended :)


Example based on the example in Part 1 here

Example query

Send to MyChem thru BTE: `http://localhost:3000/v1/smartapi/8f08d1446e0bb9c2b323713ce83e2bd3/query`

```
{
    "message": {
        "query_graph": {
            "nodes": {
                "n0": {
                    "ids":["UNII:01K63SUP8D"],
                    "categories":["biolink:SmallMolecule"]
                },
                "n1": {
                    "categories":["biolink:Disease"]
                }
            },
            "edges": {
                "e1": {
                    "subject": "n0",
                    "object": "n1",
                    "predicates": ["biolink:applied_to_treat"]
                }
            }
        }
    }
}
```

Previously, we'd get edges from the aeolus operations that look like this:

                "dd9daae5b03bcad0698ff6669090f36b": {
                    "predicate": "biolink:applied_to_treat",
                    "subject": "PUBCHEM.COMPOUND:3386",
                    "object": "MEDDRA:10070592",
                    "attributes": [
                        {
                            "attribute_type_id": "biolink:evidence_count",
                            "value": [
                                875
                            ]
                        }
                    ],

                "1feea171db6394cfd9bcb20deae0ad9a": {
                    "predicate": "biolink:applied_to_treat",
                    "subject": "PUBCHEM.COMPOUND:3386",
                    "object": "MONDO:0002050",
                    "attributes": [
                        {
                            "attribute_type_id": "biolink:evidence_count",
                            "value": [
                                733,
                                42
                            ]
                        }
                    ],

After the commit, these edges look like this: the edge-attribute values are ints and sums if there were values from multiple records.

                "dd9daae5b03bcad0698ff6669090f36b": {
                    "predicate": "biolink:applied_to_treat",
                    "subject": "PUBCHEM.COMPOUND:3386",
                    "object": "MEDDRA:10070592",
                    "attributes": [
                        {
                            "attribute_type_id": "biolink:evidence_count",
                            "value": 875
                        },

                "1feea171db6394cfd9bcb20deae0ad9a": {
                    "predicate": "biolink:applied_to_treat",
                    "subject": "PUBCHEM.COMPOUND:3386",
                    "object": "MONDO:0002050",
                    "attributes": [
                        {
                            "attribute_type_id": "biolink:evidence_count",
                            "value": 775
                        },
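
The before/after behavior shown in these edges amounts to summing the list of counts into a single int. The actual change lives in the linked bte_trapi_query_graph_handler commit; the Python below only mirrors the observed input and output, and `flatten_evidence_count` is a hypothetical name.

```python
from typing import Union


def flatten_evidence_count(value: Union[int, list[int]]) -> int:
    """Collapse a list of per-record counts into one summed int."""
    if isinstance(value, list):
        return sum(value)
    return value


# Values taken from the two example edges above.
print(flatten_evidence_count([875]))      # single record
print(flatten_evidence_count([733, 42]))  # two records merged onto one edge
```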

colleenXu commented 3 weeks ago

The flattening/summing code was deployed today to Prod as part of the Octopus release. I tested and it's live.

Summary of what was done in this issue:

Noting one edge case (pasted from above comment):

And a note - because the hard-coded limit of > 20 is for individual records, BTE won't return an edge for the following theoretical edge case:

  • individual record counts are <20
  • but BTE/NodeNorm would have merged records together and after the flattening/summation, the edge's count would have been > 20

I asked Andrew, and he said that this is fine for now.
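
The edge case comes from the order of operations: the threshold is applied per record before records are merged and summed. A small sketch with made-up counts (15 and 12 are hypothetical, as are both function names) shows the difference.

```python
from typing import Optional


def filter_then_sum(record_counts: list[int], min_count: int = 20) -> Optional[int]:
    """Current behavior: drop low-count records first, then sum what is left."""
    kept = [count for count in record_counts if count > min_count]
    return sum(kept) if kept else None  # None means no edge is returned


def sum_then_filter(record_counts: list[int], min_count: int = 20) -> Optional[int]:
    """Alternative order: merge records first, then apply the threshold."""
    total = sum(record_counts)
    return total if total > min_count else None


# Two hypothetical records that would be merged onto one edge.
counts = [15, 12]
print(filter_then_sum(counts))  # no edge: each record is below the limit
print(sum_then_filter(counts))  # an edge with a combined count above the limit
```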