biothings / biothings_explorer

TRAPI service for BioThings Explorer
https://explorer.biothings.io
Apache License 2.0
10 stars 11 forks source link

investigate refactoring document structure in semmeddb API #569

Closed andrewsu closed 1 year ago

andrewsu commented 1 year ago

Right now, we have a distinct record for every row in the PREDICATIONS file. For example, we currently have 73 documents for "imatinib TREATS leukemia", each differing by the pmid and predication_id: https://biothings.transltr.io/semmeddb/query?q=subject.umls:C0935989%20AND%20object.umls:C0023418%20AND%20predicate:TREATS

{
  "hits": [
    {
      "_id": "72444465",
      "_score": 19.359053,
      "object": {
        "name": "leukemia",
        "umls": "C0023418"
      },
      "pmid": 24979952,
      "predicate": "TREATS",
      "predication_id": 72444465,
      "subject": {
        "name": "imatinib",
        "umls": "C0935989"
      }
    },
    {
      "_id": "77117922",
      "_score": 19.359053,
      "object": {
        "name": "leukemia",
        "umls": "C0023418"
      },
      "pmid": 23934954,
      "predicate": "TREATS",
      "predication_id": 77117922,
      "subject": {
        "name": "imatinib",
        "umls": "C0935989"
      }
    },
    ...
  ]
}

We could also consider a design in which rows are combined based on unique subjects, predicates, and objects, collapsing the evidence into an array. For the example above:

{
  "took": 5,
  "total": 73,
  "max_score": 19.359053,
  "hits": [
    {
      "_id": "xxxxxxxx",
      "_score": 19.359053,
      "object": {
        "name": "leukemia",
        "umls": "C0023418"
      },
      "predicate": "TREATS",
      "evidence": [
        {
          "pmid": "24979952",
          "predication_id: "72444465"
        },
        {
          "pmid": "23934954",
          "predication_id: "77117922"
        },
        ...
      ],
      "subject": {
        "name": "imatinib",
        "umls": "C0935989"
      }
    }
  ]
}

That would result in fewer documents, but some documents that are potentially very big (especially if we add sentences as is being explored in https://github.com/biothings/BioThings_Explorer_TRAPI/issues/563). BTE does this collapsing eventually anyway, and making that change at the API level would give us the ability to add a pmid_count field (which we could use and dynamically adjust in queries). Creating this issue to track other pros/cons, feasibility analysis, etc.

erikyao commented 1 year ago

It's feasible with a customized RootKeyMergerStorage.

Note that {"on_duplicates" : "merge"} config in the manifest.json (as shown below) is NOT capable of solving this problem because it only finds duplicates on _id.

{
    "uploaders" : [
        { 
            "parser" : "<path.to.module>:<function_name_1>",
            "on_duplicates" : "ignore|error|merge"  # optional, default to "error"
        }
    ]
}
andrewsu commented 1 year ago

At the 2023-03-01 team meeting, we agreed that the proposed structure would be better. We may need to put in some caps on the number of items in the evidence array, since I seem to remember that the count can go 10,000+.

So, let's move forward with this ticket and #563 in parallel. Assigning to @erikyao...

erikyao commented 1 year ago

Discussion on Mar.16 with Andrew


Q: Does it mean we shall group all the documents by the triple of (object.umls, predicate, subject.umls)? Since object.name and subject.name are not standardized across the dataset.

A: I think the answer is yes that we want to "group all the documents by the triple of (object.umls, predicate, subject.umls)". Beyond that, I think we want to do what's easiest during this exploratory period. So I think it'd be great to add all unique values in an array for object.name and subject.name, but choosing one arbitrarily is fine for the moment too.


Q: Would you prefer taking the semtypes into account? I.e. grouping by (object.umls, object.semtpye, predicate, subject.umls, subject.semtype)? Since one UMLS CUI may be mapped to multiple semtypes.

A: My gut says that we can also just combine unique values of semtypes in an array as well, but I'd like @colleenXu to chime in on whether that's an issue for the SmartAPI annotation. If that is a problem, then perhaps we need to aggregate based on unique records for (object.umls, object.semtype, predicate, subject.umls, subject.semtype).

erikyao commented 1 year ago

Sample documents

7 Predications with 7 PMIDs

    {
        '_id': 'C0033011-PART_OF-C0032105',
        'predication_id': [
            27232377,
            31238107,
            40075886,
            72239153,
            72300825,
            72303926,
            72715980
        ],
        'pmid': [27753038, 1268148, 4276847, 23690098, 26287784, 26287791, 24659861],
        'predicate': 'PART_OF',
        'subject': {
            'umls': 'C0033011',
            'name': 'Plasma',
            'semantic_type_abbreviation': 'bdsu',
            'semantic_type_name': 'Body Substance',
            'novelty': 1
        },
        'object': {
            'umls': 'C0032105',
            'name': 'Pregnant Women',
            'semantic_type_abbreviation': 'humn',
            'semantic_type_name': 'Human',
            'novelty': 1
        }
    }

2 Predications with 1 PMID

    {
        '_id': 'C0012524-PART_OF-C0012854',
        'predication_id': [72324048, 72324366],
        'pmid': 16626110,
        'predicate': 'PART_OF',
        'subject': {
            'umls': 'C0012524',
            'name': 'DNA',
            'semantic_type_abbreviation': 'bacs',
            'semantic_type_name': 'Biologically Active Substance',
            'novelty': 1
        },
        'object': {
            'umls': 'C0012854',
            'name': 'Catechol Oxidase',
            'semantic_type_abbreviation': 'aapp',
            'semantic_type_name': 'Amino Acid, Peptide, or Protein',
            'novelty': 1
        }
    },
andrewsu commented 1 year ago

So I'm guessing that most records will be like the first example (7 Predications with 7 PMIDs), or at least most people will assume that is the more common pattern. And I get that it can get even more confusing than the second example -- when we have 3 predications and 2 PMIDs, then we lose the mappings between predications and PMIDs. Having said all that, I think this structure is fine. For now, the pmid field will be interpreted as "supporting articles", and just showing the set of unique values from which the predications are drawn will be just fine.

erikyao commented 1 year ago

For the second example, we can also make it

{
    "predicate": "PART_OF"
    "predication": [
        {
            "predication_id": 72324048,
            "pmid": "16626110",
            # "sentence_id": "omitted"
            # "sentence": "omitted"
        },
        {
            "predication_id": 72324366,
            "pmid": "16626110",
            # "sentence_id": "omitted"
            # "sentence": "omitted"
        },
    ],
    "subject": { ... }
    "object": { ... }
}
andrewsu commented 1 year ago

Oh right, that's even better. That's similar to what I proposed in my initial comment, but apparently that flash of brilliance quickly escaped my brain... ;)

erikyao commented 1 year ago

Code is forked to my own repo: https://github.com/erikyao/semmeddb2. An initial fix is committed already. Will deploy as a new API semmeddb2 tonight.

Considering ITRB might not sync the code change over the weekend. I will redirect access to semmeddb2 to the old biothings.ncats.io server (some NGINX config change needed).

colleenXu commented 1 year ago

I've made a separate SmartAPI yaml registration for this API: https://smart-api.info/registry?q=5634bdb9a1315f85f595979eaf90a504

It has the same operations/MetaEdges as the current SEMMEDDB API.

How to query this API through BTE POST using the api-specific endpoint of dev/CI instances of BTE (or a local instance that is using main branches). Those instances will have qualifier/biolink3-support and other features. For example, one can use the CI-instance of BTE by sending POST requests to this url: `https://bte.ci.transltr.io/v1/smartapi/5634bdb9a1315f85f595979eaf90a504/query`
A complicated query I used for testing uses reverse operations + multiple ID inputs + qualifier support ``` { "message": { "query_graph": { "nodes": { "n0": { "ids": ["UMLS:C0164786", "UMLS:C1367307"], "categories": ["biolink:Gene"], "name": "Proto-Oncogene Proteins c-akt, STAT3 gene" }, "n1": { "categories": ["biolink:Polypeptide"] } }, "edges": { "e01": { "subject": "n0", "object": "n1", "predicates": ["biolink:affected_by"], "qualifier_constraints": [ { "qualifier_set": [ { "qualifier_type_id": "biolink:subject_direction_qualifier", "qualifier_value": "increased" } ] } ] } } } } } ``` Response has the expected info: [Record with UMLS:C0164786 Proto-Oncogene Proteins c-akt](https://biothings.ncats.io/semmeddb2/query?q=subject.semantic_type_abbreviation:aapp%20AND%20subject.umls:C0083735%20AND%20predicate:STIMULATES%20AND%20object.semantic_type_abbreviation:gngm%20AND%20object.umls:C0164786%20AND%20pmid_count:%3E5) How it looks in the response: ![Screen Shot 2023-03-23 at 5 20 09 PM](https://user-images.githubusercontent.com/43731687/227392368-09501ffd-3bfa-4f30-bb73-dab01d4e6a38.png)

The main differences are:

andrewsu commented 1 year ago

closing this one as complete

colleenXu commented 1 year ago

Updated the notebook / yamls https://github.com/NCATS-Tangerine/translator-api-registry/commit/794ffc25f5f6cfbb3c27b5fd87dcb485d19ac610 + registration so BTE will ingest the supporting sentences and so BTE will use the filter >1 (agreed upon by Translator (Slack link)).

Next step is moving BTE from using the older "semmeddb" to using "semmeddb2"...

colleenXu commented 1 year ago

The separate registration linked above has been DELETED. From now on, use the semmeddb registration ID to access "semmeddb2" functionality https://smart-api.info/registry?q=1d288b3a3caf75d541ffaae3aab386c8.

This is because we decided that we liked this refactored document structure, and we want BTE to use it. The simplest way to switch BTE over to using semmeddb2 is to modify the registration of "semmeddb" that BTE is currently hooked up to.

This commit was done by replacing the contents of the "semmeddb" files:

Then the registration for "semmedb" was refreshed, and the old semmeddb registration was deleted.