biothings / pending.api

Set of standalone APIs built with the BioThings SDK for the Translator Project
https://biothings.ncats.io
Apache License 2.0

new API for GTRx #62

Closed andrewsu closed 1 year ago

andrewsu commented 2 years ago

Our partners at RCIGM have created GTRx (http://gtrx.rbsapp.net/), which is "a virtual, automated system for acute management guidance for seriously ill newborns, infants and children with newly diagnosed genetic diseases". GTRx links rare diseases (and genes underlying those rare diseases) to therapies (mostly drugs, but also dietary adjustments, surgery, etc.).

GTRx is currently available only as a website, but they have provided a database dump to me. The dump is a very sparse CSV with ~457 rows/records (one for each unique disease/gene pair) and ~4600 columns (with all interventions, provenance, evidence, etc. captured in columns), so it will take a little work to transform into JSON documents. It would be very valuable to create a new API for this resource.
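As a rough sketch of the first transformation step, each row of a very sparse CSV can be reduced to only its non-blank columns before any further restructuring. The inline sample below is a tiny stand-in with a handful of columns, and the `record_id` column name and `compact_rows` helper are assumptions made for illustration, not taken from the actual dump or parser.

```python
import csv
import io


def compact_rows(csv_text):
    """Yield each CSV row as a dict holding only its non-blank columns."""
    reader = csv.DictReader(io.StringIO(csv_text))
    for row in reader:
        yield {col: val.strip() for col, val in row.items() if val and val.strip()}


# Tiny stand-in for the real dump, which has thousands of mostly empty columns.
# ASSUMPTION: the identifier column is called "record_id".
sample = (
    "record_id,int_description_1,int_description_2,use_group_1\n"
    "10983-OMIM:605814,Lactose-free therapeutic formulas,,Delete\n"
)

rows = list(compact_rows(sample))
print(rows[0])
```

Dropping blanks per row keeps each intermediate document small, which makes the later grouping steps easier to inspect.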

andrewsu commented 2 years ago

Additional analysis of Record_ID 1390-OMIM:601005 (https://gtrx.radygenomiclab.com/?q=24&i=true NOTE: this URL may not be stable)

Based on my understanding now, I think we have enough information to parse the database dump into an array of "association-style" objects for subsequent API creation...

andrewsu commented 2 years ago

assigned to @mnarayan1

mnarayan1 commented 2 years ago

Parser: https://github.com/mnarayan1/GTRx/blob/main/parser.py
Output file: https://github.com/mnarayan1/GTRx/blob/main/gtrx_data.json
Sample record:

{   
    "_id": "10983-OMIM:605814", 
    "condition": {
        "condition_name": "CITRULLINEMIA, TYPE II, NEONATAL-ONSET", 
        "freq_per_birth": "Type I citrullinemia is the most common form of the disorder, affecting about 1 in 57,000 people worldwide...", 
        "pattern_of_inheritance": "Autosomal recessive", 
        "alternate_names": ["CITRULLINEMIA, TYPE II, NEONATAL-ONSET"],
        "clinical_summary": ["Citrullinemia is an inherited disorder that causes ammonia and other toxic substances to accumulate in the blood..."]
    }, 
    "gene information": {
        "db_hgnc_gene_id": 10983.0,
        "db_hgnc_gene_symbol": "SLC25A13"
    }, 
    "interventions": [
        {   
            "int_description_3": "Supplement diet with fat-soluble vitamins", 
            "timeframe_int3": "Days or Weeks,Years", 
            "age_use_int3": "Neonate,Infant", 
            "contra_int3": "No", 
            "qualscale_reclass_drug3": "Authoritative published clinical practice guideline", 
            "rev1_eff_reclass_drug3": "Still in Trials / Unproven"
        },
        ...
    ],
    "references": [
        {
            "pmid_title_1": "Current treatment for citrin deficiency during NICCD and adaptation/compensation stages: Strategy to prevent CTLN2.", 
            "pmid_1": 31255436.0, 
            "pmid_date_1": 2019.0,
            "pmid_journal_1": "Molecular genetics and metabolism"
        },
        ...
    ]
}

andrewsu commented 2 years ago

There is a filtering process for the interventions that needs to be added. Let's take the example above for 10983-OMIM:605814 (which corresponds to this web view). The interventions section in the JSON currently has 12 elements, which is reasonable given that there are 12 non-blank columns starting with int_description_. However, the starting point for parsing this file should actually be the set of columns starting with use_group_.

col name       value
use_group_1    Delete
use_group_10   Retain
use_group_11   Retain,Add note about group
use_group_2    Delete
use_group_3    Retain,Add note about group
use_group_4    Delete
use_group_5    Retain,Add note about group
use_group_6    Retain
use_group_7    Retain
use_group_8    Delete
use_group_9    Delete

For now, we only want to capture the six groups that start with "Retain", and those are the ones that are shown on the corresponding webpage. (There is actually a seventh intervention shown on the webpage, but let's ignore that for a moment.)
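A minimal sketch of this filter, assuming each record is available as a dict of column name to string value. The `retained_groups` helper is hypothetical and is not taken from the actual parser; the values are the ones from the table above.

```python
import re


def retained_groups(row):
    """Return the group numbers whose use_group_N value starts with 'Retain'."""
    kept = []
    for col, val in row.items():
        match = re.fullmatch(r"use_group_(\d+)", col)
        if match and val.strip().startswith("Retain"):
            kept.append(int(match.group(1)))
    return sorted(kept)


row = {
    "use_group_1": "Delete",
    "use_group_10": "Retain",
    "use_group_11": "Retain,Add note about group",
    "use_group_2": "Delete",
    "use_group_3": "Retain,Add note about group",
    "use_group_4": "Delete",
    "use_group_5": "Retain,Add note about group",
    "use_group_6": "Retain",
    "use_group_7": "Retain",
    "use_group_8": "Delete",
    "use_group_9": "Delete",
}

print(retained_groups(row))  # [3, 5, 6, 7, 10, 11]
```

Matching on the prefix rather than the full value keeps both "Retain" and "Retain,Add note about group".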

The 11 groups above can be mapped to the 12 interventions using these columns:

col name         value
level2_group10   [int_description_9]
level2_group11   [int_description_12]
level2_group3    [int_description_6]
level2_group5    [int_description_1]
level2_group6    [int_description_2]
level2_group7    [int_description_3]

... and those in turn can be mapped to human-readable names using these columns:

col name             value
int_description_9    Medium-chain triglycerides
int_description_12   Orthotopic liver transplantation (OLT)
int_description_6    Pyruvic acid
int_description_1    Lactose-free therapeutic formulas
int_description_2    Lipid and protein-rich low-carbohydrate diet
int_description_3    Supplement diet with fat-soluble vitamins

So somewhat confusingly, there are two sets of numberings -- one for interventions (which run from 1-12) and one for intervention groups (which run from 1-11).
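The two-step lookup (group to intervention column, then column to name) could be sketched as below. This assumes a level2_group cell contains one or more bracketed column names such as "[int_description_9]"; how several references are separated inside a single cell is not shown in this thread, so the regular expression simply collects every bracketed name. `group_interventions` is a hypothetical helper.

```python
import re


def group_interventions(row, group):
    """Resolve one intervention group to the names of its interventions."""
    cell = row.get(f"level2_group{group}", "")
    # ASSUMPTION: each referenced column appears as "[int_description_N]".
    columns = re.findall(r"\[(int_description_\d+)\]", cell)
    return [row[col] for col in columns if row.get(col)]


row = {
    "level2_group10": "[int_description_9]",
    "level2_group3": "[int_description_6]",
    "int_description_9": "Medium-chain triglycerides",
    "int_description_6": "Pyruvic acid",
}

print(group_interventions(row, 10))  # ['Medium-chain triglycerides']
print(group_interventions(row, 3))   # ['Pyruvic acid']
```

Returning a list rather than a single name leaves room for groups that reference more than one intervention.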

Columns relating to interventions

Columns relating to intervention groups


Additional complications

There are two added complications that aren't covered above.

First, there is often, but not always, a one-to-one mapping between interventions and intervention groups. For example, in 1390-OMIM:601005, both Atenolol (int_description_7) and Nadolol (int_description_8) are gathered into a single intervention group (level2_group4). You can see that reflected on the web page for this disease.


Second, there is another set of columns for "additional interventions". An example is in 10306-OMIM:615550. The corresponding web page shows 4 interventions (actually intervention groups), but there are only two Retain entries in the use_group_ columns. The two additional interventions are found in these columns:

col name                value
add_int_description_1   corticosteroids
add_int_description_2   Deferoxamine,Deferasirox
add_int_detail_1        Corticosteroids can initially improve the red blood count
add_int_detail_2        This group of medications used to treat iron overload from chronic transfusion. Deferiprone is not recommended in the treatment of iron overload in individuals with DBA [PMID 18671700] because its side effects include neutropenia
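One possible way to collect these columns is to pair each add_int_description_N with the add_int_detail_N that shares its number. The sketch assumes a record is a dict of column name to string value; `additional_interventions` is a hypothetical helper, and the output keys are chosen for illustration.

```python
import re


def additional_interventions(row):
    """Pair each add_int_description_N with its add_int_detail_N, if any."""
    found = []
    for col, val in row.items():
        match = re.fullmatch(r"add_int_description_(\d+)", col)
        if match and val.strip():
            number = int(match.group(1))
            detail = row.get(f"add_int_detail_{number}", "")
            found.append((number, {
                "add_int_description": val.strip(),
                "add_int_detail": detail.strip(),
            }))
    # Sort by the numeric suffix so that 10 comes after 9, not after 1.
    return [doc for _, doc in sorted(found, key=lambda pair: pair[0])]


row = {
    "add_int_description_2": "Deferoxamine,Deferasirox",
    "add_int_description_1": "corticosteroids",
    "add_int_detail_1": "Corticosteroids can initially improve the red blood count",
    # add_int_detail_2 is left out here for brevity.
}

for doc in additional_interventions(row):
    print(doc)
```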

andrewsu commented 2 years ago

After discussing with @colleenXu, we agree that we need to break this down to an "association style" API, where each record represents the association between a single disease and a single intervention. Here is a new sample record:

{   
    "_id": "OMIM:605814-8558G7RUTR", 
    "subject": {
        "condition_name": "CITRULLINEMIA, TYPE II, NEONATAL-ONSET", 
        "freq_per_birth": "Type I citrullinemia is the most common form of the disorder, affecting about 1 in 57,000 people worldwide...", 
        "pattern_of_inheritance": "Autosomal recessive", 
        "alternate_names": ["CITRULLINEMIA, TYPE II, NEONATAL-ONSET"],
        "clinical_summary": ["Citrullinemia is an inherited disorder that causes ammonia and other toxic substances to accumulate in the blood..."],
        "omim": 605814
    }, 
    "object": {
        "description": "pyruvic acid",
        "inxight": "8558G7RUTR",
        "int_class": "medicine",
        "level2_group": 3,
        "priority_class": 2,
        "timeframe": "Days or Weeks,Years", 
        "age_use": "Neonate,Infant", 
        "contra": "No", 
        "qualscale_reclass": "Authoritative published clinical practice guideline", 
        "rev1_eff_reclass": "Still in Trials / Unproven"
    },
    "predicate": "treated_by",
    "references": [
        {
            "title": "Current treatment for citrin deficiency during NICCD and adaptation/compensation stages: Strategy to prevent CTLN2.", 
            "pmid": 31255436, 
            "date": 2019,
            "journal": "Molecular genetics and metabolism"
        },
        ...
    ]
}
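Comparing the two sample records, the reference fields lose their pmid_ prefix and numeric suffix (pmid_title_1 becomes title), and float-valued numbers such as 31255436.0 become integers. A minimal sketch of that clean-up follows; `normalize_reference` is a hypothetical helper and not necessarily how the actual parser does it.

```python
import re


def normalize_reference(raw):
    """Rename suffixed reference keys and convert whole-number floats to ints."""
    cleaned = {}
    for key, val in raw.items():
        name = re.sub(r"_\d+$", "", key)    # pmid_title_1 -> pmid_title
        name = re.sub(r"^pmid_", "", name)  # pmid_title -> title; bare "pmid" is kept
        if isinstance(val, float) and val.is_integer():
            val = int(val)                  # 31255436.0 -> 31255436
        cleaned[name] = val
    return cleaned


raw = {
    "pmid_title_1": "Current treatment for citrin deficiency during NICCD and "
                    "adaptation/compensation stages: Strategy to prevent CTLN2.",
    "pmid_1": 31255436.0,
    "pmid_date_1": 2019.0,
    "pmid_journal_1": "Molecular genetics and metabolism",
}

print(normalize_reference(raw))
```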

A few things to note:

And finally, one very important filter. For now, let's only consider interventions that have INXIGHT IDs in the int_link_ columns. If an intervention does not have a link, or if its link goes to redcap.radygenomiclab.com, let's just ignore it. In a separate GitHub issue, we'll track a potential project to manually map these other interventions (mostly non-drug interventions) to UMLS and/or MeSH, but that is out of scope for this issue.
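This filter could look roughly like the sketch below. The format of the int_link_ values is not shown in this thread, so the code assumes the INXIGHT ID is the last path segment of the link; the example URL and the `inxight_id` helper are illustrative assumptions.

```python
from urllib.parse import urlparse

EXCLUDED_HOST = "redcap.radygenomiclab.com"


def inxight_id(link):
    """Return the trailing ID of an intervention link, or None to skip it."""
    if not link or not link.strip():
        return None  # no link at all
    parsed = urlparse(link.strip())
    if parsed.hostname == EXCLUDED_HOST:
        return None  # link points to the REDCap site instead of INXIGHT
    # ASSUMPTION: the INXIGHT ID is the last segment of the URL path.
    last_segment = parsed.path.rstrip("/").rsplit("/", 1)[-1]
    return last_segment or None


print(inxight_id("https://drugs.ncats.io/drug/8558G7RUTR"))
print(inxight_id("https://redcap.radygenomiclab.com/some/page"))
print(inxight_id(""))
```

Returning None for both skipped cases lets the caller drop an intervention with a single check.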

mnarayan1 commented 2 years ago

Updated Parser: https://github.com/mnarayan1/GTRx/blob/main/parser.py
Sample Output File: https://github.com/mnarayan1/GTRx/blob/main/gtrx_data.json

Here is a sample record:

{
    '_id': 'OMIM:250250-MZ1IW7Q79D', 
    'subject': {
        'condition_name': 'CARTILAGE-HAIR HYPOPLASIA; CHH', 
        'freq_per_birth': 'Cartilage-hair hypoplasia occurs most often...', 
        'pattern_of_inheritance': 'Autosomal recessive', 
        'clinical_summary': ["Cartilage-hair hypoplasia is a disorder of bone growth..."], 
        'alternate_names': ['ANAUXETIC DYSPLASIA 1; ANXD1', 'CARTILAGE-HAIR HYPOPLASIA; CHH', 'Omenn syndrome', 'METAPHYSEAL DYSPLASIA WITHOUT HYPOTRICHOSIS; MDWH'], 
        'omim': '250250'
    }, 
    'predicate': 'treated_by', 
    'object': {
        'intervention': [{'description': 'Valaciclovir', 'inxight': 'MZ1IW7Q79D', 'int_class': 'medicine'}], 
        'level2_group': '18', 
        'timeframe': 'Hours', 
        'age_use': 'Neonate,Infant,Child', 
        'contra': 'No', 
        'qualscale_reclass': 'Authoritative published clinical practice guideline', 
        'rev1_eff_reclass': 'Effective / Ameliorative'
    }, 
    'references': [
        {
            'title': 'A Systematic Review on Predisposition...', 
            'pmid': 31057537, 
            'date': 2019, 
            'journal': 'Frontiers in immunology'
        },
        ...
    ]
}

andrewsu commented 2 years ago

This looks pretty great to me. I think it's safe to hand over to @erikyao for API creation!

erikyao commented 2 years ago

Hi @andrewsu and @mnarayan1, could you please help determine the "release number" of the data file, GTRx_Joined_Data2-1-2022.xlsx? (like 02-01-2022?) Thanks!

andrewsu commented 2 years ago

I think you can put "2022-02-01" for the version number. Thanks!
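If the version should be derived from the file name rather than typed by hand, a small sketch could look like this. It assumes the date in the name is written month-day-year, which matches the answer above; `release_from_filename` is a hypothetical helper.

```python
import re
from datetime import date


def release_from_filename(name):
    """Extract an M-D-YYYY date from a file name and return it as YYYY-MM-DD."""
    match = re.search(r"(\d{1,2})-(\d{1,2})-(\d{4})", name)
    if match is None:
        return None
    # ASSUMPTION: the order in the file name is month, day, year.
    month, day, year = (int(part) for part in match.groups())
    return date(year, month, day).isoformat()


print(release_from_filename("GTRx_Joined_Data2-1-2022.xlsx"))  # 2022-02-01
```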

erikyao commented 2 years ago

@andrewsu thank you! I've forked the plugin repo to https://github.com/biothings/GTRx. Will deploy soon.

erikyao commented 2 years ago

Hi @mnarayan1 , I got the following error when uploading the documents:

{
  "writeErrors": [
    {
      "index": 6,
      "code": 11000,
      "keyPattern": {
        "_id": 1
      },
      "keyValue": {
        "_id": "OMIM:615550"
      },
      "errmsg": "E11000 duplicate key error collection: pending_src.GTRx_temp_swUhrMVk index: _id_ dup key: { _id: 'OMIM:615550' }",
      "op": {
        "_id": "OMIM:615550",
        "subject": {
          "condition_name": "DIAMOND-BLACKFAN ANEMIA 12",
          "pattern_of_inheritance": "Autosomal dominant",
          "clinical_summary": [...],
          "alternate_names": [
            "DIAMOND-BLACKFAN ANEMIA 12"
          ],
          "omim": "615550"
        },
        "predicate": "treated_by",
        "object": {
          "add_int_description": "corticosteroids",
          "add_int_detail": "Corticosteroids can initially improve the red blood count "
        },
        "references": [...]
      }
    }
  ],
  "writeConcernErrors": [],
  "nInserted": 6,
  "nUpserted": 0,
  "nMatched": 0,
  "nModified": 0,
  "nRemoved": 0,
  "upserted": []
}

Looks like we have multiple documents with _id = "OMIM:615550". Could you please take a look at the data and parser? Thanks!
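A quick way to find such collisions before uploading is to count the _id values in the parser output. This is a minimal sketch that assumes the documents are already loaded as a list of dicts; `duplicate_ids` is a hypothetical helper.

```python
from collections import Counter


def duplicate_ids(docs):
    """Return every _id that occurs more than once, in sorted order."""
    counts = Counter(doc["_id"] for doc in docs)
    return sorted(_id for _id, count in counts.items() if count > 1)


docs = [
    {"_id": "OMIM:615550"},
    {"_id": "OMIM:615550"},
    {"_id": "OMIM:605814-8558G7RUTR"},
]

print(duplicate_ids(docs))  # ['OMIM:615550']
```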

mnarayan1 commented 2 years ago

There seem to be two causes of repeated ids:

  1. Additional interventions don't have an INXIGHT ID to add to the _id, so their ids are not unique.
  2. Some conditions have the same OMIM ID but different numeric prefixes in their record_id. For example, there are three different records with an OMIM of 603554 (Omenn Syndrome), each with a unique numeric prefix that differentiates it (9832-OMIM:603554, 17642-OMIM:603554, 9831-OMIM:603554).

@andrewsu do you know where to find the INXIGHT IDs for the additional interventions, or is there a specific way you want to deal with them? Also, are you okay with keeping the numeric prefix at the front of the _id? An _id would look like 9832-OMIM:603554 instead of OMIM:603554.

andrewsu commented 2 years ago

  1. You can skip any records without INXIGHT IDs.
  2. Hmm, let's modify the plan for _id to be of the format [record_id]-[INXIGHT] (instead of [OMIM]-[INXIGHT]).
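Both decisions can be expressed in a few lines. This is only a sketch: `make_id` is a hypothetical helper, and "XXXXXXXXXX" is a placeholder rather than a real INXIGHT ID.

```python
def make_id(record_id, inxight):
    """Build '[record_id]-[INXIGHT]', or return None when there is no INXIGHT ID."""
    if not inxight:
        return None  # records without an INXIGHT ID are skipped
    return f"{record_id}-{inxight}"


# The three record_ids below share an OMIM ID but stay distinct under the new format.
record_ids = ["9832-OMIM:603554", "17642-OMIM:603554", "9831-OMIM:603554"]
ids = [make_id(record_id, "XXXXXXXXXX") for record_id in record_ids]  # placeholder ID

print(ids)
print(make_id("10306-OMIM:615550", None))  # None, so this record is skipped
```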

erikyao commented 2 years ago

Hi @mnarayan1, please let me know when you fix the problem. Thanks!

mnarayan1 commented 2 years ago

The issue should be fixed. Let me know if there are still problems.

erikyao commented 2 years ago

API published: https://biothings.ncats.io/gtrx
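For illustration, a query URL for the published API could be assembled as below. The sketch assumes the usual BioThings-style /query endpoint with q and size parameters and uses the subject.omim field from the sample records; the endpoint layout and field availability are assumptions, and the code only builds the URL without sending a request.

```python
from urllib.parse import urlencode

BASE_URL = "https://biothings.ncats.io/gtrx"


def build_query_url(field, value, size=10):
    """Build a field-scoped query URL (ASSUMES a BioThings-style /query endpoint)."""
    params = {"q": f"{field}:{value}", "size": size}
    return f"{BASE_URL}/query?{urlencode(params)}"


url = build_query_url("subject.omim", "605814")
print(url)
```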

About the plugin:

colleenXu commented 2 years ago

Next step is writing a SmartAPI yaml for this API with x-bte annotation...

andrewsu commented 1 year ago

@mnarayan1 created a PR for the SmartAPI annotation at https://github.com/NCATS-Tangerine/translator-api-registry/pull/109, assigning @colleenXu to review

colleenXu commented 1 year ago

Question: do we need to provide object.level2_group or object.priority_class to users?

colleenXu commented 1 year ago

Note: going to move forward and merge the SmartAPI PR, then do the next steps of incorporating it into BTE. EDIT: Registered here and PR made to connect to BTE

However, I think the questions/notes on INXIGHT IDs and the questions in the above post should still be addressed by @andrewsu or @mnarayan1 ...

tokebe commented 1 year ago

This API has been deployed to Prod. Should the issue be closed?

colleenXu commented 1 year ago

Yep, closing it now