new API for GTRx - Githubissues

andrewsu commented 2 years ago

Our partners at RCIGM have created GTRx (http://gtrx.rbsapp.net/), which is "a virtual, automated system for acute management guidance for seriously ill newborns, infants and children with newly diagnosed genetic diseases". GTRx links rare diseases (and genes underlying those rare diseases) to therapies (mostly drugs, but also dietary adjustments, surgery, etc.).

GTRx is currently available only as a website, but they have provided a database dump to me. The dump is a very sparse CSV with ~457 rows/records (one for each unique disease/gene pair) and ~4600 columns (with all interventions, provenance, evidence, etc. captured in columns), so it will take a little work to transform into JSON documents. It would be very valuable to create a new API for this resource.
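
As a rough illustration of the first step, here is a minimal sketch of collapsing one very sparse CSV row into a dictionary that keeps only the populated cells. The `record_id` column name and the helper `non_blank_cells` are assumptions for illustration; the column layout of the real dump may differ.

```python
import csv
import io


def non_blank_cells(row):
    """Keep only the cells of a CSV row that hold a non-empty string."""
    return {
        col: val.strip()
        for col, val in row.items()
        if isinstance(val, str) and val.strip()
    }


# Tiny stand-in for the real dump. The header names are assumed;
# the values echo examples quoted in this thread.
sample_csv = io.StringIO(
    "record_id,int_description_2,int_description_4,use_group_6\n"
    "10983-OMIM:605814,Lipid and protein-rich low-carbohydrate diet,,Retain\n"
)

records = [non_blank_cells(row) for row in csv.DictReader(sample_csv)]
```

With ~4600 columns and most of them blank, dropping empty cells up front keeps the later grouping steps easier to inspect.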

andrewsu commented 2 years ago

Additional analysis of Record_ID 1390-OMIM:601005 (https://gtrx.radygenomiclab.com/?q=24&i=true NOTE: this URL may not be stable)

The website currently shows 10 interventions
In the database dump, these 10 interventions are found in two areas:
- 9 interventions for which the value of use_group_* is "retain"
- 1 additional intervention added from add_int_*
In most cases, there is a 1:1 mapping between level2_group* and int_description_*. The numbering of int_description_* matches int_class_*, int_link_*, age_use_int*, and likely others.
In one case (level2_group4), one level2_group* corresponds to two int_description_* columns ([int_description_7],[int_description_8] (Atenolol/Nadolol)), which together represent a single combination intervention composed of two individual interventions.

Based on my understanding now, I think we have enough information to parse the database dump into an array of "association-style" objects for subsequent API creation...

andrewsu commented 2 years ago

assigned to @mnarayan1

mnarayan1 commented 2 years ago

Parser: https://github.com/mnarayan1/GTRx/blob/main/parser.py Output file: https://github.com/mnarayan1/GTRx/blob/main/gtrx_data.json Sample record:

{   
    "_id": "10983-OMIM:605814", 
    "condition": {
        "condition_name": "CITRULLINEMIA, TYPE II, NEONATAL-ONSET", 
        "freq_per_birth": "Type I citrullinemia is the most common form of the disorder, affecting about 1 in 57,000 people worldwide...", 
        "pattern_of_inheritance": "Autosomal recessive", 
        "alternate_names": ["CITRULLINEMIA, TYPE II, NEONATAL-ONSET"],
        "clinical_summary": ["Citrullinemia is an inherited disorder that causes ammonia and other toxic substances to accumulate in the blood..."]
    }, 
    "gene information": {
        "db_hgnc_gene_id": 10983.0,
        "db_hgnc_gene_symbol": "SLC25A13"
    }, 
    "interventions": [
        {   
            "int_description_3": "Supplement diet with fat-soluble vitamins", 
            "timeframe_int3": "Days or Weeks,Years", 
            "age_use_int3": "Neonate,Infant", 
            "contra_int3": "No", 
            "qualscale_reclass_drug3": "Authoritative published clinical practice guideline", 
            "rev1_eff_reclass_drug3": "Still in Trials / Unproven"
        },
        ...
    ],
    "references": [
        {
            "pmid_title_1": "Current treatment for citrin deficiency during NICCD and adaptation/compensation stages: Strategy to prevent CTLN2.", 
            "pmid_1": 31255436.0, 
            "pmid_date_1": 2019.0,
            "pmid_journal_1": "Molecular genetics and metabolism"
        },
        ...
    ]
}

andrewsu commented 2 years ago

There is a filtering process for the interventions that needs to be added. Let's take the example above for 10983-OMIM:605814 (which corresponds to this web view). The interventions section in the JSON currently has 12 elements, which is reasonable given that there are 12 non-blank columns starting with int_description_. However, the starting point for parsing this file should actually be the set of columns starting with use_group_.

col name	value
use_group_1	Delete
use_group_10	Retain
use_group_11	Retain,Add note about group
use_group_2	Delete
use_group_3	Retain,Add note about group
use_group_4	Delete
use_group_5	Retain,Add note about group
use_group_6	Retain
use_group_7	Retain
use_group_8	Delete
use_group_9	Delete

For now, we only want to capture the six groups that start with "Retain", and those are the ones that are shown on the corresponding webpage. (There is actually a seventh intervention shown on the webpage, but let's ignore that for a moment.)
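
A minimal sketch of that filter, assuming each record is already a dict keyed by column name (the helper name `retained_groups` is hypothetical). It matches case-insensitively because the values appear as both "retain" and "Retain" in this thread.

```python
import re

USE_GROUP = re.compile(r"^use_group_(\d+)$")


def retained_groups(row):
    """Return the sorted group numbers whose use_group_* value starts with 'Retain'."""
    groups = []
    for col, val in row.items():
        match = USE_GROUP.match(col)
        if match and str(val).strip().lower().startswith("retain"):
            groups.append(int(match.group(1)))
    return sorted(groups)


# The use_group_* values listed above for 10983-OMIM:605814.
row = {
    "use_group_1": "Delete",
    "use_group_10": "Retain",
    "use_group_11": "Retain,Add note about group",
    "use_group_2": "Delete",
    "use_group_3": "Retain,Add note about group",
    "use_group_4": "Delete",
    "use_group_5": "Retain,Add note about group",
    "use_group_6": "Retain",
    "use_group_7": "Retain",
    "use_group_8": "Delete",
    "use_group_9": "Delete",
}

kept = retained_groups(row)
```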

The 11 groups above can be mapped to the 12 interventions using these columns

col name	value
level2_group10	[int_description_9]
level2_group11	[int_description_12]
level2_group3	[int_description_6]
level2_group5	[int_description_1]
level2_group6	[int_description_2]
level2_group7	[int_description_3]

... and those in turn can be mapped to human readable names using these columns

col name	value
int_description_9	Medium-chain triglycerides
int_description_12	Orthotopic liver transplantation (OLT)
int_description_6	Pyruvic acid
int_description_1	Lactose-free therapeutic formulas
int_description_2	Lipid and protein-rich low-carbohydrate diet
int_description_3	Supplement diet with fat-soluble vitamins

So somewhat confusingly, there are two sets of numberings -- one for interventions (which run from 1-12) and one for intervention groups (which run from 1-11).
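
The two numberings can be bridged by parsing the bracketed references in the level2_group columns. This is only a sketch under the assumption that the values always look like `[int_description_N]`, optionally comma-separated; `interventions_for_group` is a hypothetical helper name.

```python
import re

INT_REF = re.compile(r"\[int_description_(\d+)\]")


def interventions_for_group(row, group):
    """Resolve level2_group<N> to (intervention number, description) pairs."""
    refs = row.get(f"level2_group{group}") or ""
    numbers = [int(n) for n in INT_REF.findall(refs)]
    return [(n, row.get(f"int_description_{n}")) for n in numbers]


# One-to-one case from 10983-OMIM:605814.
row_a = {
    "level2_group3": "[int_description_6]",
    "int_description_6": "Pyruvic acid",
}

# Combination case from 1390-OMIM:601005: one group, two interventions.
row_b = {
    "level2_group4": "[int_description_7],[int_description_8]",
    "int_description_7": "Atenolol",
    "int_description_8": "Nadolol",
}

single = interventions_for_group(row_a, 3)
combined = interventions_for_group(row_b, 4)
```

Returning a list even for the one-to-one case means the combination groups need no special handling downstream.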

Columns relating to interventions

int_class_
int_description_
int_link_

Columns relating to intervention groups

priority_class_drug
qualscale_reclass_drug
rev1_eff_reclass_drug
timeframe_int

Additional complications

There are two added complications that aren't covered above.

First, there is often, but not always, a one-to-one mapping between interventions and intervention groups. For example, in 1390-OMIM:601005, both Atenolol (int_description_7) and Nadolol (int_description_8) are gathered into a single intervention group (level2_group4). You can see that reflected on the web page for this disease.

Second, there is another set of columns for "additional interventions". An example is in 10306-OMIM:615550. The corresponding web page shows 4 interventions (actually intervention groups), but there are only two Retain entries in the use_group_ columns. The two additional interventions are found in these columns:

col name	value
add_int_description_1	corticosteroids
add_int_description_2	Deferoxamine,Deferasirox
add_int_detail_1	Corticosteroids can initially improve the red blood count
add_int_detail_2	This group of medications used to treat iron overload from chronic transfusion. Deferiprone is not recommended in the treatment of iron overload in individuals with DBA [PMID 18671700] because its side effects include neutropenia
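
A minimal sketch of collecting these extra columns, assuming add_int_description_N and add_int_detail_N always share the same number N. The helper name `additional_interventions` is hypothetical, and the detail text is shortened here.

```python
import re

ADD_DESC = re.compile(r"^add_int_description_(\d+)$")


def additional_interventions(row):
    """Pair each add_int_description_N with its add_int_detail_N, ordered by N."""
    found = []
    for col, val in row.items():
        match = ADD_DESC.match(col)
        if match and val:
            number = int(match.group(1))
            found.append(
                (
                    number,
                    {
                        "add_int_description": val,
                        "add_int_detail": row.get(f"add_int_detail_{number}"),
                    },
                )
            )
    return [item for _, item in sorted(found, key=lambda pair: pair[0])]


# Columns from 10306-OMIM:615550 (detail text abbreviated).
row = {
    "add_int_description_2": "Deferoxamine,Deferasirox",
    "add_int_description_1": "corticosteroids",
    "add_int_detail_1": "Corticosteroids can initially improve the red blood count",
    "add_int_detail_2": "This group of medications used to treat iron overload...",
}

extras = additional_interventions(row)
```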

andrewsu commented 2 years ago

After discussing with @colleenXu, we agree that we need to break this down to an "association style" API, where each record represents the association between a single disease and a single intervention. Here is a new sample record:

{   
    "_id": "OMIM:605814-8558G7RUTR", 
    "subject": {
        "condition_name": "CITRULLINEMIA, TYPE II, NEONATAL-ONSET", 
        "freq_per_birth": "Type I citrullinemia is the most common form of the disorder, affecting about 1 in 57,000 people worldwide...", 
        "pattern_of_inheritance": "Autosomal recessive", 
        "alternate_names": ["CITRULLINEMIA, TYPE II, NEONATAL-ONSET"],
        "clinical_summary": ["Citrullinemia is an inherited disorder that causes ammonia and other toxic substances to accumulate in the blood..."],
        "omim": 605814
    }, 
    "object": {
        "description": "pyruvic acid",
        "inxight": "8558G7RUTR",
        "int_class": "medicine",
        "level2_group": 3,
        "priority_class": 2,
        "timeframe": "Days or Weeks,Years", 
        "age_use": "Neonate,Infant", 
        "contra": "No", 
        "qualscale_reclass": "Authoritative published clinical practice guideline", 
        "rev1_eff_reclass": "Still in Trials / Unproven"
    },
    "predicate": "treated_by",
    "references": [
        {
            "title": "Current treatment for citrin deficiency during NICCD and adaptation/compensation stages: Strategy to prevent CTLN2.", 
            "pmid": 31255436, 
            "date": 2019,
            "journal": "Molecular genetics and metabolism"
        },
        ...
    ]
}

A few things to note:

- removed information about the gene (the gene-disease link is readily available from other sources)
- added a predicate key whose value will always be "treated_by"
- added object.inxight, parsed from the corresponding int_link_ column
- modified the _id to be of the format [OMIM]-[INXIGHT]
- added subject.omim
- converted several fields that were parsed as floats into ints (e.g., 2019.0 -> 2019)
- changed several of the keys to remove the intervention / group number (e.g., pmid_1 -> pmid)
- added the object.level2_group key based on the mapping of intervention group to intervention
- FYI, the references section is duplicated in every disease-intervention record
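
Two of the changes above (dropping the trailing number from keys and turning whole-number floats into ints) can be sketched for the references section as follows. The key mapping in `REF_KEYS` is inferred from the two sample records in this thread, and the helper names are hypothetical.

```python
import re

# Mapping inferred from the before/after sample records in this thread.
REF_KEYS = {
    "pmid_title": "title",
    "pmid": "pmid",
    "pmid_date": "date",
    "pmid_journal": "journal",
}
NUMBERED = re.compile(r"^(.*)_(\d+)$")


def as_int_if_whole(value):
    """Turn 2019.0 into 2019; leave strings, NaN and true fractions untouched."""
    if isinstance(value, float) and value.is_integer():
        return int(value)
    return value


def clean_reference(raw):
    """Strip the trailing _N from each key, rename it, and fix float values."""
    cleaned = {}
    for key, value in raw.items():
        match = NUMBERED.match(key)
        base = match.group(1) if match else key
        cleaned[REF_KEYS.get(base, base)] = as_int_if_whole(value)
    return cleaned


raw_reference = {
    "pmid_title_1": "Current treatment for citrin deficiency...",
    "pmid_1": 31255436.0,
    "pmid_date_1": 2019.0,
    "pmid_journal_1": "Molecular genetics and metabolism",
}

reference = clean_reference(raw_reference)
```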

And finally, one very important filter. For now, let's only consider interventions that have INXIGHT IDs in the int_link_ columns. If it does not have a link, or if that link goes to redcap.radygenomiclab.com, let's just ignore that intervention. In a separate Github issue, we'll track a potential project to manually map these other interventions (mostly non-drug interventions) to UMLS and/or MESH, but that is out of scope of this issue.
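
A minimal sketch of that filter. No actual int_link_ value is shown in this thread, so the URL shape used below (an ID as the last path segment of a non-redcap link) is an assumption, as is the helper name `inxight_id`.

```python
from urllib.parse import urlparse

SKIPPED_HOSTS = {"redcap.radygenomiclab.com"}


def inxight_id(link):
    """Return the last path segment of an int_link_* value, or None to skip it."""
    if not isinstance(link, str) or not link.strip():
        return None  # blank cell (or NaN from a spreadsheet reader)
    parsed = urlparse(link.strip())
    if parsed.hostname is None or parsed.hostname in SKIPPED_HOSTS:
        return None
    segment = parsed.path.rstrip("/").rsplit("/", 1)[-1]
    return segment or None


# Assumed link shape; the real int_link_ values may look different.
drug_link = "https://drugs.ncats.io/drug/8558G7RUTR"
redcap_link = "https://redcap.radygenomiclab.com/surveys/?s=example"

kept_id = inxight_id(drug_link)
skipped = inxight_id(redcap_link)
```

Interventions for which this returns None would simply be left out of the output for now.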

mnarayan1 commented 2 years ago

Updated Parser: https://github.com/mnarayan1/GTRx/blob/main/parser.py Sample Output File: https://github.com/mnarayan1/GTRx/blob/main/gtrx_data.json

Here is a sample record:

{
    '_id': 'OMIM:250250-MZ1IW7Q79D', 
    'subject': {
        'condition_name': 'CARTILAGE-HAIR HYPOPLASIA; CHH', 
        'freq_per_birth': 'Cartilage-hair hypoplasia occurs most often...', 
        'pattern_of_inheritance': 'Autosomal recessive', 
        'clinical_summary': ["Cartilage-hair hypoplasia is a disorder of bone growth..."], 
        'alternate_names': ['ANAUXETIC DYSPLASIA 1; ANXD1', 'CARTILAGE-HAIR HYPOPLASIA; CHH', 'Omenn syndrome', 'METAPHYSEAL DYSPLASIA WITHOUT HYPOTRICHOSIS; MDWH'], 
        'omim': '250250'
    }, 
    'predicate': 'treated_by', 
    'object': {
        'intervention': [{'description': 'Valaciclovir', 'inxight': 'MZ1IW7Q79D', 'int_class': 'medicine'}], 
        'level2_group': '18', 
        'timeframe': 'Hours', 
        'age_use': 'Neonate,Infant,Child', 
        'contra': 'No', 
        'qualscale_reclass': 'Authoritative published clinical practice guideline', 
        'rev1_eff_reclass': 'Effective / Ameliorative'
    }, 
    'references': [
        {
            'title': 'A Systematic Review on Predisposition...', 
            'pmid': 31057537, 
            'date': 2019, 
            'journal': 'Frontiers in immunology'
        },
        ...
    ]
}

andrewsu commented 2 years ago

This looks pretty great to me. I think it's safe to hand over to @erikyao for API creation!

erikyao commented 2 years ago

Hi @andrewsu and @mnarayan1, could you please help determine the "release number" of the data file, GTRx_Joined_Data2-1-2022.xlsx? (like 02-01-2022?) Thanks!

andrewsu commented 2 years ago

I think you can put "2022-02-01" for the version number. Thanks!

erikyao commented 2 years ago

@andrewsu thank you! I've forked the plugin repo to https://github.com/biothings/GTRx. Will deploy soon.

erikyao commented 2 years ago

Hi @mnarayan1, I got the following error when uploading the documents:

{
  "writeErrors": [
    {
      "index": 6,
      "code": 11000,
      "keyPattern": {
        "_id": 1
      },
      "keyValue": {
        "_id": "OMIM:615550"
      },
      "errmsg": "E11000 duplicate key error collection: pending_src.GTRx_temp_swUhrMVk index: _id_ dup key: { _id: 'OMIM:615550' }",
      "op": {
        "_id": "OMIM:615550",
        "subject": {
          "condition_name": "DIAMOND-BLACKFAN ANEMIA 12",
          "pattern_of_inheritance": "Autosomal dominant",
          "clinical_summary": [...],
          "alternate_names": [
            "DIAMOND-BLACKFAN ANEMIA 12"
          ],
          "omim": "615550"
        },
        "predicate": "treated_by",
        "object": {
          "add_int_description": "corticosteroids",
          "add_int_detail": "Corticosteroids can initially improve the red blood count "
        },
        "references": [...]
      }
    }
  ],
  "writeConcernErrors": [],
  "nInserted": 6,
  "nUpserted": 0,
  "nMatched": 0,
  "nModified": 0,
  "nRemoved": 0,
  "upserted": []
}

Looks like we have multiple documents with _id = "OMIM:615550". Could you please take a look at the data and parser? Thanks!
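
One way to catch this before uploading is to count the _id values in the parser output. This is a small sketch; `duplicate_ids` is a hypothetical helper and the documents below are trimmed to their _id field.

```python
from collections import Counter


def duplicate_ids(docs):
    """Return a mapping of each repeated _id to the number of times it occurs."""
    counts = Counter(doc["_id"] for doc in docs)
    return {doc_id: n for doc_id, n in counts.items() if n > 1}


docs = [
    {"_id": "OMIM:615550"},
    {"_id": "OMIM:615550"},
    {"_id": "OMIM:250250-MZ1IW7Q79D"},
]

repeated = duplicate_ids(docs)
```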

mnarayan1 commented 2 years ago

There seem to be multiple records with repeated ids, for two reasons:

- Additional interventions don't have an INXIGHT to be added to the _id, so their ids are not unique.
- Some conditions have the same OMIM, but a different number in front of their record_id. For example, there are three different records with an OMIM of 603554 (Omenn Syndrome), but each has a unique numeric prefix that differentiates it (9832-OMIM:603554, 17642-OMIM:603554, 9831-OMIM:603554).

@andrewsu do you know where to find the INXIGHT for the additional interventions, or is there a specific way you want to deal with them? Also, are you okay with keeping the numeric prefix in front of the _id? An _id would look like 9832-OMIM:603554 instead of OMIM:603554.

andrewsu commented 2 years ago

you can skip any records without inxight IDs
hmm, let's modify the plan for _id to be of the format [record_id]-[INXIGHT] (instead of [OMIM]-[INXIGHT])
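
A minimal sketch of the revised _id rule, where a record without an INXIGHT ID yields None and is skipped. The helper name `make_id` is hypothetical, and the INXIGHT value used for the Omenn records is a placeholder, not real data.

```python
def make_id(record_id, inxight):
    """Build '[record_id]-[INXIGHT]', or return None when there is no INXIGHT ID."""
    if not inxight:
        return None
    return f"{record_id}-{inxight}"


# The three Omenn syndrome records share an OMIM but not a record_id,
# so their ids stay distinct even for the same intervention.
placeholder_inxight = "XXXXXXXXXX"  # hypothetical value for illustration only
omenn_ids = {
    make_id(record_id, placeholder_inxight)
    for record_id in ("9832-OMIM:603554", "17642-OMIM:603554", "9831-OMIM:603554")
}

example_id = make_id("10983-OMIM:605814", "8558G7RUTR")
skipped = make_id("10306-OMIM:615550", None)
```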

erikyao commented 2 years ago

Hi @mnarayan1, please let me know when you fix the problem. Thanks!

mnarayan1 commented 2 years ago

The issue should be fixed, let me know if there are still problems.

erikyao commented 2 years ago

API published: https://biothings.ncats.io/gtrx

About the plugin:

Original Repo: https://github.com/mnarayan1/GTRx
Forked Repo: https://github.com/biothings/GTRx

colleenXu commented 2 years ago

Next step is writing a SmartAPI yaml for this API with x-bte annotation...

andrewsu commented 1 year ago

@mnarayan1 created a PR for the SmartAPI annotation at https://github.com/NCATS-Tangerine/translator-api-registry/pull/109, assigning @colleenXu to review

colleenXu commented 1 year ago

Question: do we need to provide object.level2_group or object.priority_class to users?

Are these fields interesting/useful for them?
Why is the value a string and not an int for level2_group?
Do the values map to keywords/phrases that we could provide instead?
- I notice that the GTRx website for "susceptibility to atypical hemolytic uremic syndrome 4" marks the intervention Eculizumab as "Unsuitable for Neonates under 29 days old or Infants under 24 months old", and I wonder if this phrase corresponds to the level2_group ("5") or priority_class (1) that I see in the BioThings record.

colleenXu commented 1 year ago

Note: going to move forward and merge the SmartAPI PR, then do the next steps of incorporating it into BTE. EDIT: Registered here and PR made to connect to BTE

However, I think the questions/notes on INXIGHT IDs and the questions in the above post should still be addressed by @andrewsu or @mnarayan1 ...

tokebe commented 1 year ago

This API has been deployed to Prod, should the issue be closed?

colleenXu commented 1 year ago

Yep, closing it now

biothings / pending.api

new API for GTRx #62
