Additional analysis of Record_ID 1390-OMIM:601005 (https://gtrx.radygenomiclab.com/?q=24&i=true NOTE: this URL may not be stable)
- use_group_* is "retain"
- add_int_*
- level2_group* and int_description_*. The numbering of int_description_* matches int_class_*, int_link_*, age_use_int*, and likely others.
- (level2_group4), one level2_group* corresponds to two int_description_*s ([int_description_7],[int_description_8] (Atenolol/Nadolol)), which represents a single combination intervention comprised of two individual interventions.
Based on my understanding now, I think we have enough information to parse the database dump into an array of "association-style" objects for subsequent API creation...
assigned to @mnarayan1
Parser: https://github.com/mnarayan1/GTRx/blob/main/parser.py Output file: https://github.com/mnarayan1/GTRx/blob/main/gtrx_data.json Sample record:
{
  "_id": "10983-OMIM:605814",
  "condition": {
    "condition_name": "CITRULLINEMIA, TYPE II, NEONATAL-ONSET",
    "freq_per_birth": "Type I citrullinemia is the most common form of the disorder, affecting about 1 in 57,000 people worldwide...",
    "pattern_of_inheritance": "Autosomal recessive",
    "alternate_names": ["CITRULLINEMIA, TYPE II, NEONATAL-ONSET"],
    "clinical_summary": ["Citrullinemia is an inherited disorder that causes ammonia and other toxic substances to accumulate in the blood..."]
  },
  "gene information": {
    "db_hgnc_gene_id": 10983.0,
    "db_hgnc_gene_symbol": "SLC25A13"
  },
  "interventions": [
    {
      "int_description_3": "Supplement diet with fat-soluble vitamins",
      "timeframe_int3": "Days or Weeks,Years",
      "age_use_int3": "Neonate,Infant",
      "contra_int3": "No",
      "qualscale_reclass_drug3": "Authoritative published clinical practice guideline",
      "rev1_eff_reclass_drug3": "Still in Trials / Unproven"
    },
    ...
  ],
  "references": [
    {
      "pmid_title_1": "Current treatment for citrin deficiency during NICCD and adaptation/compensation stages: Strategy to prevent CTLN2.",
      "pmid_1": 31255436.0,
      "pmid_date_1": 2019.0,
      "pmid_journal_1": "Molecular genetics and metabolism"
    },
    ...
  ]
}
There is a filtering process for the interventions that needs to be added. Let's take the example above for 10983-OMIM:605814 (which corresponds to this web view). The interventions section in the JSON currently has 12 elements, which is reasonable given that there are 12 non-blank columns starting with int_description_. However, the starting point for parsing this file should actually be the set of columns starting with use_group_.
col name | value |
---|---|
use_group_1 | Delete |
use_group_10 | Retain |
use_group_11 | Retain,Add note about group |
use_group_2 | Delete |
use_group_3 | Retain,Add note about group |
use_group_4 | Delete |
use_group_5 | Retain,Add note about group |
use_group_6 | Retain |
use_group_7 | Retain |
use_group_8 | Delete |
use_group_9 | Delete |
For now, we only want to capture the six groups that start with "Retain", and those are the ones that are shown on the corresponding webpage. (There is actually a seventh intervention shown on the webpage, but let's ignore that for a moment.)
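As a rough sketch of that first filtering step, the snippet below picks out the group numbers whose use_group_ value starts with "Retain". It assumes one record is available as a plain dict of column name to value; the function name retained_groups is hypothetical and not taken from parser.py.

```python
def retained_groups(row):
    """Return the sorted group numbers whose use_group_N value starts with 'Retain'."""
    prefix = "use_group_"
    kept = []
    for col, value in row.items():
        suffix = col[len(prefix):]
        # Skip columns that are not use_group_<number>.
        if not (col.startswith(prefix) and suffix.isdigit()):
            continue
        # Blank cells may arrive as None or NaN, so only strings are considered.
        if isinstance(value, str) and value.strip().startswith("Retain"):
            kept.append(int(suffix))
    return sorted(kept)


# Values copied from the use_group_ table above.
row = {
    "use_group_1": "Delete",
    "use_group_10": "Retain",
    "use_group_11": "Retain,Add note about group",
    "use_group_2": "Delete",
    "use_group_3": "Retain,Add note about group",
    "use_group_4": "Delete",
    "use_group_5": "Retain,Add note about group",
    "use_group_6": "Retain",
    "use_group_7": "Retain",
    "use_group_8": "Delete",
    "use_group_9": "Delete",
}
print(retained_groups(row))  # [3, 5, 6, 7, 10, 11]
```

Matching on the "Retain" prefix rather than on equality keeps values such as "Retain,Add note about group".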
The 11 groups above can be mapped to the 12 interventions using these columns:
col name | value |
---|---|
level2_group10 | [int_description_9] |
level2_group11 | [int_description_12] |
level2_group3 | [int_description_6] |
level2_group5 | [int_description_1] |
level2_group6 | [int_description_2] |
level2_group7 | [int_description_3] |
... and those in turn can be mapped to human-readable names using these columns:
col name | value |
---|---|
int_description_9 | Medium-chain triglycerides |
int_description_12 | Orthotopic liver transplantation (OLT) |
int_description_6 | Pyruvic acid |
int_description_1 | Lactose-free therapeutic formulas |
int_description_2 | Lipid and protein-rich low-carbohydrate diet |
int_description_3 | Supplement diet with fat-soluble vitamins |
So somewhat confusingly, there are two sets of numberings -- one for interventions (which run from 1-12) and one for intervention groups (which run from 1-11).
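One possible way to follow that two-step mapping (group number to intervention number to name) is sketched below. It assumes each level2_group value is a string of bracketed references such as "[int_description_6]", possibly comma-separated; interventions_for_group is a hypothetical helper, not the actual parser code.

```python
import re

# Matches references such as "[int_description_6]" and captures the number.
INT_REF = re.compile(r"\[int_description_(\d+)\]")


def interventions_for_group(row, group_number):
    """Return the intervention names referenced by level2_group<group_number>."""
    raw = row.get(f"level2_group{group_number}")
    if not isinstance(raw, str):
        return []
    names = []
    for number in INT_REF.findall(raw):
        name = row.get(f"int_description_{number}")
        if isinstance(name, str) and name.strip():
            names.append(name.strip())
    return names


# Group 3 of 10983-OMIM:605814, from the tables above.
single = {"level2_group3": "[int_description_6]", "int_description_6": "Pyruvic acid"}
print(interventions_for_group(single, 3))  # ['Pyruvic acid']

# Group 4 of 1390-OMIM:601005, where one group references two interventions.
combo = {
    "level2_group4": "[int_description_7],[int_description_8]",
    "int_description_7": "Atenolol",
    "int_description_8": "Nadolol",
}
print(interventions_for_group(combo, 4))  # ['Atenolol', 'Nadolol']
```

Returning a list even for the one-to-one case means a combination group needs no special handling downstream.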
Columns relating to interventions:
- int_class_
- int_description_
- int_link_
Columns relating to intervention groups:
- priority_class_drug
- qualscale_reclass_drug
- rev1_eff_reclass_drug
- timeframe_int
There are two added complications that aren't covered above.
First, there is often, but not always, a one-to-one mapping between interventions and intervention groups. For example, in 1390-OMIM:601005, both Atenolol (int_description_7) and Nadolol (int_description_8) are gathered into a single intervention group (level2_group4). You can see that reflected on the web page for this disease.
Second, there is another set of columns for "additional interventions". An example is in 10306-OMIM:615550. The corresponding web page shows 4 interventions (actually intervention groups), but there are only two Retain entries in the use_group_ columns. The two additional interventions are found in these columns:
col name | value |
---|---|
add_int_description_1 | corticosteroids |
add_int_description_2 | Deferoxamine,Deferasirox |
add_int_detail_1 | Corticosteroids can initially improve the red blood count |
add_int_detail_2 | This group of medications used to treat iron overload from chronic transfusion. Deferiprone is not recommended in the treatment of iron overload in individuals with DBA [PMID 18671700] because its side effects include neutropenia |
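A minimal sketch of collecting those additional interventions follows, pairing each add_int_description_N with its add_int_detail_N. The output key names are an assumption for illustration, as is the helper name additional_interventions.

```python
def additional_interventions(row):
    """Pair each non-blank add_int_description_N with its add_int_detail_N."""
    prefix = "add_int_description_"
    found = []
    for col, value in row.items():
        suffix = col[len(prefix):]
        if not (col.startswith(prefix) and suffix.isdigit()):
            continue
        if not isinstance(value, str) or not value.strip():
            continue  # blank cell (None, NaN or empty string)
        detail = row.get(f"add_int_detail_{suffix}")
        found.append((int(suffix), {
            "add_int_description": value.strip(),
            "add_int_detail": detail.strip() if isinstance(detail, str) else "",
        }))
    # Order by the numeric suffix so that 10 sorts after 2.
    return [item for _, item in sorted(found, key=lambda pair: pair[0])]


# Values taken from the table above (second detail shortened to its first sentence).
row = {
    "add_int_description_2": "Deferoxamine,Deferasirox",
    "add_int_description_1": "corticosteroids",
    "add_int_detail_1": "Corticosteroids can initially improve the red blood count",
    "add_int_detail_2": "This group of medications used to treat iron overload from chronic transfusion.",
}
for item in additional_interventions(row):
    print(item["add_int_description"])
```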
After discussing with @colleenXu, we agree that we need to break this down to an "association style" API, where each record represents the association between a single disease and a single intervention. Here is a new sample record:
{
  "_id": "OMIM:605814-8558G7RUTR",
  "subject": {
    "condition_name": "CITRULLINEMIA, TYPE II, NEONATAL-ONSET",
    "freq_per_birth": "Type I citrullinemia is the most common form of the disorder, affecting about 1 in 57,000 people worldwide...",
    "pattern_of_inheritance": "Autosomal recessive",
    "alternate_names": ["CITRULLINEMIA, TYPE II, NEONATAL-ONSET"],
    "clinical_summary": ["Citrullinemia is an inherited disorder that causes ammonia and other toxic substances to accumulate in the blood..."],
    "omim": 605814
  },
  "object": {
    "description": "pyruvic acid",
    "inxight": "8558G7RUTR",
    "int_class": "medicine",
    "level2_group": 3,
    "priority_class": 2,
    "timeframe": "Days or Weeks,Years",
    "age_use": "Neonate,Infant",
    "contra": "No",
    "qualscale_reclass": "Authoritative published clinical practice guideline",
    "rev1_eff_reclass": "Still in Trials / Unproven"
  },
  "predicate": "treated_by",
  "references": [
    {
      "title": "Current treatment for citrin deficiency during NICCD and adaptation/compensation stages: Strategy to prevent CTLN2.",
      "pmid": 31255436,
      "date": 2019,
      "journal": "Molecular genetics and metabolism"
    },
    ...
  ]
}
A few things to note:
- predicate key whose value will always be "treated_by"
- object.inxight, parsed from the corresponding int_link_ column
- _id to be of the format [OMIM]-[INXIGHT]
- subject.omim
- (2019.0 -> 2019)
- (pmid_1 -> pmid)
- object.level2_group key based on the mapping of intervention group to intervention
- references section for every intervention-drug pair
And finally, one very important filter. For now, let's only consider interventions that have INXIGHT IDs in the int_link_ columns. If it does not have a link, or if that link goes to redcap.radygenomiclab.com, let's just ignore that intervention. In a separate GitHub issue, we'll track a potential project to manually map these other interventions (mostly non-drug interventions) to UMLS and/or MeSH, but that is out of scope for this issue.
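The INXIGHT filter and the reference clean-up from the notes above could look roughly like the sketch below. The thread does not show what an int_link_ value looks like, so this assumes the link ends in a 10-character ID (for example https://drugs.ncats.io/drug/8558G7RUTR); both helper names are hypothetical.

```python
import re

# Assumption: an INXIGHT ID is the 10-character alphanumeric last path segment.
INXIGHT_ID = re.compile(r"^[A-Z0-9]{10}$")


def extract_inxight(link):
    """Return the INXIGHT ID from an int_link_ value, or None if it should be ignored."""
    if not isinstance(link, str) or not link.strip():
        return None  # no link at all
    if "redcap.radygenomiclab.com" in link:
        return None  # redcap links are out of scope for now
    last_segment = link.strip().rstrip("/").rsplit("/", 1)[-1]
    return last_segment if INXIGHT_ID.match(last_segment) else None


def normalize_reference(raw):
    """Rename pmid_title_1 -> title, pmid_1 -> pmid, and turn 2019.0 into 2019."""
    clean = {}
    for key, value in raw.items():
        name = re.sub(r"_\d+$", "", key)   # drop the trailing reference number
        if name.startswith("pmid_"):
            name = name[len("pmid_"):]     # pmid_title -> title; plain "pmid" is kept
        if isinstance(value, float) and value.is_integer():
            value = int(value)
        clean[name] = value
    return clean


print(extract_inxight("https://drugs.ncats.io/drug/8558G7RUTR"))  # 8558G7RUTR
print(normalize_reference({
    "pmid_title_1": "Current treatment for citrin deficiency during NICCD and adaptation/compensation stages: Strategy to prevent CTLN2.",
    "pmid_1": 31255436.0,
    "pmid_date_1": 2019.0,
    "pmid_journal_1": "Molecular genetics and metabolism",
}))
```

If the real links use a different shape, only the last-segment logic in extract_inxight would need to change.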
Updated Parser: https://github.com/mnarayan1/GTRx/blob/main/parser.py Sample Output File: https://github.com/mnarayan1/GTRx/blob/main/gtrx_data.json
Here is a sample record:
{
  '_id': 'OMIM:250250-MZ1IW7Q79D',
  'subject': {
    'condition_name': 'CARTILAGE-HAIR HYPOPLASIA; CHH',
    'freq_per_birth': 'Cartilage-hair hypoplasia occurs most often...',
    'pattern_of_inheritance': 'Autosomal recessive',
    'clinical_summary': ["Cartilage-hair hypoplasia is a disorder of bone growth..."],
    'alternate_names': ['ANAUXETIC DYSPLASIA 1; ANXD1', 'CARTILAGE-HAIR HYPOPLASIA; CHH', 'Omenn syndrome', 'METAPHYSEAL DYSPLASIA WITHOUT HYPOTRICHOSIS; MDWH'],
    'omim': '250250'
  },
  'predicate': 'treated_by',
  'object': {
    'intervention': [{'description': 'Valaciclovir', 'inxight': 'MZ1IW7Q79D', 'int_class': 'medicine'}],
    'level2_group': '18',
    'timeframe': 'Hours',
    'age_use': 'Neonate,Infant,Child',
    'contra': 'No',
    'qualscale_reclass': 'Authoritative published clinical practice guideline',
    'rev1_eff_reclass': 'Effective / Ameliorative'
  },
  'references': [
    {
      'title': 'A Systematic Review on Predisposition...',
      'pmid': 31057537,
      'date': 2019,
      'journal': 'Frontiers in immunology'
    },
    ...
  ]
}
This looks pretty great to me. I think it's safe to hand over to @erikyao for API creation!
Hi @andrewsu and @mnarayan1, could you please help determine the "release number" of the data file, GTRx_Joined_Data2-1-2022.xlsx? (like 02-01-2022?) Thanks!
I think you can put "2022-02-01" for the version number. Thanks!
@andrewsu thank you! I've forked the plugin repo to https://github.com/biothings/GTRx. Will deploy soon.
Hi @mnarayan1 , I got the following error when uploading the documents:
{
  "writeErrors": [
    {
      "index": 6,
      "code": 11000,
      "keyPattern": {
        "_id": 1
      },
      "keyValue": {
        "_id": "OMIM:615550"
      },
      "errmsg": "E11000 duplicate key error collection: pending_src.GTRx_temp_swUhrMVk index: _id_ dup key: { _id: 'OMIM:615550' }",
      "op": {
        "_id": "OMIM:615550",
        "subject": {
          "condition_name": "DIAMOND-BLACKFAN ANEMIA 12",
          "pattern_of_inheritance": "Autosomal dominant",
          "clinical_summary": [...],
          "alternate_names": [
            "DIAMOND-BLACKFAN ANEMIA 12"
          ],
          "omim": "615550"
        },
        "predicate": "treated_by",
        "object": {
          "add_int_description": "corticosteroids",
          "add_int_detail": "Corticosteroids can initially improve the red blood count "
        },
        "references": [...]
      }
    }
  ],
  "writeConcernErrors": [],
  "nInserted": 6,
  "nUpserted": 0,
  "nMatched": 0,
  "nModified": 0,
  "nRemoved": 0,
  "upserted": []
}
Looks like we have multiple documents with _id = "OMIM:615550". Could you please take a look at the data and parser? Thanks!
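A quick pre-upload check along these lines might surface such collisions before MongoDB raises E11000. This is only a sketch, assuming the parser output is a list of dicts that each carry an _id key; duplicate_ids is a hypothetical name.

```python
from collections import Counter


def duplicate_ids(docs):
    """Return {_id: count} for every _id that appears more than once."""
    counts = Counter(doc["_id"] for doc in docs)
    return {doc_id: n for doc_id, n in counts.items() if n > 1}


# Illustrative documents; only the _id field matters for this check.
docs = [
    {"_id": "OMIM:615550"},
    {"_id": "OMIM:615550"},
    {"_id": "OMIM:250250-MZ1IW7Q79D"},
]
print(duplicate_ids(docs))  # {'OMIM:615550': 2}
```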
There seem to be multiple records that have repeated ids:
- INXIGHT to be added to the _id, so their ids are not unique.
- OMIM, but different digits in front of their record_id. For example, there are three different records with an OMIM of 603554 (Omenn Syndrome), but they each have a unique numeric prefix that differentiates them (9832-OMIM:603554, 17642-OMIM:603554, 9831-OMIM:603554).
@andrewsu do you know where to find the INXIGHT for the additional interventions, or is there a specific way you want to deal with them? Also, are you okay with keeping the numeric prefix in front of the _id? An _id would look like 9832-OMIM:603554 instead of OMIM:603554.
- _id to be of the format [record_id]-[INXIGHT] (instead of [OMIM]-[INXIGHT])
Hi @mnarayan1, please let me know when you fix the problem. Thanks!
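To illustrate why the [record_id]-[INXIGHT] format avoids the collision, here is a small sketch using the three Omenn syndrome record IDs mentioned above. The INXIGHT value is reused from the earlier sample record purely for illustration, and make_id is a hypothetical helper.

```python
def make_id(record_id, inxight):
    """Build an _id of the form [record_id]-[INXIGHT]."""
    return f"{record_id}-{inxight}"


record_ids = ["9832-OMIM:603554", "17642-OMIM:603554", "9831-OMIM:603554"]
inxight = "MZ1IW7Q79D"  # illustrative only, not the real intervention for these records

# Old scheme: drop the numeric prefix, so all three records collapse to one _id.
old_ids = {make_id(rid.split("-", 1)[1], inxight) for rid in record_ids}
# New scheme: keep the full record_id, so every record stays distinct.
new_ids = {make_id(rid, inxight) for rid in record_ids}

print(len(old_ids), len(new_ids))  # 1 3
```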
The issue should be fixed, let me know if there are still problems.
API published: https://biothings.ncats.io/gtrx
About the plugin:
Next step is writing a SmartAPI yaml for this API with x-bte annotation...
@mnarayan1 created a PR for the SmartAPI annotation at https://github.com/NCATS-Tangerine/translator-api-registry/pull/109, assigning @colleenXu to review
Question: do we need to provide object.level2_group or object.priority_class to users?
- level2_group?
Note: going to move forward and merge the SmartAPI PR, then do the next steps of incorporating it into BTE. EDIT: Registered here and PR made to connect to BTE
However, I think the questions/notes on INXIGHT IDs and the questions in the above post should still be addressed by @andrewsu or @mnarayan1 ...
This API has been deployed to Prod, should the issue be closed?
Yep, closing it now
Our partners at RCIGM have created GTRx (http://gtrx.rbsapp.net/), which is "a virtual, automated system for acute management guidance for seriously ill newborns, infants and children with newly diagnosed genetic diseases". GTRx links rare diseases (and genes underlying those rare diseases) to therapies (mostly drugs, but also dietary adjustments, surgery, etc.).
GTRx is currently available only as a website, but they have provided a database dump to me. The dump is a very sparse CSV with ~457 rows/records (one for each unique disease/gene pair) and ~4600 columns (with all interventions, provenance, evidence, etc. captured in columns), so it will take a little work to transform into JSON documents. It would be very valuable to create a new API for this resource.