Closed DanPaseltiner closed 5 months ago
https://ecr.aimsplatform.org/ehr-implementers/triggering/
Trying to suss out a few differences in JSON.
eRSD Version 1 eRSD VERSION 1 RELEASE DESCRIPTION eRSD Version 1 includes the queries necessary for producing CDA eICR R1.1 reports. For eCR Now App Release 2.X.
eRSD Version 2 eRSD VERSION 2 RELEASE DESCRIPTION eRSD Version 2 includes the queries necessary for producing CDA eICR R1.1 reports. It is for users of the eCR Now FHIR App release 3.X and greater.
based on the note in the article about medication, I'm 99% we want v2. However, v2 is "only" 19MB compared to 33.4MB for v1. Doing some further comparison work today.
Also slightly confusing, the API calls https://ersd.aimsplatform.org/api/ersd/v2specification?format=json&api-key=[YOURKEYHERE] produce svelter json (v1 is 19MB, v2 is 12.5MB). I would think for the purposes of updating to get the latest info when building a database, we'll eventually want to implement building off the API endpoint, so I'm gonna also try to do more digging to figure out where/how/why they are different.
For v2, we can drill down to
data['entry'][i]['resource']['resourceType']
where resourceType
== ValueSet
to get down to
However, within those 1350ish ValueSets, there's also these:
>>> for i in data['entry']:
... if i['resource']['resourceType'] == 'ValueSet':
... test = i['resource']
... if test.keys().__len__() > 13:
... print(test.keys())
...
dict_keys(['resourceType', 'id', 'meta', 'extension', 'url', 'identifier', 'version', 'name', 'title', 'status', 'date', 'description', 'useContext', 'purpose', 'compose', 'expansion'])
dict_keys(['resourceType', 'id', 'meta', 'extension', 'url', 'identifier', 'version', 'name', 'title', 'status', 'date', 'description', 'useContext', 'purpose', 'compose', 'expansion'])
dict_keys(['resourceType', 'id', 'meta', 'extension', 'url', 'identifier', 'version', 'name', 'title', 'status', 'date', 'description', 'useContext', 'purpose', 'compose', 'expansion'])
dict_keys(['resourceType', 'id', 'meta', 'extension', 'url', 'identifier', 'version', 'name', 'title', 'status', 'date', 'description', 'useContext', 'purpose', 'compose', 'expansion'])
dict_keys(['resourceType', 'id', 'meta', 'extension', 'url', 'identifier', 'version', 'name', 'title', 'status', 'date', 'description', 'useContext', 'purpose', 'compose', 'expansion'])
dict_keys(['resourceType', 'id', 'meta', 'extension', 'url', 'identifier', 'version', 'name', 'title', 'status', 'date', 'description', 'useContext', 'purpose', 'compose', 'expansion'])
>>> for i in data['entry']:
... if i['resource']['resourceType'] == 'ValueSet':
... test = i['resource']
... if test.keys().__len__() > 13:
... print(test['id'], test['purpose'])
...
dxtc Diagnoses or problems documented in a clinical record.
ostc Nominal laboratory result values documented in a clinical record.
lotc Laboratory test names used in orders documented in a clinical record.
lrtc Laboratory test names used in observations documented in a clinical record.
mrtc Prescription drugs names used in observations documented in a clinical record.
sdtc Suspected diagnoses or problems documented in a clinical record.
So, ultimately, I think the way we database this:
>>> print(data['entry'][129]['resource'])
{'resourceType': 'ValueSet', 'id': '2.16.840.1.113762.1.4.1146.182', 'meta': {'profile': ['http://hl7.org/fhir/us/ecr/StructureDefinition/ersd-valueset']}, 'extension': [{'url': 'http://hl7.org/fhir/StructureDefinition/valueset-author', 'valueContactDetail': {'name': 'CSTE Author'}}, {'url': 'http://hl7.org/fhir/StructureDefinition/valueset-steward', 'valueContactDetail': {'name': 'CSTE Steward'}}], 'url': 'http://cts.nlm.nih.gov/fhir/ValueSet/2.16.840.1.113762.1.4.1146.182', 'identifier': [{'system': 'urn:ietf:rfc:3986', 'value': 'urn:oid:2.16.840.1.113762.1.4.1146.182'}], 'name': 'CoccidioidomycosisDisordersICD10CM', 'title': 'Coccidioidomycosis (Disorders) (ICD10CM)', 'status': 'active', 'description': 'Coccidioidomycosis (Disorders) (ICD10CM)', 'useContext': [{'code': {'system': 'http://terminology.hl7.org/CodeSystem/usage-context-type', 'code': 'focus'}, 'valueCodeableConcept': {'coding': [{'system': 'http://snomed.info/sct', 'code': '60826002'}], 'text': 'Coccidioidomycosis (disorder)'}}], 'compose': {'include': [{'system': 'http://hl7.org/fhir/sid/icd-10-cm', 'version': '2024', 'concept': [{'code': 'B38.0', 'display': 'Acute pulmonary coccidioidomycosis'}, {'code': 'B38.1', 'display': 'Chronic pulmonary coccidioidomycosis'}, {'code': 'B38', 'display': 'Coccidioidomycosis'}, {'code': 'B38.4', 'display': 'Coccidioidomycosis meningitis'}, {'code': 'B38.9', 'display': 'Coccidioidomycosis, unspecified'}, {'code': 'B38.3', 'display': 'Cutaneous coccidioidomycosis'}, {'code': 'B38.7', 'display': 'Disseminated coccidioidomycosis'}, {'code': 'B38.8', 'display': 'Other forms of coccidioidomycosis'}, {'code': 'B38.89', 'display': 'Other forms of coccidioidomycosis'}, {'code': 'B38.81', 'display': 'Prostatic coccidioidomycosis'}, {'code': 'B38.2', 'display': 'Pulmonary coccidioidomycosis, unspecified'}]}]}, 'expansion': {'timestamp': '2024-04-02T12:19:29-04:00', 'contains': [{'system': 'http://hl7.org/fhir/sid/icd-10-cm', 'version': '2024', 'code': 'B38.0', 'display': 'Acute pulmonary coccidioidomycosis'}, {'system': 'http://hl7.org/fhir/sid/icd-10-cm', 'version': '2024', 'code': 'B38.1', 'display': 'Chronic pulmonary coccidioidomycosis'}, {'system': 'http://hl7.org/fhir/sid/icd-10-cm', 'version': '2024', 'code': 'B38', 'display': 'Coccidioidomycosis'}, {'system': 'http://hl7.org/fhir/sid/icd-10-cm', 'version': '2024', 'code': 'B38.4', 'display': 'Coccidioidomycosis meningitis'}, {'system': 'http://hl7.org/fhir/sid/icd-10-cm', 'version': '2024', 'code': 'B38.9', 'display': 'Coccidioidomycosis, unspecified'}, {'system': 'http://hl7.org/fhir/sid/icd-10-cm', 'version': '2024', 'code': 'B38.3', 'display': 'Cutaneous coccidioidomycosis'}, {'system': 'http://hl7.org/fhir/sid/icd-10-cm', 'version': '2024', 'code': 'B38.7', 'display': 'Disseminated coccidioidomycosis'}, {'system': 'http://hl7.org/fhir/sid/icd-10-cm', 'version': '2024', 'code': 'B38.8', 'display': 'Other forms of coccidioidomycosis'}, {'system': 'http://hl7.org/fhir/sid/icd-10-cm', 'version': '2024', 'code': 'B38.89', 'display': 'Other forms of coccidioidomycosis'}, {'system': 'http://hl7.org/fhir/sid/icd-10-cm', 'version': '2024', 'code': 'B38.81', 'display': 'Prostatic coccidioidomycosis'}, {'system': 'http://hl7.org/fhir/sid/icd-10-cm', 'version': '2024', 'code': 'B38.2', 'display': 'Pulmonary coccidioidomycosis, unspecified'}]}}
Per conversation with Dan, those conditions can probably just be one table with the value set type (diagnoses, lab, prescription, etc.) in a second table to be able to link through,
The primary benefit of this setup is that it will scale of if more value set types are added. We will need to test to confirm that the two-table format is performant .
Also, per conversation with Dan, v1 of the eRSD, which uses eCR Now App Release 2.X., is not really used by many hospitals, which use v1 of eCR.
v2 of the eRSD is more likely to be relevant, as hospitals that are upgrading are more likely to upgrade to v3 of eCR.
We will first start with this version (attached), but we will still need to do further comparison of the API downloaded version of the file, which is slightly larger (23.5 MB) versus this file's 19MB. The main things we care about to compare:
@DanPaseltiner , attached are the first pass at just creating a rough look at what the two-table structure could look like. I think the main thing to note is the size: value_set.csv is about 10.4MB and conditions.csv is 30.7MB.
conditions.csv can definitely become much smaller. With the way that both files are structured, they have a compose.include
and expansion.contains
sections that I think we both observed when looking last Friday had duplicates. Based on just a quick look, I would say about 50% of the codes in them are duplicates. The main difference between them, as far as I can tell, is just the description. For example, 772212002
appears twice, but in the compose.include
has a Rubella suspected (situation)
while the expansion.contains
is blank. That should be relatively easy to refine and collapse. Below is a quick guesstimate showing we get to a 50% in the size of the file:
>>> c
id name ... code display
0 2.16.840.1.113762.1.4.1146.6 DiphtheriaDisordersSNOMED ... 194945009 Acute myocarditis - diphtheritic (disorder)
1 2.16.840.1.113762.1.4.1146.6 DiphtheriaDisordersSNOMED ... 15682004 Anterior nasal diphtheria (disorder)
2 2.16.840.1.113762.1.4.1146.6 DiphtheriaDisordersSNOMED ... 1086051000119107 Cardiomyopathy due to diphtheria (disorder)
3 2.16.840.1.113762.1.4.1146.6 DiphtheriaDisordersSNOMED ... 7773002 Conjunctival diphtheria (disorder)
4 2.16.840.1.113762.1.4.1146.6 DiphtheriaDisordersSNOMED ... 18901009 Cutaneous diphtheria (disorder)
... ... ... ... ... ...
87941 2.16.840.1.113762.1.4.1146.2164 MumpsSuspectedDisordersSNOMED ... 772145009 NaN
87942 2.16.840.1.113762.1.4.1146.2165 PertussisSuspectedDisordersSNOMED ... 772146005 Pertussis suspected (situation)
87943 2.16.840.1.113762.1.4.1146.2165 PertussisSuspectedDisordersSNOMED ... 772146005 NaN
87944 2.16.840.1.113762.1.4.1146.1440 RubellaSuspectedDisordersSNOMED ... 772212002 Rubella suspected (situation)
87945 2.16.840.1.113762.1.4.1146.1440 RubellaSuspectedDisordersSNOMED ... 772212002 NaN
[87946 rows x 12 columns]
>>> c.nunique()
id 1346
name 1346
title 1346
status 1
description 1346
use_context_system 1
use_context_code 209
use_context_text 209
system 5
version 6
code 34358
display 34263
dtype: int64
>>> c.drop_duplicates(['id', 'code'])
id name ... code display
0 2.16.840.1.113762.1.4.1146.6 DiphtheriaDisordersSNOMED ... 194945009 Acute myocarditis - diphtheritic (disorder)
1 2.16.840.1.113762.1.4.1146.6 DiphtheriaDisordersSNOMED ... 15682004 Anterior nasal diphtheria (disorder)
2 2.16.840.1.113762.1.4.1146.6 DiphtheriaDisordersSNOMED ... 1086051000119107 Cardiomyopathy due to diphtheria (disorder)
3 2.16.840.1.113762.1.4.1146.6 DiphtheriaDisordersSNOMED ... 7773002 Conjunctival diphtheria (disorder)
4 2.16.840.1.113762.1.4.1146.6 DiphtheriaDisordersSNOMED ... 18901009 Cutaneous diphtheria (disorder)
... ... ... ... ... ...
87936 2.16.840.1.113762.1.4.1146.1441 SARSSuspectedDisordersSNOMED ... 12591000132100 Suspected severe acute respiratory syndrome (s...
87937 2.16.840.1.113762.1.4.1146.1441 SARSSuspectedDisordersSNOMED ... 12601000132105 Probable severe acute respiratory syndrome (si...
87940 2.16.840.1.113762.1.4.1146.2164 MumpsSuspectedDisordersSNOMED ... 772145009 Mumps suspected (situation)
87942 2.16.840.1.113762.1.4.1146.2165 PertussisSuspectedDisordersSNOMED ... 772146005 Pertussis suspected (situation)
87944 2.16.840.1.113762.1.4.1146.1440 RubellaSuspectedDisordersSNOMED ... 772212002 Rubella suspected (situation)
[43973 rows x 12 columns]
We can also definitely refine further based off what column names we want/need. Some are definitely redundant or could be compressed into a single json column in whatever schema we ultimately end up with. Curious if this is more or less right for what you think we'll need?
Also, to answer the question from standup, I did a quick aggregate and did not encounter any cases where a single condition had multiple code systems.
are we good to close this ticket? @DanPaseltiner
I copy/pasted the above comment to 1663 to continue that review/conversation: https://app.zenhub.com/workspaces/skylight-dibbs-viper-6480ba23b5a07644e0e46d23/issues/gh/cdcgov/phdi/1663
In order for us to successfully build an API on top of the eRSD values sets we need to make sure that we understand this dataset. Let's take some time to understand the data we are working with. Refer to the epic for links to the eRSD website and a copy of the dataset in XLSX format.