SPIKE: Investigate eRSD value sets to understand the structure of the data

DanPaseltiner commented 6 months ago

In order for us to successfully build an API on top of the eRSD values sets we need to make sure that we understand this dataset. Let's take some time to understand the data we are working with. Refer to the epic for links to the eRSD website and a copy of the dataset in XLSX format.

robertandremitchell commented 6 months ago

https://ecr.aimsplatform.org/ehr-implementers/triggering/

Trying to suss out a few differences in JSON.

eRSD Version 1 eRSD VERSION 1 RELEASE DESCRIPTION eRSD Version 1 includes the queries necessary for producing CDA eICR R1.1 reports. For eCR Now App Release 2.X.

eRSD Version 2 eRSD VERSION 2 RELEASE DESCRIPTION eRSD Version 2 includes the queries necessary for producing CDA eICR R1.1 reports. It is for users of the eCR Now FHIR App release 3.X and greater.

based on the note in the article about medication, I'm 99% we want v2. However, v2 is "only" 19MB compared to 33.4MB for v1. Doing some further comparison work today.

Also slightly confusing, the API calls https://ersd.aimsplatform.org/api/ersd/v2specification?format=json&api-key=[YOURKEYHERE] produce svelter json (v1 is 19MB, v2 is 12.5MB). I would think for the purposes of updating to get the latest info when building a database, we'll eventually want to implement building off the API endpoint, so I'm gonna also try to do more digging to figure out where/how/why they are different.

robertandremitchell commented 6 months ago

For v2, we can drill down to data['entry'][i]['resource']['resourceType'] where resourceType == ValueSet to get down to

However, within those 1350ish ValueSets, there's also these:

>>> for i in data['entry']:
...     if i['resource']['resourceType'] == 'ValueSet':
...             test = i['resource']
...             if test.keys().__len__() > 13:
...                     print(test.keys())
... 
dict_keys(['resourceType', 'id', 'meta', 'extension', 'url', 'identifier', 'version', 'name', 'title', 'status', 'date', 'description', 'useContext', 'purpose', 'compose', 'expansion'])
dict_keys(['resourceType', 'id', 'meta', 'extension', 'url', 'identifier', 'version', 'name', 'title', 'status', 'date', 'description', 'useContext', 'purpose', 'compose', 'expansion'])
dict_keys(['resourceType', 'id', 'meta', 'extension', 'url', 'identifier', 'version', 'name', 'title', 'status', 'date', 'description', 'useContext', 'purpose', 'compose', 'expansion'])
dict_keys(['resourceType', 'id', 'meta', 'extension', 'url', 'identifier', 'version', 'name', 'title', 'status', 'date', 'description', 'useContext', 'purpose', 'compose', 'expansion'])
dict_keys(['resourceType', 'id', 'meta', 'extension', 'url', 'identifier', 'version', 'name', 'title', 'status', 'date', 'description', 'useContext', 'purpose', 'compose', 'expansion'])
dict_keys(['resourceType', 'id', 'meta', 'extension', 'url', 'identifier', 'version', 'name', 'title', 'status', 'date', 'description', 'useContext', 'purpose', 'compose', 'expansion'])
>>> for i in data['entry']:
...     if i['resource']['resourceType'] == 'ValueSet':
...             test = i['resource']
...             if test.keys().__len__() > 13:
...                     print(test['id'], test['purpose'])
... 
dxtc Diagnoses or problems documented in a clinical record.
ostc Nominal laboratory result values documented in a clinical record.
lotc Laboratory test names used in orders documented in a clinical record.
lrtc Laboratory test names used in observations documented in a clinical record.
mrtc Prescription drugs names used in observations documented in a clinical record.
sdtc Suspected diagnoses or problems documented in a clinical record.

robertandremitchell commented 6 months ago

So, ultimately, I think the way we database this:

Take these 6 ValueSets. They're comprised of the all the codes relevant to their particular data type.
Build an initial table for each of these 6 containing the condition code, type of code (SNOMED, ICD-10), etc. It may be okay to have that be on table with just the type of id referenced, but there are A LOT of codes, so it may makes sense to have one table per type?

We should then be able to use the links there from these tables to link to the condition by going something like this for each of the other codes and create a table based on the data we'd need to keep.

>>> print(data['entry'][129]['resource'])
{'resourceType': 'ValueSet', 'id': '2.16.840.1.113762.1.4.1146.182', 'meta': {'profile': ['http://hl7.org/fhir/us/ecr/StructureDefinition/ersd-valueset']}, 'extension': [{'url': 'http://hl7.org/fhir/StructureDefinition/valueset-author', 'valueContactDetail': {'name': 'CSTE Author'}}, {'url': 'http://hl7.org/fhir/StructureDefinition/valueset-steward', 'valueContactDetail': {'name': 'CSTE Steward'}}], 'url': 'http://cts.nlm.nih.gov/fhir/ValueSet/2.16.840.1.113762.1.4.1146.182', 'identifier': [{'system': 'urn:ietf:rfc:3986', 'value': 'urn:oid:2.16.840.1.113762.1.4.1146.182'}], 'name': 'CoccidioidomycosisDisordersICD10CM', 'title': 'Coccidioidomycosis (Disorders) (ICD10CM)', 'status': 'active', 'description': 'Coccidioidomycosis (Disorders) (ICD10CM)', 'useContext': [{'code': {'system': 'http://terminology.hl7.org/CodeSystem/usage-context-type', 'code': 'focus'}, 'valueCodeableConcept': {'coding': [{'system': 'http://snomed.info/sct', 'code': '60826002'}], 'text': 'Coccidioidomycosis (disorder)'}}], 'compose': {'include': [{'system': 'http://hl7.org/fhir/sid/icd-10-cm', 'version': '2024', 'concept': [{'code': 'B38.0', 'display': 'Acute pulmonary coccidioidomycosis'}, {'code': 'B38.1', 'display': 'Chronic pulmonary coccidioidomycosis'}, {'code': 'B38', 'display': 'Coccidioidomycosis'}, {'code': 'B38.4', 'display': 'Coccidioidomycosis meningitis'}, {'code': 'B38.9', 'display': 'Coccidioidomycosis, unspecified'}, {'code': 'B38.3', 'display': 'Cutaneous coccidioidomycosis'}, {'code': 'B38.7', 'display': 'Disseminated coccidioidomycosis'}, {'code': 'B38.8', 'display': 'Other forms of coccidioidomycosis'}, {'code': 'B38.89', 'display': 'Other forms of coccidioidomycosis'}, {'code': 'B38.81', 'display': 'Prostatic coccidioidomycosis'}, {'code': 'B38.2', 'display': 'Pulmonary coccidioidomycosis, unspecified'}]}]}, 'expansion': {'timestamp': '2024-04-02T12:19:29-04:00', 'contains': [{'system': 'http://hl7.org/fhir/sid/icd-10-cm', 'version': '2024', 'code': 'B38.0', 'display': 'Acute pulmonary coccidioidomycosis'}, {'system': 'http://hl7.org/fhir/sid/icd-10-cm', 'version': '2024', 'code': 'B38.1', 'display': 'Chronic pulmonary coccidioidomycosis'}, {'system': 'http://hl7.org/fhir/sid/icd-10-cm', 'version': '2024', 'code': 'B38', 'display': 'Coccidioidomycosis'}, {'system': 'http://hl7.org/fhir/sid/icd-10-cm', 'version': '2024', 'code': 'B38.4', 'display': 'Coccidioidomycosis meningitis'}, {'system': 'http://hl7.org/fhir/sid/icd-10-cm', 'version': '2024', 'code': 'B38.9', 'display': 'Coccidioidomycosis, unspecified'}, {'system': 'http://hl7.org/fhir/sid/icd-10-cm', 'version': '2024', 'code': 'B38.3', 'display': 'Cutaneous coccidioidomycosis'}, {'system': 'http://hl7.org/fhir/sid/icd-10-cm', 'version': '2024', 'code': 'B38.7', 'display': 'Disseminated coccidioidomycosis'}, {'system': 'http://hl7.org/fhir/sid/icd-10-cm', 'version': '2024', 'code': 'B38.8', 'display': 'Other forms of coccidioidomycosis'}, {'system': 'http://hl7.org/fhir/sid/icd-10-cm', 'version': '2024', 'code': 'B38.89', 'display': 'Other forms of coccidioidomycosis'}, {'system': 'http://hl7.org/fhir/sid/icd-10-cm', 'version': '2024', 'code': 'B38.81', 'display': 'Prostatic coccidioidomycosis'}, {'system': 'http://hl7.org/fhir/sid/icd-10-cm', 'version': '2024', 'code': 'B38.2', 'display': 'Pulmonary coccidioidomycosis, unspecified'}]}}

robertandremitchell commented 6 months ago

Per conversation with Dan, those conditions can probably just be one table with the value set type (diagnoses, lab, prescription, etc.) in a second table to be able to link through,

The primary benefit of this setup is that it will scale of if more value set types are added. We will need to test to confirm that the two-table format is performant .

Also, per conversation with Dan, v1 of the eRSD, which uses eCR Now App Release 2.X., is not really used by many hospitals, which use v1 of eCR.

v2 of the eRSD is more likely to be relevant, as hospitals that are upgrading are more likely to upgrade to v3 of eCR.

We will first start with this version (attached), but we will still need to do further comparison of the API downloaded version of the file, which is slightly larger (23.5 MB) versus this file's 19MB. The main things we care about to compare:

Are the number of conditions in the conditions table different between these two files?
Are there different value sets?
Are there differences in codes represented within the value sets?

robertandremitchell commented 6 months ago

@DanPaseltiner , attached are the first pass at just creating a rough look at what the two-table structure could look like. I think the main thing to note is the size: value_set.csv is about 10.4MB and conditions.csv is 30.7MB.

conditions.csv can definitely become much smaller. With the way that both files are structured, they have a compose.include and expansion.contains sections that I think we both observed when looking last Friday had duplicates. Based on just a quick look, I would say about 50% of the codes in them are duplicates. The main difference between them, as far as I can tell, is just the description. For example, 772212002 appears twice, but in the compose.include has a Rubella suspected (situation) while the expansion.contains is blank. That should be relatively easy to refine and collapse. Below is a quick guesstimate showing we get to a 50% in the size of the file:

>>> c
                                    id                               name  ...              code                                      display
0         2.16.840.1.113762.1.4.1146.6          DiphtheriaDisordersSNOMED  ...         194945009  Acute myocarditis - diphtheritic (disorder)
1         2.16.840.1.113762.1.4.1146.6          DiphtheriaDisordersSNOMED  ...          15682004         Anterior nasal diphtheria (disorder)
2         2.16.840.1.113762.1.4.1146.6          DiphtheriaDisordersSNOMED  ...  1086051000119107  Cardiomyopathy due to diphtheria (disorder)
3         2.16.840.1.113762.1.4.1146.6          DiphtheriaDisordersSNOMED  ...           7773002           Conjunctival diphtheria (disorder)
4         2.16.840.1.113762.1.4.1146.6          DiphtheriaDisordersSNOMED  ...          18901009              Cutaneous diphtheria (disorder)
...                                ...                                ...  ...               ...                                          ...
87941  2.16.840.1.113762.1.4.1146.2164      MumpsSuspectedDisordersSNOMED  ...         772145009                                          NaN
87942  2.16.840.1.113762.1.4.1146.2165  PertussisSuspectedDisordersSNOMED  ...         772146005              Pertussis suspected (situation)
87943  2.16.840.1.113762.1.4.1146.2165  PertussisSuspectedDisordersSNOMED  ...         772146005                                          NaN
87944  2.16.840.1.113762.1.4.1146.1440    RubellaSuspectedDisordersSNOMED  ...         772212002                Rubella suspected (situation)
87945  2.16.840.1.113762.1.4.1146.1440    RubellaSuspectedDisordersSNOMED  ...         772212002                                          NaN

[87946 rows x 12 columns]
>>> c.nunique()
id                     1346
name                   1346
title                  1346
status                    1
description            1346
use_context_system        1
use_context_code        209
use_context_text        209
system                    5
version                   6
code                  34358
display               34263
dtype: int64
>>> c.drop_duplicates(['id', 'code'])
                                    id                               name  ...              code                                            display
0         2.16.840.1.113762.1.4.1146.6          DiphtheriaDisordersSNOMED  ...         194945009        Acute myocarditis - diphtheritic (disorder)
1         2.16.840.1.113762.1.4.1146.6          DiphtheriaDisordersSNOMED  ...          15682004               Anterior nasal diphtheria (disorder)
2         2.16.840.1.113762.1.4.1146.6          DiphtheriaDisordersSNOMED  ...  1086051000119107        Cardiomyopathy due to diphtheria (disorder)
3         2.16.840.1.113762.1.4.1146.6          DiphtheriaDisordersSNOMED  ...           7773002                 Conjunctival diphtheria (disorder)
4         2.16.840.1.113762.1.4.1146.6          DiphtheriaDisordersSNOMED  ...          18901009                    Cutaneous diphtheria (disorder)
...                                ...                                ...  ...               ...                                                ...
87936  2.16.840.1.113762.1.4.1146.1441       SARSSuspectedDisordersSNOMED  ...    12591000132100  Suspected severe acute respiratory syndrome (s...
87937  2.16.840.1.113762.1.4.1146.1441       SARSSuspectedDisordersSNOMED  ...    12601000132105  Probable severe acute respiratory syndrome (si...
87940  2.16.840.1.113762.1.4.1146.2164      MumpsSuspectedDisordersSNOMED  ...         772145009                        Mumps suspected (situation)
87942  2.16.840.1.113762.1.4.1146.2165  PertussisSuspectedDisordersSNOMED  ...         772146005                    Pertussis suspected (situation)
87944  2.16.840.1.113762.1.4.1146.1440    RubellaSuspectedDisordersSNOMED  ...         772212002                      Rubella suspected (situation)

[43973 rows x 12 columns]

We can also definitely refine further based off what column names we want/need. Some are definitely redundant or could be compressed into a single json column in whatever schema we ultimately end up with. Curious if this is more or less right for what you think we'll need?

Also, to answer the question from standup, I did a quick aggregate and did not encounter any cases where a single condition had multiple code systems.

scratchwork.zip

robertandremitchell commented 6 months ago

are we good to close this ticket? @DanPaseltiner

I copy/pasted the above comment to 1663 to continue that review/conversation: https://app.zenhub.com/workspaces/skylight-dibbs-viper-6480ba23b5a07644e0e46d23/issues/gh/cdcgov/phdi/1663

CDCgov / phdi

SPIKE: Investigate eRSD value sets to understand the structure of the data #1661