Data generation - Githubissues

jb-adams commented 2 years ago

Use Synthea to create a cohort of 1000+ patients

Document the build and run commands and commit them to the repo
Inspect the synthetic dataset to ensure it contains enough asthma-related phenotypes/codes
Include a good asthma example dataset in the repo once we've found one
Output as both FHIR and CSV

Chen2x commented 2 years ago

Hello @susheel @PhilipvD, I have added the build and run instructions to the README for generating the population in both FHIR and CSV. Unfortunately, I got errors when trying to upload the generated population itself because the files are too big. I also have some questions about the results. I believe we discussed this morning that the coding should be in ICD-10, but the output uses SNOMED CT codes instead. I did some research, and as far as I could see Synthea does not support ICD-10. Would the mapping from SNOMED CT to ICD-10 come at a later step, or am I missing something? I have attached a Condition record for a generated patient with asthma below.

{"resourceType":"Condition","id":"17556412-0356-de55-eb68-2cc2bebcaf8c","meta":{"profile":["http://hl7.org/fhir/us/core/StructureDefinition/us-core-condition"]},"clinicalStatus":{"coding":[{"system":"http://terminology.hl7.org/CodeSystem/condition-clinical","code":"active"}]},"verificationStatus":{"coding":[{"system":"http://terminology.hl7.org/CodeSystem/condition-ver-status","code":"confirmed"}]},"category":[{"coding":[{"system":"http://terminology.hl7.org/CodeSystem/condition-category","code":"encounter-diagnosis","display":"Encounter Diagnosis"}]}],"code":{"coding":[{"system":"http://snomed.info/sct","code":"195967001","display":"Asthma"}],"text":"Asthma"},"subject":{"reference":"Patient/9e6e8477-46ce-c5e2-8e2f-43298c3fe684"},"encounter":{"reference":"Encounter/d588f220-d088-9e19-9e3f-4020d27136c6"},"onsetDateTime":"2016-06-21T20:17:55-04:00","recordedDate":"2016-06-21T20:17:55-04:00"}
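
If the mapping does end up being a later step, it could look roughly like the sketch below: a post-processing pass that appends an ICD-10 coding next to each mapped SNOMED CT coding. The one-entry `SNOMED_TO_ICD10` table is an illustrative placeholder, not a validated mapping; a real run would need a curated SNOMED CT to ICD-10 map. The function name `add_icd10_codings` is likewise hypothetical.

```python
import copy

SNOMED = "http://snomed.info/sct"
ICD10CM = "http://hl7.org/fhir/sid/icd-10-cm"

# Illustrative placeholder only: NOT a validated mapping table.
SNOMED_TO_ICD10 = {
    "195967001": ("J45.909", "Unspecified asthma, uncomplicated"),
}


def add_icd10_codings(condition, mapping=SNOMED_TO_ICD10):
    """Return a copy of a Condition with ICD-10 codings appended for mapped SNOMED codes."""
    result = copy.deepcopy(condition)  # leave the input record untouched
    codings = result.get("code", {}).get("coding", [])
    existing = {(c.get("system"), c.get("code")) for c in codings}
    for coding in list(codings):  # iterate over a snapshot while appending
        if coding.get("system") != SNOMED:
            continue
        target = mapping.get(coding.get("code"))
        if target and (ICD10CM, target[0]) not in existing:
            codings.append({"system": ICD10CM,
                            "code": target[0],
                            "display": target[1]})
            existing.add((ICD10CM, target[0]))
    return result
```

Keeping the original SNOMED CT coding and adding the ICD-10 coding alongside it means queries written against either code system can match the same record.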

jb-adams commented 2 years ago

@susheel @PhilipvD a couple more questions about the demo:

The Synthea run outputs a total of 22 .ndjson files, each containing a batch of records of a certain model. Here are the models: AllergyIntolerance, CarePlan, CareTeam, Claim, Condition, Device, DiagnosticReport, DocumentReference, Encounter, ExplanationOfBenefit, ImagingStudy, Immunization, Medication, MedicationAdministration, MedicationRequest, Observation, Patient, Procedure, Provenance, SupplyDelivery, hospitalInformation, practitionerInformation

1. Can we pull out a core set of models that are most relevant to the CQL demo (e.g. Patient, Condition, and others) rather than uploading the whole lot to HAPI? A dataset of 1000 Patients yields, for example, 100,000 DiagnosticReport records. Keeping the uploaded dataset small will simplify the demo and make it easier to rerun.

2. There is an additional way to pare down the uploaded dataset. The 1000-Patient dataset contains ~20 asthma patients. Since these 20 are the focus of the CQL query, can we remove many of the "patients" we're not interested in querying? E.g. could we upload all 20 asthma patients and only 80 non-asthma patients, rather than 20 vs 980?
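
Both ways of paring down the upload can be sketched together: keep only a core set of resource types, and keep only resources belonging to the asthma patients plus a random sample of non-asthma patients. This is a minimal sketch under stated assumptions: `CORE_TYPES` is a guess at the core set, the helper names are hypothetical, asthma is identified by the SNOMED CT code 195967001 from the record above, and resources are assumed to reference patients as `Patient/<id>` via `subject` or `patient`.

```python
import random

ASTHMA_SNOMED = "195967001"  # SNOMED CT code seen in the generated Condition
CORE_TYPES = {"Patient", "Condition", "Encounter"}  # assumed core set; adjust as needed


def patient_ref(resource):
    """Return 'Patient/<id>' for a resource, or None if it is not patient-linked."""
    if resource.get("resourceType") == "Patient":
        return "Patient/" + resource["id"]
    for field in ("subject", "patient"):
        ref = resource.get(field, {}).get("reference")
        if ref:
            return ref
    return None


def choose_patients(resources, n_controls=80, seed=0):
    """Keep every asthma patient plus a reproducible sample of non-asthma patients."""
    all_patients, asthma = set(), set()
    for r in resources:
        if r.get("resourceType") == "Patient":
            all_patients.add("Patient/" + r["id"])
        elif r.get("resourceType") == "Condition":
            codes = {c.get("code") for c in r.get("code", {}).get("coding", [])}
            if ASTHMA_SNOMED in codes:
                asthma.add(r["subject"]["reference"])
    controls = sorted(all_patients - asthma)  # sorted so the seed gives a stable sample
    rng = random.Random(seed)
    sampled = rng.sample(controls, min(n_controls, len(controls)))
    return asthma | set(sampled)


def subset(resources, keep_patients, core_types=CORE_TYPES):
    """Keep core-type resources that belong to one of the selected patients."""
    return [r for r in resources
            if r.get("resourceType") in core_types
            and patient_ref(r) in keep_patients]
```

Here `resources` is a list of already-parsed FHIR resources (dicts). Fixing the seed keeps the sampled non-asthma patients the same across reruns, which should make the demo easier to reproduce.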

ianfore commented 2 years ago

Picking up a couple of things...

First, on Jeremy's #1 above: yes, it would be good to decide which Resources (what J is calling models) we want (or need) to work with for our use case. How did Synthea decide what to generate? Was that driven by our choice, e.g. some config file that controlled the generation?

Second, @Chen2x, did you solve the generation to both FHIR and CSV? Does the same set of Patients get saved to both? If so, I don't need to bother working out how to read the FHIR resources in order to load them into Data Connect; I can just use the CSVs. That's the point I think Susheel made at the kick-off. (I do still need to work out how to read the FHIR resources in order to query them, though.)
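
One way to check whether the same set of Patients landed in both outputs is to compare the patient identifiers on each side. The sketch below assumes the FHIR Patient records are newline-delimited JSON with an `id` field and that the patients CSV has an `Id` column; both the column name and the function name `compare_patient_ids` are assumptions to verify against the actual files.

```python
import csv
import io
import json


def compare_patient_ids(patient_ndjson_lines, patients_csv_file, id_column="Id"):
    """Return (ids only in FHIR, ids only in CSV); both empty means the sets match."""
    fhir_ids = {json.loads(line)["id"]
                for line in patient_ndjson_lines if line.strip()}
    csv_ids = {row[id_column] for row in csv.DictReader(patients_csv_file)}
    return fhir_ids - csv_ids, csv_ids - fhir_ids


# Tiny in-memory example with made-up ids; real input would be the two exported files.
fhir_lines = ['{"resourceType": "Patient", "id": "p1"}',
              '{"resourceType": "Patient", "id": "p2"}']
csv_text = io.StringIO("Id,BIRTHDATE\np1,2000-01-01\np3,1990-05-05\n")
only_fhir, only_csv = compare_patient_ids(fhir_lines, csv_text)
print(only_fhir, only_csv)
```

If both returned sets are empty, the CSVs can stand in for the FHIR resources when loading into Data Connect.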

ga4gh / cohort-rep-hackathon

Data generation #2