CDCgov / IDWA

Intelligent Data Workflow Automation
Apache License 2.0

SPIKE: LAC eCR Exploratory Data Analysis #89

Open alhayward opened 2 months ago

alhayward commented 2 months ago

Topic Exploratory Data Analysis (EDA) of de-identified LA County (LAC) eCR data, focusing on analyzing variable distributions through descriptive statistics and visualizations.

Background During their partnership with LAC, the DIBBs team gained access to a sample of 1200 de-identified eCRs. The DIBBs team received no data provenance information regarding this sample, such as why these specific eCRs were shared (time range, cases, etc.) and how they were de-identified. In their work, the DIBBs team did not complete a comprehensive EDA of these eCRs, but had hoped to do so given more time. Additionally, once they gained access to LAC production eCR data, the DIBBs team needed to overhaul their Record Linkage algorithm because the distributions of the production LAC data differed significantly from those of the synthetic LAC data on which they had developed the algorithm.

Hypothesis

Problem Hypothesis: We believe the LAC eCR data holds valuable insights to inform our deduplication work, but it has yet to be explored because of previous time and resource constraints. We further believe that insights from real, de-identified eCR data will inform our deduplication efforts more effectively than insights from synthetic eCR data, since real data is more representative of real-world patient populations. This belief is based on the DIBBs team’s experience reworking the Record Linkage algorithm once they gained access to LAC production data. Because this sample is the closest data we have to production eCR data, we believe it is worth exploring the LAC data and identifying patterns in it to inform our approaches to deduplication.

Solution Hypothesis: By performing EDA on the LAC eCR data, we believe our deduplication approaches will be more data-driven, and thus more performant, scalable, and usable, because they will be developed from insights drawn from real-world eCR data. This avoids building a less performant, less usable solution on top of unrealistic synthetic data.

Objective

Questions

  1. How sparse is the LAC data? / How many null values are in the LAC data? (Feature-wise, document-wise)
  2. What are the distributions of the linkage variables in the LAC data (first name, last name, etc.)?
  3. What are the distributions of the non-linkage variables in the LAC data (encounter information, non-identifying patient information, etc.)?
  4. What is the time range of the LAC data?
  5. Are there outliers in the LAC data?
  6. In the LAC data, what are the top 10 most common…
    • Cases?
    • Diagnoses?
    • Reasons for Visit?
    • etc.
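
As a starting point for the questions above, here is a minimal sketch of how sparsity, top-N values, time range, and outliers could be computed. It assumes the eCRs have already been parsed into flat dictionaries; the field names (`first_name`, `encounter_date`, `age`, `diagnosis`), the sample rows, and the helper functions are all hypothetical placeholders, not the actual LAC schema or data. The outlier check uses the common 1.5 × IQR rule, which is one option among several and may not suit every variable.

```python
from collections import Counter
from datetime import date
from statistics import quantiles

# Hypothetical flattened eCR records; field names and values are made up
# for illustration and do not reflect the real LAC data.
records = [
    {"first_name": "A", "last_name": "X", "encounter_date": "2023-01-05", "age": 34, "diagnosis": "influenza"},
    {"first_name": "B", "last_name": None, "encounter_date": "2023-02-10", "age": 41, "diagnosis": "influenza"},
    {"first_name": "", "last_name": "Y", "encounter_date": "2023-03-15", "age": 29, "diagnosis": "covid-19"},
    {"first_name": "C", "last_name": "Z", "encounter_date": None, "age": 37, "diagnosis": "influenza"},
    {"first_name": "D", "last_name": "W", "encounter_date": "2023-04-20", "age": 38, "diagnosis": None},
    {"first_name": "E", "last_name": "V", "encounter_date": "2023-05-01", "age": 112, "diagnosis": "covid-19"},
]


def is_missing(value):
    """Treat None and blank strings as null (an assumption about the data)."""
    return value is None or (isinstance(value, str) and value.strip() == "")


def null_rate_by_field(rows):
    """Question 1, feature-wise: fraction of records missing each field."""
    fields = sorted({key for row in rows for key in row})
    return {f: sum(is_missing(row.get(f)) for row in rows) / len(rows) for f in fields}


def null_count_by_record(rows):
    """Question 1, document-wise: number of missing fields in each record."""
    fields = {key for row in rows for key in row}
    return [sum(is_missing(row.get(f)) for f in fields) for row in rows]


def top_n(rows, field, n=10):
    """Question 6: the n most common non-null values of a field."""
    values = (row[field] for row in rows if not is_missing(row.get(field)))
    return Counter(values).most_common(n)


def date_range(rows, field):
    """Question 4: earliest and latest ISO-formatted date in a field."""
    dates = [date.fromisoformat(row[field]) for row in rows if not is_missing(row.get(field))]
    return (min(dates), max(dates)) if dates else None


def iqr_outliers(rows, field, k=1.5):
    """Question 5: numeric values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    values = [row[field] for row in rows if not is_missing(row.get(field))]
    q1, _, q3 = quantiles(values, n=4)  # needs at least two values
    low, high = q1 - k * (q3 - q1), q3 + k * (q3 - q1)
    return [v for v in values if v < low or v > high]


if __name__ == "__main__":
    print(null_rate_by_field(records))
    print(null_count_by_record(records))
    print(top_n(records, "diagnosis"))
    print(date_range(records, "encounter_date"))
    print(iqr_outliers(records, "age"))
```

The same helpers apply to both linkage and non-linkage variables (questions 2 and 3) by passing a different field name to `top_n`; for the full sample, a dataframe library would likely be more convenient than plain dictionaries.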