Topic
Exploratory Data Analysis (EDA) of de-identified LA County (LAC) eCR data, focusing on analyzing variable distributions through descriptive statistics and visualizations.
Background
During their partnership with LAC, the DIBBs team gained access to a sample of 1200 de-identified eCRs. The team received no provenance information for this sample: it is unknown why these specific eCRs were shared (time range, cases, etc.) or how they were de-identified. The DIBBs team did not complete a comprehensive EDA of these eCRs, but had hoped to do so given more time. Additionally, once they gained access to LAC production eCR data, the DIBBs team needed to overhaul their Record Linkage algorithm because the distributions of the production LAC data differed significantly from those of the synthetic LAC data on which they had developed the algorithm.
Hypothesis
Problem Hypothesis:
We believe the LAC eCR data holds valuable insights to inform our deduplication work, but it has yet to be explored due to previous time/resource constraints. In fact, we believe insights from real yet de-identified eCR data will inform our deduplication efforts more effectively than insights from synthetic eCR data, since real data is more representative of real-world patient populations. This belief is based on the DIBBs team’s experience reworking the Record Linkage algorithm once they gained access to LAC production data. Because this sample is the closest data we have to real, production eCR data, we believe it is worth exploring and identifying patterns in the LAC data to best inform our approaches to deduplication.
Solution Hypothesis:
By performing EDA on the LAC eCR data, we believe that our deduplication approaches will be more data-driven, and thus more performant, scalable, and usable, because they will be developed from insights drawn from real-world eCR data. This avoids the risk of building a less performant, less usable solution from unrealistic synthetic data.
Objective
Understanding this topic better will allow us to statistically analyze and visualize characteristics of the LAC patient population based on a sample, allowing us to identify data-driven insights and limitations within our eCR deduplication approaches.
Understanding this topic better will allow us to advocate to stakeholders the value of early access to STLT production eCR data, and the limitations of our deduplication work without it.
Our goal at the end of this research spike is to be able to present descriptive statistics and visualizations of the LAC eCR data and their insights.
Our goal at the end of this research spike is to be able to standardize a set of data questions to ask of a STLT’s production eCR data via analysis once we gain access.
Questions
How sparse is the LAC data? / How many null values are in the LAC data? (Feature-wise, document-wise)
What are the distributions of the linkage variables in the LAC data (first name, last name, etc.)?
What are the distributions of the non-linkage variables in the LAC data (encounter information, (non-identifying) patient information, etc.)?
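As a starting point for the questions above, here is a minimal sketch of how sparsity and value distributions could be computed once the eCRs are parsed into flat records. Everything in it is an assumption for illustration: the field names, the sample rows, the helper names, and the choice to treat empty strings as null are hypothetical and do not reflect the actual LAC schema or data.

```python
from collections import Counter

# Hypothetical stand-in records; field names and values are made up
# for illustration and are NOT the LAC schema or LAC data.
records = [
    {"first_name": "ANA", "last_name": "LOPEZ", "birth_date": "1980-01-02", "zip": None},
    {"first_name": "ANA", "last_name": None, "birth_date": "1975-06-30", "zip": "90012"},
    {"first_name": "", "last_name": "KIM", "birth_date": None, "zip": "90012"},
    {"first_name": "JOHN", "last_name": "KIM", "birth_date": "1980-01-02", "zip": "90033"},
]


def is_missing(value):
    """Treat None and empty or whitespace-only strings as null (an assumption)."""
    return value is None or (isinstance(value, str) and value.strip() == "")


def all_fields(rows):
    """Collect every field name that appears in at least one record."""
    return sorted({key for row in rows for key in row})


def feature_null_rates(rows):
    """Feature-wise sparsity: fraction of records in which each field is missing."""
    return {
        field: sum(is_missing(row.get(field)) for row in rows) / len(rows)
        for field in all_fields(rows)
    }


def document_null_counts(rows):
    """Document-wise sparsity: number of missing fields in each record."""
    fields = all_fields(rows)
    return [sum(is_missing(row.get(field)) for field in fields) for row in rows]


def value_distribution(rows, field):
    """Counts of non-missing values for one field, most common first."""
    values = (row[field] for row in rows if not is_missing(row.get(field)))
    return Counter(values).most_common()


if __name__ == "__main__":
    print(feature_null_rates(records))
    print(document_null_counts(records))
    print(value_distribution(records, "last_name"))
```

The same three helpers apply to linkage variables (first name, last name, etc.) and non-linkage variables alike, so the output for a STLT's production data could be compared field by field against the synthetic data.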