Utility to Analytics group of data for fields frequently missing from CDM submissions

DaveraGabriel commented 4 years ago

PCORnet captures Provider and Provider IDs in the data set; ACT does not. Discussions in the mapping validation meetings have questioned the utility of such a field in the N3C data set if its data are systematically missing, as with Provider.

This also pertains to Immunization in PCORnet: some sites may populate it and others may not.
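One way to make "systematically missing" concrete is to tabulate, per site, the fraction of rows in which a field is populated. The following is a minimal sketch rather than anything from the N3C pipeline: the function name, the `site_id` and `provider_id` column names, and the toy rows are all hypothetical, and it assumes a site identifier is available at the point where the check is run.

```python
from collections import defaultdict


def populated_fraction_by_site(rows, field, site_key="site_id"):
    """Return {site: fraction of that site's rows where `field` is populated}.

    `rows` is an iterable of dicts. A value of None or "" counts as missing.
    The column names are hypothetical, not taken from any CDM specification.
    """
    total = defaultdict(int)
    filled = defaultdict(int)
    for row in rows:
        site = row[site_key]
        total[site] += 1
        if row.get(field) not in (None, ""):
            filled[site] += 1
    return {site: filled[site] / total[site] for site in total}


# Hypothetical toy data: site A populates provider_id, site B never does.
rows = [
    {"site_id": "A", "provider_id": "p1"},
    {"site_id": "A", "provider_id": "p2"},
    {"site_id": "B", "provider_id": None},
    {"site_id": "B", "provider_id": ""},
]

rates = populated_fraction_by_site(rows, "provider_id")
# Sites that never populate the field at all.
never_populated = sorted(site for site, rate in rates.items() if rate == 0.0)
```

A field that is empty at entire sites is a different problem from one that is sparsely populated everywhere, and a table like `rates` helps tell the two apart.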

hlehmann17 commented 4 years ago

SiteID is one field whose missingness we control. We all agree that there should be no explicit indication in the data of the source of the data, recognizing that ZIP codes in the patient data will provide a strong indicator in cities with only one data donor.

We have two options: (1) delete siteIDs from data uploaded to Palantir, or (2) retain siteIDs in the data, but adopt a standard operating procedure under which no analysis presents results that enable reidentification of the site.

(1) Delete siteIDs. Pro: prevents mischief and inadvertent reidentification. Con: hamstrings the analyses (see option 2).

(2) Retain siteIDs. Pro: enables the following types of analysis, recommended by analysts of Real World Evidence (e.g., FDA workshop Oct 2019): variability that swamps the effect [confounding]; causal mediator; semantics of missing (→ missing at random or not); range of data obfuscation, for comparability; propensity score; ?instrumental variable; negative control.

hlehmann17 commented 4 years ago

(2) Retain site IDs (continued)

Pro: Enables the following types of analysis, recommended by analysts of Real World Evidence (e.g., FDA workshop Oct 2019):

Variability that swamps the effect [confounding]: Variability across sites includes biomedical issues (differences in prevalence in the surrounding community, differences in medical practice) as well as informatics issues (differences in how codes are used). If this variability (noise) is greater than the COVID signal we are seeking, we won't see the signal. But we cannot represent, and are not representing, each of these differences. So a designation of "site" is the minimum we can do to account for this noise.

Causal mediator: More than just a confounder, differences across sites may have biomedical impact, as noted above. It would be a shame (or more) to eliminate an explicit, relevant causal factor.

Semantics of missing (→ missing at random or not): There will be a strong temptation to impute missing data. The first decision is whether data are missing at random or not at random. Looking at missingness by site could help in that decision.

Range of data obfuscation, for comparability: It may be (and needs to be checked) that different sites use different ranges for obfuscation, and knowledge of that range may help (or dissuade) time-based analytics.

Propensity score: In the creation of controls, propensity scores will be important (either for matching or simply as a covariate). Going back to the variability discussed earlier, site identity will be an important component in building such scores.

?instrumental variable: I'm not sure whether siteID itself can count as an instrumental variable (going back to site as causal mediator).

Negative control: Controls almost certainly have to be constructed within sites. Eliminating siteID eliminates that possibility.
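The point about looking at missingness by site can be illustrated with a chi-square statistic on a site-by-missingness contingency table. This is a minimal sketch under stated assumptions, not a procedure anyone in this thread has proposed: the function name and the counts are hypothetical, and the check is only possible when siteID is retained. A large statistic suggests that missingness depends on site, which argues against treating the field as missing completely at random. It does not by itself settle whether the data are missing at random or not at random.

```python
def missingness_chi_square(counts):
    """Pearson chi-square statistic for a sites x {missing, present} table.

    `counts` maps site -> (n_missing, n_present). Returns (chi2, dof), where
    dof = number of sites - 1, which assumes both columns have nonzero totals.
    Hypothetical helper for illustration only.
    """
    total_missing = sum(missing for missing, _ in counts.values())
    total_present = sum(present for _, present in counts.values())
    grand_total = total_missing + total_present

    chi2 = 0.0
    for missing, present in counts.values():
        site_total = missing + present
        for observed, column_total in (
            (missing, total_missing),
            (present, total_present),
        ):
            # Expected count if missingness were independent of site.
            expected = site_total * column_total / grand_total
            if expected > 0:
                chi2 += (observed - expected) ** 2 / expected
    return chi2, len(counts) - 1


# Hypothetical counts: site A is missing the field in 10% of rows, site B in 50%.
uneven = {"A": (10, 90), "B": (50, 50)}
chi2_uneven, dof_uneven = missingness_chi_square(uneven)

# Hypothetical counts: both sites are missing the field in 10% of rows.
even = {"A": (10, 90), "B": (20, 180)}
chi2_even, dof_even = missingness_chi_square(even)
```

With the site column deleted, only the pooled missing rate would be visible, and the two hypothetical scenarios above could not be told apart from a single aggregate number.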

National-Clinical-Cohort-Collaborative / Data-Ingestion-and-Harmonization

Utility to Analytics group of data for fields frequently missing from CDM submissions #26