ExposuresProvider / icees-api-config

ICEES PCD datasets #84

Closed karafecho closed 1 year ago

karafecho commented 2 years ago

This issue is to report bugs that were identified during testing of the ICEES PCD datasets (v2, Hong's first run).

karafecho commented 2 years ago

Update from new run, 08.17.2022, 2016 and 2020 datasets (v3, Hong's second run):

karafecho commented 2 years ago

Update from new run, 08.4.2022, 2016 and 2022 datasets (v4, Hong's third run):

karafecho commented 1 year ago

JAN2023 tests of v6 datasets:

Note that I ran a variety of summary statistics on the datasets for years 2010, 2011, 2012, 2013, 2014, 2015, 2016, 2017, 2018, 2019, 2020, and 2021. I looked at key demographics, exposures, visits, diagnoses, and meds within each dataset. My tests were semi-systematic.

  1. Years 2010, 2011, 2012, and 2013 are missing data on Rx variables and TotalEDVisits, TotalInpatientVisits, and TotalEDInpatientVisits. This is to be expected, I think, because James' updates were intended for years 2014-2020 (meds are legacy data prior to 2014; I'm not sure what the issue is for the visit variables). However, it's unfortunate because we have pretty rich airborne pollutant exposures data for years 2010-2013. That said, I'm inclined to expose the data but conceal (delete) the Rx and visit variables from the csv files for years 2010-2013, so that the files are not misleading to users.
  2. We now have Rx variable mappings for years 2014-2021! Some of the new Rx variables (e.g., OrkambiRx) appear to be missing, although it's possible that they were simply never prescribed/administered. I am, however, seeing at least a few of the new Rx variables (e.g., IvacaftorRx), so that's reassuring.
  3. We are now capturing TotalEDVisits, TotalInpatientVisits, and TotalEDInpatientVisits for years 2014-2021!
  4. We are missing airborne pollutant exposure estimates for years 2017-2021. This is a data gap issue that I'm working to fix. As such, we should probably conceal (delete) those variables from the csv files for years 2017-2021.
  5. For consistency, and to aid analysis, we may want to replace the empty cells for the variable Confirmed_Dx with either a string entry of "Not_CF_Brx_PCD" (or something similar) or a zero, as these are patients who had their charts reviewed and were deemed to not have a diagnosis of CF, Brx, or PCD. FWIW, I'm inclined to replace the empty cells with a string entry.

In sum, @hyi and @maximusunc, I think we're set to move forward with new ICEES+ and ICEES KG PCD deployments, but only after a decision is made re (1) and (4). If the "empty" variables show up as "null" in the APIs, then I think we should be fine, but I'd prefer to get your input before making a decision.
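For reference, a minimal sketch of what the concealment in (1) and (4) and the string fill in (5) might look like on one year's csv file. It assumes that Rx variables can be recognized by an "Rx" suffix and that the pollutant column names are supplied by the caller; `ExamplePollutant` in the usage note is a hypothetical name, not a real ICEES variable, and this is not the script actually used for the v6 files.

```python
import csv
import io

# Visit variables named in items (1) and (3).
VISIT_COLUMNS = {"TotalEDVisits", "TotalInpatientVisits", "TotalEDInpatientVisits"}


def columns_to_conceal(header, year, pollutant_columns):
    """Return the set of columns to drop from one year's csv.

    Assumption: Rx variables end in "Rx"; pollutant column names are
    passed in by the caller because the thread does not list them.
    """
    drop = set()
    if 2010 <= year <= 2013:  # item (1): Rx and visit variables are empty
        drop |= {c for c in header if c.endswith("Rx")}
        drop |= VISIT_COLUMNS & set(header)
    if 2017 <= year <= 2021:  # item (4): pollutant estimates are missing
        drop |= set(pollutant_columns) & set(header)
    return drop


def clean_csv(text, year, pollutant_columns, dx_placeholder="Not_CF_Brx_PCD"):
    """Drop concealed columns and fill empty Confirmed_Dx cells (item 5)."""
    reader = csv.DictReader(io.StringIO(text))
    drop = columns_to_conceal(reader.fieldnames, year, pollutant_columns)
    kept = [c for c in reader.fieldnames if c not in drop]
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=kept, lineterminator="\n")
    writer.writeheader()
    for row in reader:
        # Empty Confirmed_Dx means the chart was reviewed and no CF, Brx,
        # or PCD diagnosis was confirmed, so make that explicit.
        if "Confirmed_Dx" in kept and not (row["Confirmed_Dx"] or "").strip():
            row["Confirmed_Dx"] = dx_placeholder
        writer.writerow({c: row[c] for c in kept})
    return out.getvalue()
```

Calling `clean_csv(text, 2012, ["ExamplePollutant"])` would drop the Rx and visit columns but keep the pollutant column, while the same call with year 2018 would do the reverse.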

karafecho commented 1 year ago

I split Confirmed_Dx into three variables, Confirmed_CF_Dx, Confirmed_IdiopathicBronchiectasisDx, and Confirmed_PCD_Dx, corresponding to those in the all_features YAML. I then copied the files to hop.renci.org at /projects/ebcr/pcd/data/patient/v6_rev_csv_files.
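A rough sketch of the split, for anyone reproducing it. The value labels in `Confirmed_Dx` ("CF", "Brx", "PCD") and the 0/1 encoding are assumptions made for illustration; the actual labels and encoding in the v6 files may differ.

```python
# Assumed mapping from Confirmed_Dx values to the three new variables.
# The keys are guesses; only the column names come from the thread.
DX_COLUMNS = {
    "CF": "Confirmed_CF_Dx",
    "Brx": "Confirmed_IdiopathicBronchiectasisDx",
    "PCD": "Confirmed_PCD_Dx",
}


def split_confirmed_dx(row):
    """Replace Confirmed_Dx in one row (a dict) with three 0/1 indicators."""
    new_row = {k: v for k, v in row.items() if k != "Confirmed_Dx"}
    dx = (row.get("Confirmed_Dx") or "").strip()
    for label, column in DX_COLUMNS.items():
        # An empty Confirmed_Dx yields 0 in all three columns.
        new_row[column] = 1 if dx == label else 0
    return new_row
```

With this encoding, a patient whose chart review found none of the three diagnoses gets a zero in every column, which avoids the empty-cell ambiguity raised in item (5) above.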

karafecho commented 1 year ago

@maximusunc : The new v6 PCD datasets are located on hop.renci.org at /projects/ebcr/pcd/data/patient/v6_rev_csv_files.

@hyi : ICEES+ PCD is now ready for redeployment with the new datasets.

karafecho commented 1 year ago

Closing as this is complete ...