ExposuresProvider / icees-api

MIT License

PCD endpoint is hitting older dataset #282

Closed karafecho closed 10 months ago

karafecho commented 11 months ago

This issue is to note that the PCD endpoint is hitting an older dataset, one that was missing data on race. For example:

  1. Discover cohort

curl -X 'POST' \
  'https://icees-pcd.renci.org/patient/cohort' \
  -H 'accept: application/json' \
  -H 'Content-Type: application/json' \
  -d '{}'

"return value": { "cohort_id": "COHORT:1", "size": 7940 }

  2. Features for COHORT:1

curl -X 'GET' \
  'https://icees-pcd.renci.org/patient/cohort/COHORT%3A1/features' \
  -H 'accept: text/tabular'

Excerpt:

+---------------------------------------------+---------+
| feature                                     | count   |
+=============================================+=========+
| Race_UNC = Native Hawaiian/Pacific Islander | 0       |
|                                             | 0.00%   |
+---------------------------------------------+---------+
| Race_UNC = Caucasian                        | 0       |
|                                             | 0.00%   |
+---------------------------------------------+---------+
| Race_UNC = African American                 | 0       |
|                                             | 0.00%   |
+---------------------------------------------+---------+
| Race_UNC = Asian                            | 0       |
|                                             | 0.00%   |
+---------------------------------------------+---------+
| Race_UNC = Unknown                          | 0       |
|                                             | 0.00%   |
+---------------------------------------------+---------+
| Race_UNC = American/Alaskan Native          | 0       |
|                                             | 0.00%   |
+---------------------------------------------+---------+
| Race_UNC = Other                            | 0       |
|                                             | 0.00%   |
+---------------------------------------------+---------+
| Race_UNC = None                             | 335686  |
|                                             | 100.00% |
+---------------------------------------------+---------+
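
A check like the one above can be scripted. The following is a minimal sketch, assuming the `text/tabular` response keeps the two-line layout shown in the excerpt (a `feature = level` row with the count, then a row with the percentage). The function names `parse_feature_table` and `all_none_features` are hypothetical and not part of icees-api.

```python
def parse_feature_table(text):
    """Parse tabular output into {feature: {level: (count, percent)}}."""
    rows = {}
    pending = None  # (feature, level, count) waiting for its percentage row
    for line in text.splitlines():
        line = line.strip()
        if not line.startswith("|"):
            continue  # skip +---+ and +===+ border lines
        cells = [c.strip() for c in line.strip("|").split("|")]
        if len(cells) != 2:
            continue
        left, right = cells
        if " = " in left and right.isdigit():
            feature, level = left.split(" = ", 1)
            pending = (feature, level, int(right))
        elif left == "" and right.endswith("%") and pending:
            feature, level, count = pending
            rows.setdefault(feature, {})[level] = (count, float(right.rstrip("%")))
            pending = None
    return rows


def all_none_features(rows):
    """Return features where only the "None" level has a non-zero count."""
    flagged = []
    for feature, levels in rows.items():
        none_count = levels.get("None", (0, 0.0))[0]
        others = sum(c for lvl, (c, _) in levels.items() if lvl != "None")
        if none_count > 0 and others == 0:
            flagged.append(feature)
    return flagged


# Shortened copy of the excerpt above, used as sample input.
SAMPLE = """\
+---------------------------------------------+---------+
| feature                                     | count   |
+=============================================+=========+
| Race_UNC = Caucasian                        | 0       |
|                                             | 0.00%   |
+---------------------------------------------+---------+
| Race_UNC = None                             | 335686  |
|                                             | 100.00% |
+---------------------------------------------+---------+
"""

flagged = all_none_features(parse_feature_table(SAMPLE))
print(flagged)
```

Running this against the full response, rather than the shortened sample, would list every feature that came back entirely as None, which may help spot other variables with the same problem.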
karafecho commented 11 months ago

This dataset PCD_UNC_patient_2010_v6_binned_deidentified contains data for "Race":

Excerpt:

TotalEDInpatientVisits | Sex2 | Sex | Race | Ethnicity | MajorRoadwayHighwayExposure
-- | -- | -- | -- | -- | --
0 | Female | Female | African American | Not Hispanic | 5
0 | Female | Female | African American | Not Hispanic | 6
0 | Female | Female | Caucasian | Not Hispanic | 6
0 | Female | Female | Caucasian | Not Hispanic | 6
0 | Male | Male | Caucasian | Not Hispanic | 6

TotalEDInpatientVisits | Sex3 | Sex | Race | Ethnicity | MajorRoadwayHighwayExposure
-- | -- | -- | -- | -- | --
0 | Female | Female | African American | Not Hispanic | 6.4
0 | Female | Female | African American | Not Hispanic | 6.6
0 | Female | Female | Caucasian | Not Hispanic | 6.8
0 | Female | Female | Caucasian | Not Hispanic | 7
0 | Male | Male | Caucasian | Not Hispanic | 7.2
0 | Female | Female | Caucasian | Not Hispanic | 1
0 | Male | Male | Caucasian | Not Hispanic | 6
0 | Male | Male | Caucasian | Not Hispanic | 6
0 | Female | Female | African American | Hispanic | 2
0 | Female | Female | Caucasian | Not Hispanic | 6
0 | Female | Female | Caucasian | Not Hispanic | 6
0 | Male | Male | American/Alaskan Native | Not Hispanic | 6
hyi commented 11 months ago

@karafecho I looked into this issue and found the root cause: a mismatch between the patient data and the feature definitions in the feature YAML file. The patient data names the variable "Race", as you indicated above, but the feature YAML file defines the corresponding feature as "Race_UNC" and "RACE". Neither name matches the one in the patient file, so FHIR PIT created None for all patients in the Race_UNC and RACE feature columns. To fix this issue, we will need to correct the PCD feature YAML file and rerun FHIR PIT.

karafecho commented 11 months ago

Kara to update YAML file after Hong writes a script to create a diff file showing discrepancies in variables between the patient dataset and the all_features YAML file.
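
A rough sketch of what such a diff script might look like follows. It is not Hong's actual script. It assumes the patient dataset is a CSV file with a header row and that the feature names have already been extracted from the all_features YAML file (for example with PyYAML's `yaml.safe_load`; the file's layout is not shown in this thread). In the example input, only "Race_UNC" and "RACE" come from the discussion above; the other YAML feature names are hypothetical, and `normalize` is a heuristic that may need adjusting.

```python
import csv
import io


def read_header(csv_text):
    """Return the column names from the first row of CSV text."""
    return next(csv.reader(io.StringIO(csv_text)))


def normalize(name):
    # Heuristic (an assumption): ignore case and anything after the
    # first underscore, so "Race_UNC" and "RACE" both reduce to "race".
    return name.lower().split("_", 1)[0]


def diff_feature_names(patient_columns, yaml_features):
    """Report names present on only one side, plus likely renames."""
    patient, defined = set(patient_columns), set(yaml_features)
    only_in_patient = sorted(patient - defined)
    only_in_yaml = sorted(defined - patient)
    possible_renames = {}
    for yaml_name in only_in_yaml:
        matches = [p for p in only_in_patient
                   if normalize(p) == normalize(yaml_name)]
        if matches:
            possible_renames[yaml_name] = matches
    return {
        "only_in_patient": only_in_patient,
        "only_in_yaml": only_in_yaml,
        "possible_renames": possible_renames,
    }


# Header taken from the patient data excerpt earlier in this thread.
patient_columns = read_header(
    "TotalEDInpatientVisits,Sex2,Sex,Race,Ethnicity,MajorRoadwayHighwayExposure\n"
)
# Hypothetical YAML feature list; only Race_UNC and RACE are from the thread.
yaml_features = [
    "TotalEDInpatientVisits", "Sex2", "Sex", "Race_UNC", "RACE",
    "Ethnicity", "MajorRoadwayHighwayExposure",
]

report = diff_feature_names(patient_columns, yaml_features)
for key, value in report.items():
    print(key, value)
```

Names listed under `only_in_yaml` are the ones that would end up as None after a FHIR PIT run, so they are the entries to review in the YAML file.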

karafecho commented 10 months ago

Complete, closing issue ...