ICEES+ API and DB - incorrectly concatenating asthma csv files

karafecho commented 1 year ago

This issue is to report that the db is concatenating the 10 asthma csv files and treating each row as a unique patient, which is not correct. For example, if I create a cohort without specifying any feature variables, then ICEES+ returns a sample size of 1,600,000 patients (160,000 actual patients x 10 years of data). In some sense, this is not a new issue but rather relates to the ongoing design discussions and associated decisions, but we should resolve it. There is a dummy variable that links patients across years / csv files, so we should be able to readily resolve any db and/or design issues.

karafecho commented 1 year ago

@jjgarciac: The above is not a FHIR PIT issue, but does relate to the general design of ICEES+, so tagging you to chime in if you have any thoughts. Thanks!

hyi commented 1 year ago

@karafecho I double checked the db and verified patients are identified by the PatientID column which corresponds to the index column in the patient csv files, and the total number of patients identified by the PatientID column reflects the number of patients correctly. However, each patient is stored in multiple rows with each row corresponding to a year if that patient has hospital visits that year, so the total number of patients across all years are treated as the total sample size if there is no feature variables are specified. Since we allow users to do feature associations across all years or for a specific year, it may make sense to have a cohort sample size reflect all patients with feature variables associated with their visits over all years rather than reflect the unique number of patients, since there are different number of patients for different years depending on their visits. We can discuss this at our meeting if needed.

karafecho commented 1 year ago

This all makes sense and reflects my understanding, too. However, I'm not sure this treatment makes sense.

A few comments:

study_period vs year issue. I think I was overthinking this issue. We have a feature variable Active_In_Year that allows users to select for only those patients who are 'active' or had a visit that year. So that resolves the study_period vs year issue, and we can leave year as a filter for the API functionality.

  Active_In_Year:
    categories:
    - biolink:PhenotypicFeature
    maximum: 1
    minimum: 0
    type: integer

PatientID column in db, index column in csv. If these are unique, then the total sample size should be the maximum of PatientID/index, I think.
Total number of observations vs patients. As noted above, if I create a cohort without specifying any feature variables, then ICEES+ returns a sample size of 1,600,000 patients (160,000 actual patients x 10 years of data). I think this calculation is actually the total number of observations, not the total number of patients (i.e., the sample size).

hyi commented 1 year ago

@karafecho All your comments make sense to me. I thought it made more sense to treat the total sample size as the total number of patients, but am not sure whether the statistics for feature associations across years for a cohort are computed based on the total number of observations or the total number of patients, and thought total number of observations were used as the total sample size for computing the feature association statistics. I'll look into the code trying to understand how we are currently handling the total sample size in a cohort across years as part of this issue. As my preliminary investigation, I did notice for PCD database, the total number of patients is different for different years depending on visits which was the main reason that made me think it probably made more sense to treat total number of observations as sample size in a cohort. It seems it is not the case for asthma database, though, as the number of patients appears to be the same for each year in the study period and we can use Active_In_year feature variable to see whether a patient is active in that year. For PCD data, as part of the commit in FHIR PIT from Hao, Active_in_year is changed to Active_In_Study_Period, and as a result, Active_in year is all null for all patients where 45438 out of 45476 total patient observations in patients data across all years have Active_In_Study_Period as 1 while 36 have it as 0 and 2 have it as null. Looks like there were potential issues with year->study_period and active_in_year->Active_In_Study_Period changes in FHIR PIT.

hyi commented 1 year ago

@karafecho Supplemental info: the 2 null patient records in the PCD patient table actually have all columns null including PatientID column, so these 2 records are null records, which should not exist in the first place. Will look into how that happened and fix it as part of removing identifiers yaml file dependency from icees-db and icees-api.

karafecho commented 1 year ago

I think we're largely on the same page. The design of ICEES is very tricky, however, so we may need to think things through a bit more.

I think we both agree that total sample size should be the total number of patients included within the cohort, whether active for a given year or not (i.e., the patients who meet the inclusion/exclusion criteria at the time of cohort creation). If so, then the maximum integer for PatientID/index represents the total sample size.
When a user doesn't specify a feature(s) for a given cohort or select a year when applying a feature association functionality, then the feature association functionality will calculate associations based on the total number of observations, not patients.
If a user creates a (sub)cohort and includes Active_In_Year, and then runs a functionality with a selection of year, then that should subset to only those patients within the overall cohort who had a visit in the selected year.
I think that Hao initiated changes to the treatment of feature variables and year/study_period, based on my recommendation, but I don't think he had time to fully implement the changes before leaving RENCI. So, yes, I agree that there are issues with the conversions year -> study_period (and now back to year) and Active_In_Year -> Active_In_Study_Period.
The counts you list above for the PCD datasets seem accurate to me, in terms of observations across the entire 11 years of data: _with all patients where 45438 out of 45476 total patient observations in patients data across all years have Active_In_StudyPeriod as 1 while 36 have it as 0 and 2 have it as null. I think that the 36 patients with Active_In_Study_Period = 0 across the sampling period met the selection criteria but did not have any visits over the years of data that were sampled. The null patient records were probably included by mistake and likely represent a CAMP FHIR issue or perhaps an icees-db issue.

Thanks for the troubleshooting. Please let me know if a one-off troubleshooting meeting would be helpful.

hyi commented 1 year ago

@karafecho all these points make sense to me, with a couple of follow-up questions or action items to run by you:

Should we revert Hao's commit back in FHIR PIT to revert study_period back to year and Active_In_Study_Period back to Active_in_Year? I think we should, but let me know if you disagree or see any ramifications.
Is it true for asthma data there are same number of total patients for each year in the study period? If so, this is not the case for PCD data. For example, for PCD data, there were 1311 total patients for year 2010 where 1310 patients were active in year and 1 patient was inactive, while for year 2020 there were 4753 total patients where 4750 patients were active in year and 3 patients were inactive.

I will remove the 2 all null patient records and copy Active_In_Study_Period column to Active_In_Year column in PCD sqlite database to fix the issues in the database for now and redeploy PCD dev and prod instances.

karafecho commented 1 year ago

@hyi:

I agree that we should revert Hao's commit back. This wasn't fully implemented and is now causing issues, so probably best to revert back and reconsider the plan.
The sample sizes for the asthma datasets vary a bit by year, but are roughly 160K each year. For example N = 159327 in 2010, 159169 in 2011, and 155333 in 2019. The asthma patients were selected in 2010; hence the decrease in sample size over future years as patients drop out of the healthcare system due to death, etc. In contrast, the PCD patients were selected in year 2020; hence the decrease in sample size over prior years as patient records may not have existed in the past. The overall design is a bit confusing, but I'm not sure I see a clear alternative.

hyi commented 1 year ago

@karafecho Thanks for the clarification. It makes sense to me now. I have cleaned up PCD database as detailed above and redeployed the PCD dev and prod instances. Now if you create a cohort with no feature selected, it will return 45474 (2 less than the previous return). I will revert Hao's commit back in FHIR PIT so we will not run into study_period vs year and Active_In_Year vs Active_In_Study_Period issue going forward when running FHIR PIT again to create new datasets. I'll also fix the cohort creation POST endpoint to make sure it returns total number of patients rather than total number of observations.

karafecho commented 1 year ago

Your plan sounds great to me, Hong. Thanks!

karafecho commented 1 year ago

Closing this issue after testing all new prod deployments and verifying that fixes are now in place.

ExposuresProvider / icees-api

ICEES+ API and DB - incorrectly concatenating asthma csv files #247