Inconsistent Handling of `cohortIds` in `getDbCovariateData` Depending on `aggregated` Setting

gowthamrao commented 9 months ago

Inconsistent Handling of `cohortIds` in `getDbCovariateData` Depending on `aggregated` Setting

Issue Description:

I've encountered a potential bug in the FeatureExtraction package's getDbCovariateData function. The function behaves inconsistently when handling the cohortIds parameter, depending on the state of the aggregated argument.

Detailed Explanation:

Case 1: aggregated = FALSE

When running getDbCovariateData with aggregated = FALSE, the function returns a table with rowId representing subject_id. However, it appears to ignore the cohortIds parameter. This is problematic, especially if a subjectId is present in multiple cohortIds, leading to ambiguous or misleading results. Which cohort does the subjectId belong to?

Code Snippet:
```
FeatureExtraction::getDbCovariateData(
connection = connection,
oracleTempSchema = tempEmulationSchema,
cdmDatabaseSchema = cdmDatabaseSchema,
cohortIds = cohortIds,
covariateSettings = cohortCovariateSettings,
aggregated = FALSE,
cohortTable = cohortTable,
cohortDatabaseSchema = cohortDatabaseSchema
)
```

output:

Tables: $analysisRef (analysisId, analysisName, domainId, isBinary, missingMeansZero) $covariateRef (covariateId, covariateName, analysisId, conceptId) $covariates (rowId, covariateId, covariateValue, timeId) $timeRef (timeId, startDay, endDay)

Case 2: aggregated = TRUE

Conversely, when aggregated is set to TRUE, the function includes cohortDefinitionId and handles the multiple cohortIds.

Code Snippet:

FeatureExtraction::getDbCovariateData(
connection = connection,
oracleTempSchema = tempEmulationSchema,
cdmDatabaseSchema = cdmDatabaseSchema,
cohortIds = cohortIds,
covariateSettings = cohortCovariateSettings,
aggregated = TRUE,
cohortTable = cohortTable,
cohortDatabaseSchema = cohortDatabaseSchema
)

Tables: $analysisRef (analysisId, analysisName, domainId, isBinary, missingMeansZero) $covariateRef (covariateId, covariateName, analysisId, conceptId) $covariates (cohortDefinitionId, covariateId, timeId, sumValue, averageValue) $timeRef (timeId, startDay, endDay)

Impact:

This inconsistency leads:

If I use aggregated = FALSE, I cannot discern if the covariates are related to the same subject across different cohortIds.

I believe this issue emerged after the introduction of the aggregate = TRUE option and the shift from cohortId to cohortIds. The primary concern is that end users might not realize that cohortIds are being overlooked in the non-aggregated mode, leading to potentially incorrect analyses.

Suggested Resolution:

A potential fix could involve ensuring that cohortIds are appropriately handled in both aggregated and non-aggregated modes. Specifically, for aggregated = FALSE, it would be beneficial to include both subjectId and cohortDefinitionId in the output to clarify the association of subjects with specific cohorts.

gowthamrao commented 9 months ago

I think the solution is

we do not allow > 1 cohortId when aggregate = FALSE
or, return cohortDefiniitonId in the output (breaking change)

gowthamrao commented 9 months ago

@jreps does PLP calculate feature extraction one cohort at a time, or does it calculate feature extract multiple cohorts at a time. If later, this may introduce errors. If not, it should be ok

anthonysena commented 9 months ago

@gowthamrao you bring up a good point and tagging @jreps and @schuemie for comment. I did a little digging here and found that CohortMethod and PatientLevelPrediction are doing some work on the cohorts to ensure that the cohortTable passed into FeatureExtraction::getDbCovariateData contains unique records per patient, per cohort (or at least per cohort_start_date). For reference:

CohortMethod: https://github.com/OHDSI/CohortMethod/blob/main/inst/sql/CreateCohorts.sql#L37 PLP: https://github.com/OHDSI/PatientLevelPrediction/blob/main/inst/sql/sql_server/CreateCohorts.sql#L31

In this way, these packages use FeatureExtraction to extract features in bulk for all patients across all cohorts. So perhaps the answer here is to provide a better description of how to use FeatureExtraction::getDbCovariateData with aggregated = FALSE to prevent the problem you've described in this issue. Another thought is to potentially move the CreateCohorts.sql scripts referenced above into FeatureExtraction so we can have a base implementation for people to use.

schuemie commented 9 months ago

This is not a bug, just expected behavior (so maybe we should just document it better as @anthonysena suggests). When aggregate = FALSE, the output is restricted to just the cohort IDs you provide, in contrast to what @gowthamrao says. If a person is in multiple cohorts, and you choose to set your row ID to be your subject ID, then yes, there are collisions. The same is true if you have only 1 cohort, and a person has multiple entries (multiple start dates) in the same cohort.

gowthamrao commented 9 months ago

If a person is in multiple cohorts

Wouldn't it be more robust to enforce the use of a single cohortId when aggregate = FALSE? Alternatively, if multiple cohortIds are used, we could require that their length is exactly one. This could significantly reduce the risk of inadvertent data collisions. Also, I'm curious if allowing collisions of cohortIds was an intentional design choice, similar to the deliberate and accepted design decision for cohort_start_date collisions.

schuemie commented 9 months ago

Also, I'm curious if allowing collisions of cohortIds was an intentional design choice, similar to the deliberate and accepted design decision for cohort_start_date collisions.

A long time ago, FeatureExtraction didn't use rowId but instead used subjectId and cohortStartDate to link the Andromeda cohort table to the covariates table, but that was abandoned for computational reasons. SubjectId (person_id) is a BIGINT (64-bit integer) in the CDM so has to be represented as a string in R. Joining on a string and a date is slow, and the covariate table can have billions of rows so adding a string column takes a lot of space.

So I've normalized the model by introducing a rowId that links cohort and covariates, and the subjectId and cohortStartDate are stored once, in the cohort table. That of course requires that the user picks a rowId that is unique, which in most cases is the subjectId, except when its not you have to generate your own.

Cohort IDs are somewhat non-informative in this context, because the covariates constructed for a specific person for a specific start date are the same, even if that combination occurs in multiple cohorts.

We could have FeatureExtraction generate unique rowIds on the fly, which would be a breaking change.

gowthamrao commented 9 months ago

Thank you for the explanation @schuemie . The FeatureExtraction package is currently amazingly fast - i would hate any design changes that impacts its computational performance, and your choices makes sense. Thank you for this contribution.

That of course requires that the user picks a rowId that is unique, which in most cases is the subjectId, except when its not you have to generate your own.

Yes, and as we discussed above - this is implicit. I think documentation would help make it explicit, and we should do that at a minimum. I would prefer an engineering solution that prevents use of multiple target cohortIds when aggregate = FALSE, or at-least throw a warning message. I think an engineering solution of restricting to one target cohortId at a time makes sense when aggregate = FALSE - i.e. why would we want to simultaneously extract features for multiple target cohorts? When aggregate = TRUE, computing features for multiple target cohorts make sense.

Cohort IDs are somewhat non-informative in this context, because the covariates constructed for a specific person for a specific start date are the same, even if that combination occurs in multiple cohorts.

Yes I agree partly. In my use case, i was using multiple target cohortIds with aggregate = FALSE, to get covariateData for multiple different target cohorts that i was planning to use. So i needed to know which cohortId the subject belonged to. Not having cohortId in the output surprised me. To solve my use case, i will just loop over one cohortId at a time and then append results locally.

gowthamrao commented 9 months ago

Could you please confirm this

based on and discussion above

https://github.com/OHDSI/FeatureExtraction/blob/8a7cb70c3d0cae4862fdf8ca286d5bc77adc74e8/inst/sql/sql_server/ConceptCounts.sql#L52-L55

When aggregate = FALSE and rowId = 'subject_id' and this is an event cohort i.e. one subject_id may have more than one cohort_start_date -- it would aggregate by subject_id and provide counts for features across cohort_start_id.

i.e. i think this makes a case to support for (atleast as an option) subject_id, cohort_start_date = row_id

anthonysena commented 8 months ago

Making a note based on discussion with @ginberg that we can commit to updating the documentation to make the use of the aggregated parameter clearer.

If we want to go deeper with the functionality as @gowthamrao suggests, we can open a new issue to work through the implications of adding this type of handling directly into FeatureExtraction.

OHDSI / FeatureExtraction