Closed gowthamrao closed 5 months ago
I think the solution is
@jreps does PLP run feature extraction one cohort at a time, or does it extract features for multiple cohorts at once? If the latter, this may introduce errors. If not, it should be OK.
@gowthamrao you bring up a good point and tagging @jreps and @schuemie for comment. I did a little digging here and found that CohortMethod and PatientLevelPrediction are doing some work on the cohorts to ensure that the cohortTable passed into FeatureExtraction::getDbCovariateData contains unique records per patient, per cohort (or at least per cohort_start_date). For reference:
CohortMethod: https://github.com/OHDSI/CohortMethod/blob/main/inst/sql/CreateCohorts.sql#L37
PLP: https://github.com/OHDSI/PatientLevelPrediction/blob/main/inst/sql/sql_server/CreateCohorts.sql#L31
In this way, these packages use FeatureExtraction to extract features in bulk for all patients across all cohorts. So perhaps the answer here is to provide a better description of how to use FeatureExtraction::getDbCovariateData with aggregated = FALSE to prevent the problem you've described in this issue. Another thought is to potentially move the CreateCohorts.sql scripts referenced above into FeatureExtraction so we can have a base implementation for people to use.
This is not a bug, just expected behavior (so maybe we should just document it better as @anthonysena suggests). When aggregated = FALSE, the output is restricted to just the cohort IDs you provide, in contrast to what @gowthamrao says. If a person is in multiple cohorts, and you choose to set your row ID to be your subject ID, then yes, there are collisions. The same is true if you have only 1 cohort, and a person has multiple entries (multiple start dates) in the same cohort.
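To make the collision concrete, here is a minimal sketch in plain Python (not FeatureExtraction code). The cohort rows and the helper `find_row_id_collisions` are hypothetical; the only assumption carried over from the discussion is that the subject ID is used as the row ID.

```python
from collections import defaultdict

# Hypothetical cohort rows: (cohort_definition_id, subject_id, cohort_start_date)
cohort_rows = [
    (1, 101, "2020-01-01"),
    (2, 101, "2021-06-15"),  # same person, different cohort
    (1, 102, "2020-03-10"),
    (1, 102, "2022-03-10"),  # same person, second entry in the same cohort
    (2, 103, "2019-11-30"),
]


def find_row_id_collisions(rows):
    """Group cohort entries by subject_id (used here as the row ID) and
    return only the row IDs that map to more than one cohort entry."""
    entries_by_row_id = defaultdict(list)
    for cohort_id, subject_id, start_date in rows:
        entries_by_row_id[subject_id].append((cohort_id, start_date))
    return {rid: e for rid, e in entries_by_row_id.items() if len(e) > 1}


collisions = find_row_id_collisions(cohort_rows)
for row_id, entries in sorted(collisions.items()):
    print(row_id, entries)
```

Subjects 101 and 102 are flagged: one because the person is in two cohorts, the other because the person has two start dates within a single cohort. Both are the cases described above.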
If a person is in multiple cohorts
Wouldn't it be more robust to enforce the use of a single cohortId when aggregated = FALSE? Alternatively, if cohortIds is passed as a vector, we could require that its length is exactly one. This could significantly reduce the risk of inadvertent data collisions. Also, I'm curious if allowing collisions of cohortIds was an intentional design choice, similar to the deliberate and accepted design decision for cohort_start_date collisions.
Also, I'm curious if allowing collisions of cohortIds was an intentional design choice, similar to the deliberate and accepted design decision for cohort_start_date collisions.
A long time ago, FeatureExtraction didn't use rowId but instead used subjectId and cohortStartDate to link the Andromeda cohort table to the covariates table, but that was abandoned for computational reasons. SubjectId (person_id) is a BIGINT (64-bit integer) in the CDM so has to be represented as a string in R. Joining on a string and a date is slow, and the covariates table can have billions of rows so adding a string column takes a lot of space.
So I've normalized the model by introducing a rowId that links cohort and covariates, and the subjectId and cohortStartDate are stored once, in the cohort table. That of course requires that the user picks a rowId that is unique, which in most cases is the subjectId; when it's not, you have to generate your own.
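As a rough illustration of that normalized layout, the sketch below models the two tables in plain Python. The table contents, the covariate IDs, and the helper `attach_person_info` are made up for illustration; only the idea of linking on a unique rowId comes from the description above.

```python
# Hypothetical miniature of the cohort table, keyed by rowId.
# subjectId is kept as a string, mirroring how a 64-bit ID is handled in R.
cohort = {
    1: {"subjectId": "9007199254740993", "cohortStartDate": "2020-01-01"},
    2: {"subjectId": "9007199254740994", "cohortStartDate": "2020-03-10"},
}

# Hypothetical covariates rows: (rowId, covariateId, covariateValue).
# Only the compact rowId is repeated per covariate, not the string and date.
covariates = [
    (1, 1002, 1.0),
    (1, 2005, 1.0),
    (2, 1002, 1.0),
]


def attach_person_info(covariate_rows, cohort_table):
    """Join covariates back to the cohort table on rowId to recover
    subjectId and cohortStartDate for each covariate row."""
    joined = []
    for row_id, covariate_id, value in covariate_rows:
        person = cohort_table[row_id]
        joined.append(
            (person["subjectId"], person["cohortStartDate"], covariate_id, value)
        )
    return joined


for record in attach_person_info(covariates, cohort):
    print(record)
```

The join only works because each rowId identifies exactly one cohort entry; if two entries shared a rowId, the lookup could not tell them apart.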
Cohort IDs are somewhat non-informative in this context, because the covariates constructed for a specific person for a specific start date are the same, even if that combination occurs in multiple cohorts.
We could have FeatureExtraction generate unique rowIds on the fly, which would be a breaking change.
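One way such on-the-fly row IDs could look, sketched in plain Python under the assumption that a row is identified by the (subject_id, cohort_start_date) pair, since the covariates for that pair are the same regardless of cohort. The function `assign_row_ids` and the sample rows are hypothetical and do not reflect how FeatureExtraction is implemented.

```python
def assign_row_ids(rows):
    """Assign one row ID per distinct (subject_id, cohort_start_date) pair.

    rows: iterable of (cohort_definition_id, subject_id, cohort_start_date).
    Returns the pair-to-rowId mapping and the rows with row_id prepended.
    """
    row_id_by_key = {}
    with_ids = []
    for cohort_id, subject_id, start_date in rows:
        key = (subject_id, start_date)
        if key not in row_id_by_key:
            # IDs are handed out in order of first appearance, starting at 1.
            row_id_by_key[key] = len(row_id_by_key) + 1
        with_ids.append((row_id_by_key[key], cohort_id, subject_id, start_date))
    return row_id_by_key, with_ids


rows = [
    (1, 101, "2020-01-01"),
    (2, 101, "2020-01-01"),  # same person and date in a second cohort
    (1, 101, "2021-06-15"),  # same person, later start date
    (2, 102, "2020-01-01"),
]
mapping, rows_with_ids = assign_row_ids(rows)
for row in rows_with_ids:
    print(row)
```

Here the first two rows share a row ID because they describe the same person on the same date, while the later start date gets its own.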
Thank you for the explanation @schuemie. The FeatureExtraction package is currently amazingly fast. I would hate any design change that impacts its computational performance, and your choices make sense. Thank you for this contribution.
That of course requires that the user picks a rowId that is unique, which in most cases is the subjectId; when it's not, you have to generate your own.
Yes, and as we discussed above, this is implicit. I think documentation would help make it explicit, and we should do that at a minimum. I would prefer an engineering solution that prevents the use of multiple target cohortIds when aggregated = FALSE, or at least throws a warning message. I think restricting to one target cohortId at a time makes sense when aggregated = FALSE, i.e. why would we want to simultaneously extract features for multiple target cohorts? When aggregated = TRUE, computing features for multiple target cohorts makes sense.
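The kind of guard being proposed could be sketched as follows, in plain Python. The function name `check_cohort_ids` and the warning text are hypothetical; this is not an existing FeatureExtraction check.

```python
import warnings


def check_cohort_ids(cohort_ids, aggregated):
    """Return True if the combination is safe; otherwise emit a warning
    and return False. 'Unsafe' here means person-level output requested
    for more than one distinct cohort ID."""
    unique_ids = sorted(set(cohort_ids))
    if not aggregated and len(unique_ids) > 1:
        warnings.warn(
            f"aggregated = FALSE with {len(unique_ids)} cohort IDs: "
            "row IDs may collide if a subject is in more than one cohort.",
            UserWarning,
        )
        return False
    return True


check_cohort_ids([1, 2], aggregated=True)   # fine: aggregated output
check_cohort_ids([1], aggregated=False)     # fine: single cohort
check_cohort_ids([1, 2], aggregated=False)  # warns
```

A warning rather than an error keeps existing workflows running, which matters for packages that deliberately extract features for several cohorts in bulk after deduplicating the cohort table.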
Cohort IDs are somewhat non-informative in this context, because the covariates constructed for a specific person for a specific start date are the same, even if that combination occurs in multiple cohorts.
Yes, I partly agree. In my use case, I was using multiple target cohortIds with aggregated = FALSE to get covariateData for several different target cohorts that I was planning to use, so I needed to know which cohortId each subject belonged to. Not having cohortId in the output surprised me. To solve my use case, I will just loop over one cohortId at a time and then append the results locally.
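The loop-and-append workaround can be sketched like this in plain Python. `extract_person_level` is a hypothetical stand-in for a single person-level extraction restricted to one cohort (it returns a dummy covariate per subject); it is not a FeatureExtraction function, and the sketch assumes each subject appears at most once per cohort.

```python
def extract_person_level(cohort_id, cohort_rows):
    """Hypothetical stand-in for one person-level extraction restricted
    to a single cohort. Returns one dummy covariate row per subject."""
    return [
        {"rowId": subject_id, "covariateId": 1002, "covariateValue": 1.0}
        for cid, subject_id in cohort_rows
        if cid == cohort_id
    ]


def extract_per_cohort(cohort_ids, cohort_rows):
    """Run the extraction once per cohort ID and tag each result row
    with the cohort it came from before appending it."""
    combined = []
    for cohort_id in cohort_ids:
        for record in extract_person_level(cohort_id, cohort_rows):
            combined.append({"cohortId": cohort_id, **record})
    return combined


# Hypothetical cohort membership: (cohort_definition_id, subject_id)
cohort_rows = [(1, 101), (2, 101), (2, 103)]
result = extract_per_cohort([1, 2], cohort_rows)
for row in result:
    print(row)
```

Because every row carries its cohortId, subject 101 appears twice without ambiguity, once per cohort.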
Could you please confirm this, based on the discussion above: when aggregated = FALSE and rowId = 'subject_id' and this is an event cohort (i.e. one subject_id may have more than one cohort_start_date), it would aggregate by subject_id and provide counts for features across cohort_start_date.
I.e., I think this makes a case for supporting (at least as an option) a row_id built from subject_id and cohort_start_date.
Making a note based on discussion with @ginberg that we can commit to updating the documentation to make the use of the aggregated parameter clearer.
If we want to go deeper with the functionality as @gowthamrao suggests, we can open a new issue to work through the implications of adding this type of handling directly into FeatureExtraction.
Inconsistent Handling of cohortIds in getDbCovariateData Depending on aggregated Setting
Issue Description:
I've encountered a potential bug in the FeatureExtraction package's getDbCovariateData function. The function behaves inconsistently when handling the cohortIds parameter, depending on the state of the aggregated argument.
Detailed Explanation:
Case 1: aggregated = FALSE
When running getDbCovariateData with aggregated = FALSE, the function returns a table with rowId representing subject_id. However, it appears to ignore the cohortIds parameter. This is problematic, especially if a subjectId is present in multiple cohortIds, leading to ambiguous or misleading results. Which cohort does the subjectId belong to?
Code Snippet:
output:
Case 2: aggregated = TRUE
Conversely, when aggregated is set to TRUE, the function includes cohortDefinitionId and handles the multiple cohortIds.
Code Snippet:
Impact:
This inconsistency leads to ambiguity: with aggregated = FALSE, I cannot discern if the covariates are related to the same subject across different cohortIds.
I believe this issue emerged after the introduction of the aggregated = TRUE option and the shift from cohortId to cohortIds. The primary concern is that end users might not realize that cohortIds are being overlooked in the non-aggregated mode, leading to potentially incorrect analyses.
Suggested Resolution:
A potential fix could involve ensuring that cohortIds are appropriately handled in both aggregated and non-aggregated modes. Specifically, for aggregated = FALSE, it would be beneficial to include both subjectId and cohortDefinitionId in the output to clarify the association of subjects with specific cohorts.