Add parameter to allow for sampling of T and C in the data fetch step

pbr6cornell commented 6 years ago

Use case: during initial feasibility assessment, it can be useful to sample from T and C, fit a propensity score model, and run diagnostics to assess the adequacy of a study before implementing the full study for the outcome of interest. Sampling T and C reduces the data size and the wait time for feature extraction and data download.

I think the parameter should be added to the function getDbCohortMethodData.

The PatientLevelPrediction package has an analogous parameter, 'sampleSize', in getPlpData.

schuemie commented 6 years ago

I'd like to push back on this a bit for the following reasons:

1. In our current environment, the data fetch typically takes no more than half an hour, which seems a reasonable time for a feasibility study.
2. In the develop branch I've already added the option to sample the cohorts prior to fitting the PS model. Fitting the PS model can take up to 2 days in our environment, so sampling seems more helpful there.
3. The way to sample is not entirely obvious to me: we could sample uniformly across T and C, but if T and C are of very different sizes we might shrink one cohort too much. If we sample T and C separately (as done in the createPs function mentioned above), the ratio between them changes, making interpretation difficult (in the createPs function this is automatically corrected for).
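
To make the trade-off in the last point concrete, here is a minimal Python sketch of the two sampling strategies. It is only an illustration, not CohortMethod code: the function names `sample_uniform` and `sample_separately`, the cohort sizes, and the use of plain lists of subject IDs are all assumptions made for this example.

```python
import random

def sample_uniform(target, comparator, fraction, seed=0):
    """Sample one fraction from the pooled T and C subjects (hypothetical helper)."""
    rng = random.Random(seed)
    pooled = [("T", s) for s in target] + [("C", s) for s in comparator]
    picked = rng.sample(pooled, round(len(pooled) * fraction))
    t = [s for label, s in picked if label == "T"]
    c = [s for label, s in picked if label == "C"]
    return t, c

def sample_separately(target, comparator, max_size, seed=0):
    """Cap each cohort independently at max_size (hypothetical helper)."""
    rng = random.Random(seed)
    t = rng.sample(target, min(max_size, len(target)))
    c = rng.sample(comparator, min(max_size, len(comparator)))
    return t, c

# Illustrative cohorts: the comparator is 100 times larger than the target.
target = list(range(1_000))
comparator = list(range(1_000, 101_000))

# Uniform sampling keeps the T:C ratio roughly intact but leaves very few T subjects.
t_u, c_u = sample_uniform(target, comparator, fraction=0.01)
print(len(t_u), len(c_u))

# Separate sampling keeps enough T subjects but changes the ratio from 1:100 to 1:1.
t_s, c_s = sample_separately(target, comparator, max_size=500)
print(len(t_s), len(c_s))  # 500 500
```

With these made-up sizes, uniform sampling is expected to leave only about ten target subjects, while separate sampling yields equal-sized cohorts whose ratio no longer reflects the source population.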

Maybe we could just introduce a generic function that generates new cohorts by randomly sampling from other cohorts? It doesn't solve problem 3, but at least it puts the responsibility in the hands of the user.

schuemie commented 6 years ago

OK, I added the option anyway (I needed it for the method evaluation, where some cohorts had more than 7 million subjects).

In the current development version, getDbCohortMethodData has a new argument, maxCohortSize. If set to a value > 0, the target and comparator cohorts will each be restricted to this size (through random sampling): https://github.com/OHDSI/CohortMethod/commit/a54d83492903d7a098605f5cc525bbc74efe5cbf

Of course, this argument has also percolated to the createGetDbCohortMethodDataArgs function.
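
The maxCohortSize semantics described above can be sketched in Python as follows. This is an illustration under stated assumptions, not the package's implementation: the helper `restrict_cohorts`, the `(subject_id, treatment)` row layout, and the assumption that a cohort already at or below the limit is left unchanged are all hypothetical.

```python
import random

def restrict_cohorts(rows, max_cohort_size=0, seed=42):
    """Cap each cohort at max_cohort_size by random sampling (hypothetical helper).

    rows: list of (subject_id, treatment), treatment 1 = target, 0 = comparator.
    A value of 0 (or less) means no restriction is applied.
    """
    if max_cohort_size <= 0:
        return list(rows)
    rng = random.Random(seed)
    kept = []
    for treatment in (1, 0):
        cohort = [row for row in rows if row[1] == treatment]
        # Assumption: cohorts within the limit are kept as they are.
        if len(cohort) > max_cohort_size:
            cohort = rng.sample(cohort, max_cohort_size)
        kept.extend(cohort)
    return kept

# Illustrative data: 300 target subjects and 5,000 comparator subjects.
rows = [(i, 1) for i in range(300)] + [(i, 0) for i in range(300, 5300)]
sampled = restrict_cohorts(rows, max_cohort_size=1000)
n_target = sum(1 for _, treatment in sampled if treatment == 1)
n_comparator = sum(1 for _, treatment in sampled if treatment == 0)
print(n_target, n_comparator)  # 300 1000
```

Note that capping both cohorts at the same size changes the T:C ratio whenever only one of them exceeds the limit, which is the interpretation issue raised earlier in the thread.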

OHDSI / CohortMethod

Add parameter to allow for sampling of T and C in the data fetch step #58