Open gowthamrao opened 4 years ago
Adding quantitative diagnostics to the CohortDiagnostics incidence plot. There are time-series methods and implementations that can detect the presence or absence of seasonality or periodicity, with varying sensitivity, specificity, PPV, and NPV. Similarly, there are methods and implementations that can detect change points or structural breaks. Using stats::stl is one approach to decomposing a time series; its output may be used to generate "smoothed plots" that may further inform diagnostic interpretation.
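To make the seasonality idea concrete, here is a minimal sketch of one simple check: the sample autocorrelation at the seasonal lag, compared against the approximate 95% white-noise bound. It is written in Python purely for illustration and is not stats::stl or anything in CohortDiagnostics; the function names and the synthetic series are hypothetical, and the check assumes the series has no strong trend.

```python
import math

def autocorrelation(series, lag):
    """Sample autocorrelation of `series` at a positive `lag`."""
    n = len(series)
    mean = sum(series) / n
    var = sum((x - mean) ** 2 for x in series)
    if var == 0 or lag >= n:
        return 0.0  # constant or too-short series: no evidence either way
    cov = sum((series[t] - mean) * (series[t + lag] - mean)
              for t in range(n - lag))
    return cov / var

def looks_seasonal(series, period=12):
    """Flag seasonality when the lag-`period` autocorrelation exceeds
    the approximate 95% white-noise bound 1.96 / sqrt(n)."""
    threshold = 1.96 / math.sqrt(len(series))
    return autocorrelation(series, period) > threshold

# Hypothetical monthly incidence rates: 4 years with a 12-month cycle.
monthly_ir = [10 + 3 * math.sin(2 * math.pi * m / 12) for m in range(48)]
flat_ir = [10.0] * 48

print(looks_seasonal(monthly_ir), looks_seasonal(flat_ir))
```

A real diagnostic would need detrending and a proper test; this only shows why roughly 30 or more points are needed before such a statistic means anything.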
@schuemie many of these methods need a fair number of data points (roughly 30 or more), i.e. the incidence rate may need to be computed monthly or quarterly (in addition to the current default of yearly). Do you think we could add the ability to compute IR for additional calendar periods (calendar month, calendar quarter)? This may only be useful for larger cohorts.
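As a rough sketch of what computing IR at several calendar granularities could look like, assuming the input is a list of (calendar date, new cases, person-years) rows. The row layout, the function names, and the per-1,000 scaling are all assumptions for illustration, not the CohortDiagnostics implementation:

```python
from collections import defaultdict
from datetime import date

def period_key(d, granularity):
    """Map a date to its calendar year, quarter, or month label."""
    if granularity == "year":
        return f"{d.year}"
    if granularity == "quarter":
        return f"{d.year}-Q{(d.month - 1) // 3 + 1}"
    if granularity == "month":
        return f"{d.year}-{d.month:02d}"
    raise ValueError(f"unknown granularity: {granularity}")

def incidence_rates(rows, granularity, per=1000.0):
    """Sum cases and person-years per period, then return cases per
    `per` person-years. `rows` holds (date, cases, person_years)."""
    totals = defaultdict(lambda: [0, 0.0])
    for d, cases, person_years in rows:
        key = period_key(d, granularity)
        totals[key][0] += cases
        totals[key][1] += person_years
    return {k: per * c / py for k, (c, py) in sorted(totals.items()) if py > 0}

# Hypothetical input rows.
rows = [
    (date(2020, 1, 15), 2, 100.0),
    (date(2020, 2, 15), 4, 100.0),
    (date(2020, 4, 15), 3, 100.0),
]

for granularity in ("year", "quarter", "month"):
    print(granularity, incidence_rates(rows, granularity))
```

The same numerators and denominators feed every granularity, so the finer series can be added without changing how the yearly one is computed.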
The current solution in CohortDiagnostics is that the numerator and denominator are stored at all the various levels of stratification (so both stratified by age and calendar year as well as only stratified by age). If the counts are too low when stratified too far, you still have the unstratified / less-stratified counts. You could do the same thing here.
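A minimal sketch of that idea, assuming records are plain dictionaries and using a single count per cell instead of a separate numerator and denominator. The function names and the example strata values are hypothetical; this is an illustration in Python, not the package's code:

```python
from collections import Counter
from itertools import combinations

def counts_at_all_levels(records, strata):
    """Count records for every subset of `strata`, including the
    empty subset (the fully unstratified total)."""
    tables = {}
    for size in range(len(strata) + 1):
        for subset in combinations(strata, size):
            tables[subset] = Counter(
                tuple(r[s] for s in subset) for r in records
            )
    return tables

def most_detailed_count(tables, cell_values, strata, min_count):
    """Return (subset, count) for the finest stratification whose cell
    reaches `min_count`, falling back to coarser levels; None if even
    the unstratified total is too small."""
    for size in range(len(strata), -1, -1):
        for subset in combinations(strata, size):
            cell = tuple(cell_values[s] for s in subset)
            n = tables[subset][cell]
            if n >= min_count:
                return subset, n
    return None

strata = ("age_group", "calendar_year")
records = (
    [{"age_group": "20-29", "calendar_year": 2019}] * 3
    + [{"age_group": "20-29", "calendar_year": 2020}] * 4
    + [{"age_group": "30-39", "calendar_year": 2020}] * 2
)
tables = counts_at_all_levels(records, strata)
print(most_detailed_count(
    tables, {"age_group": "20-29", "calendar_year": 2019}, strata, 5))
```

Adding calendar month or quarter as one more stratum would fit the same pattern, at the cost of more tables to store.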
There are inherent issues with stratifying per month or even per day. One obvious issue is identifiability of persons, which we solve by using the minCellCount, but at that granularity it would leave most cells labeled as 'too small'. Another is simple file size: stratify too far and the files to share become very big.
One solution could be to compute the features you're interested in, such as periodicity and inflection points, locally using high-granularity data, and communicate those features and lower-granularity data between sites.
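For example, a site could run something like the following on its own high-granularity series and share only the resulting feature. This is a deliberately naive sketch of a single mean-shift search (the split that minimizes the summed squared error of the two segments); the function name and data are hypothetical, and it does not test whether the split is statistically meaningful:

```python
def best_split_index(series):
    """Return the index that best splits `series` into two segments
    with different means, or None if no split improves on one mean."""
    def sse(segment):
        mean = sum(segment) / len(segment)
        return sum((x - mean) ** 2 for x in segment)

    best_index, best_cost = None, sse(series)
    # Keep at least two points on each side of the split.
    for i in range(2, len(series) - 1):
        cost = sse(series[:i]) + sse(series[i:])
        if cost < best_cost:
            best_index, best_cost = i, cost
    return best_index

# Hypothetical daily counts with a level shift after 10 observations.
daily_counts = [5] * 10 + [9] * 10
print(best_split_index(daily_counts))
```

On noisy data almost any series yields some split, so a real implementation would need a penalty or significance test before reporting a change point.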
@schuemie my intuition is to agree with you here. The high-granularity data (in this case, the unit of analysis is calendar_date) needs to be brought into the R environment during data pulls. Strata such as age_group and gender are strata of calendar_date. The file with high-granularity information is meant to be retained at the local site (inside the analytic environment) and is not designed for sharing across sites.
The minimum cell count is applied to this file at analysis time in R. Only the analytic output, after applying the minimum cell count rule, is shared across sites.
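A minimal sketch of that last step, assuming the analytic output is a list of rows with a count column. The threshold of 5, the choice to leave zero counts untouched, and the use of None as the censoring marker are all assumptions for illustration, not the package's actual behavior:

```python
def censor_small_cells(rows, count_field, min_cell_count=5):
    """Return copies of `rows` in which counts between 1 and
    min_cell_count - 1 are replaced by None before sharing."""
    shared = []
    for row in rows:
        row = dict(row)  # copy so the local file is left unchanged
        if 0 < row[count_field] < min_cell_count:
            row[count_field] = None
        shared.append(row)
    return shared

# Hypothetical local analytic output.
local_output = [
    {"calendar_month": "2020-01", "cases": 12},
    {"calendar_month": "2020-02", "cases": 3},
    {"calendar_month": "2020-03", "cases": 0},
]
shared_output = censor_small_cells(local_output, "cases")
print(shared_output)
```

The local rows keep their exact counts; only the censored copy would leave the site.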
Is this accurate?
Yes
We need a framework to interpret the incidence rate plot in CohortDiagnostics.
It is generally useful to review such a plot and infer if the pattern observed is because of
How do we untangle these? Visualization tools such as those used in this paper, which show the diseases observed prior to some cohort entry (in this case cancers), may be useful.