cellclockr
is designed to streamline the analysis of associations between epigenetic clocks and immune cell composition across various exposures and contexts. It allows us to harmonize data products across studies without requiring our collaborators to share raw data.
The package supports data preparation, model fitting, and output generation, including stratified analyses and visualizations. It is built in R and relies on a minimal set of well-established packages to ensure broad compatibility.
This document supports the accompanying package. It guides users through the setup and requirements, provides instructions to run each included function, and displays example output.
If you would rather temporarily share your data and have us perform the analysis, please contact Calen Ryan (cpr2139\@cumc.columbia.edu), Will Marella (wm2530\@cumc.columbia.edu), or Dan Belsky (db3275\@cumc.columbia.edu).
If you don't already have the devtools
package installed, install it now
install.packages("devtools")
Load the library on your instance of R
library(devtools)
Install the cellclockr
package
install_github("CPRyan/cellclockr")
This pipeline was constructed in R version 4.4.1 -- All versions 2023-06-16 or later should be compatible with the packages used.
An effort was made to minimize the use of unusual or uncommon packages. Nevertheless, the analysis relies on certain packages that must be installed prior to execution.
dplyr
: Used for easy data manipulationtidyr
: Used for easy data manipulationggplot2
: Used for data visualization and plot generationpatchwork
: Used for patching together several figures easilybroom
: Used for cleaning up model outputs and make them 'tidy'rlang
: Used for its tools for working with core language features of Rxfun
: Used to create output directoriesPlease ensure these packages are installed prior to analysis.
The analysis in the package requires a dataframe containing columns with the following information:
The package contains 8 core functions that can be used to generate many results from cells and clocks data.
?help
windows are available for each function. To access them, run ?function_name
for the given function. This help window will enumerate all of the arguments available for the function and their correct format. It will also provide a brief description and example use.
This function creates boxplots for cell types and a summary table of boxplot parameters for the entire dataset. If categorical variables are provided, the output will be stratified by levels in each categorical variable.
Usage:
cell_boxplots <- stratified_boxplots_cells(
data,
id,
study,
cell_types,
highlighted_cell_types = NULL,
categorical_variables = NULL,
colors = NULL,
output_dir = NULL,
save_plots = FALSE,
save_summaries = FALSE
)
The example below uses default colors and is stratified by "Categorical Variable" which has two levels. "Neu" was selected as a highlighted cell type due to the difference in its scale relative to the other cell types.
If you are unhappy with the colors, you are able to add your own. You may wish to set your desired color palette at the beginning of your script. Then, you can add it easily by setting the colors
argument equal to your defined palette (colors = my_color_palette
).
Summaries are also saved if save_summaries=TRUE
, and contain the basic quantities associated with each plot. The heading from this example would be structured as:
Cell Type | Categorical Variable | ymin | lower | middle | upper | ymax |
---|---|---|---|---|---|---|
... | ... | ... | ... | ... | ... | ... |
These summaries are stored separately for each plot. So in this example, with a single categorical variable input, two plots and two summaries will be generated, if requested.
Since no output directory was provided (output_dir=NULL
), two folders will be created in the current working directory to save the files, as was requested: cellclockR_output/Plots and cellclockR_output/Summaries for the plots and their respective summaries. Plots are saved as .png, summaries as .csv.
This function creates boxplots for clocks and a summary table of boxplot parameters for the entire dataset. If categorical variables are provided, the output will be stratified by levels in each categorical variable.
Usage:
clock_boxplots <- stratified_boxplots_clocks(
data,
id,
study,
all_clocks,
highlighted_clocks = NULL,
categorical_variables = NULL,
colors = NULL,
output_dir = NULL,
save_plots = FALSE,
save_summaries = FALSE
)
The example below contains the clock boxplot, stratified by the same two-level categorical variable, "Categorical Variable." Because the scale of DunedinPACE differs substantially from the other epigenetic clocks, it was included in highlighted_clocks
to plot on a separate y-axis. The stratified plot is included below:
With save_summaries = FALSE
and save_plots = FALSE
, the output will not be saved. However, if you were to save a summary, it would contain the basic quantities associated with each plot, as it did for cells. The summary for each plot is stored separately. The heading from this example would be structured as:
Clock Type | Categorical Variable | ymin | lower | middle | upper | ymax |
---|---|---|---|---|---|---|
... | ... | ... | ... | ... | ... | ... |
You may set only one of save_summaries
and save_plots
to TRUE
if you are only interested in one or the other.
This function calculates R-squared values for the relationship between age and various cell type and clock measures, both with and without control covariates (Batch, Array, Plate, Row, Column etc.). It produces a table with the R2 values for a basic model (age only), a full model (age plus control covariates), and the difference between them (delta-R2).
Usage:
result <- calculate_r2_w_age(
data,
study,
cell_types,
all_clocks,
age_column,
control_covariates,
output_dir = NULL,
save_results = FALSE
)
If save_results = TRUE
, a table will be saved in the specified output_dir
in the cellclockR/Tables/
folder with the given format:
Cell_or_Clock_Outcome | R2_with_age | R2_w_age_and_control_covariates | Delta_R2 | n_obs |
---|---|---|---|---|
DunedinPACE | 0.099 | 0.122 | 0.023 | 1795 |
PCGrimAge | 0.887 | 0.905 | 0.018 | 1795 |
PCHorvath1 | 0.794 | 0.802 | 0.008 | 1795 |
PCPhenoAge | 0.823 | 0.829 | 0.006 | 1795 |
Bas | 0.000 | 0.001 | 0.002 | 1795 |
Bmem | 0.001 | 0.001 | -0.001 | 1795 |
... | ... | ... | ... | ... |
This function calculates R-squared values for the relationship between user-defined control_covariates (Batch, Array, Plate, Row, Column etc.) and age-adjusted cell type and clock measures, both with and without the additional of user-defined continuous_variables. It produces a table with the R2 values for a basic model (control_covariates only), a full model (control_covariates plus continuous_variables), and the difference between them (delta-R2).
Usage:
result <- calculate_delta_r2_for_continuous(
data,
study,
cell_types,
all_clocks,
continuous_variables,
control_covariates,
age_column,
output_dir = NULL,
save_results = FALSE
)
If save_results = TRUE
, a table will be saved in the specified output_dir
in the cellclockR/Tables/
folder with the given format:
Cell_or_Clock_Outcome | Continuous_Variable | R2_no_continuous_variable | R2_w_continuous_variable | Delta_R2 | n_obs |
---|---|---|---|---|---|
PCHorvath1_resids | BMI | 0.039 | 0.038 | -0.001 | 1795 |
PCHorvath1_resids | SES | 0.039 | 0.038 | 0.000 | 1795 |
PCPhenoAge_resids | BMI | 0.034 | 0.033 | 0.000 | 1795 |
PCPhenoAge_resids | SES | 0.034 | 0.033 | -0.001 | 1795 |
PCGrimAge_resids | BMI | 0.159 | 0.163 | 0.004 | 1795 |
PCGrimAge_resids | SES | 0.159 | 0.159 | 0.000 | 1795 |
... | ... | ... | ... | ... | ... |
In this instance, the continuous variables of interest were "BMI" and "SES".
This function performs regressions with each clock variable against each cell variable, including specified control covariates (e.g. Batch, Array, Plate, Row, Column etc.) and optionally stratified by the levels of specified categorical variables. It calculates the beta coefficients and Wald confidence intervals for each correlation.
Usage:
result <- calculate_univariate_correlations(
data,
study,
cell_types,
all_clocks,
age_column,
stratify_by = NULL,
control_covariates = NULL,
output_dir = NULL,
save_results = FALSE
)
If save_results = TRUE
, a table will be saved in the specified output_dir
in the cellclockR/Tables/
folder with the given format:
Clock_Variable | Categorical_Variable | Categorical_Level | Cell_Variable | Beta | Lower_CI | Upper_CI |
---|---|---|---|---|---|---|
PCPhenoAge_resids | Sex | 1 | Bas_resids | -0.372 | -0.427 | -0.317 |
PCPhenoAge_resids | Sex | 0 | Bas_resids | -0.135 | -0.205 | -0.066 |
PCPhenoAge_resids | Sex | 1 | Bmem_resids | 0.075 | 0.016 | 0.135 |
PCPhenoAge_resids | Sex | 0 | Bmem_resids | 0.085 | 0.015 | 0.155 |
PCPhenoAge_resids | Sex | 1 | Bnv_resids | -0.237 | -0.294 | -0.180 |
PCPhenoAge_resids | Sex | 0 | Bnv_resids | -0.183 | -0.254 | -0.112 |
... | ... | ... | ... | ... | ... |
In this instance, the categorical variable provided in stratify_by
was "Sex".
This function calculates the change in R-squared (delta R2) when adding cell types to models predicting age-adjusted clock measures. It can stratify results by categorical variables and control for specified control_covariates (Batch, Array, Plate, Row, Column etc.).
Usage:
result <- calculate_cell_clock_delta_r2(
data,
study,
cell_types,
all_clocks,
age_column,
stratify_by = NULL,
control_covariates = NULL,
output_dir = NULL,
save_results = FALSE
)
If save_results = TRUE
, a table will be saved in the specified output_dir
in the cellclockR/Tables/
folder with the given format:
Clock_Variable | Categorical_Variable | Categorical_Level | R2_no_cells | R2_cells | Delta_R2 | n_obs |
---|---|---|---|---|---|---|
PCHorvath1_resids | Full Dataset | Full Dataset | -0.001 | 0.235 | 0.235 | 1795 |
PCPhenoAge_resids | Full Dataset | Full Dataset | -0.001 | 0.368 | 0.369 | 1795 |
PCGrimAge_resids | Full Dataset | Full Dataset | -0.001 | 0.287 | 0.288 | 1795 |
DunedinPACE_resids | Full Dataset | Full Dataset | -0.001 | 0.131 | 0.131 | 1795 |
PCHorvath1_resids | Education | 1 | 0.001 | 0.264 | 0.263 | 969 |
PCPhenoAge_resids | Education | 1 | 0.002 | 0.435 | 0.433 | 969 |
... | ... | ... | ... | ... | ... | ... |
In this case, the first categorical variable provided in stratify_by
was "Education".
This function calculates associations between categorical exposures and age-adjusted cell types or clock measures, optionally controlling for control covariates (Batch, Array, Plate, Row, Column etc.). The output provides the beta coefficients and Wald confidence intervals.
Usage:
result <- exposure_cell_clock_associations(
data,
study,
cell_types,
all_clocks,
age_column,
categorical_variables,
control_covariates = NULL,
output_dir = NULL,
save_results = FALSE
)
If save_results = TRUE
, a table will be saved in the specified output_dir
in the cellclockR/Tables/
folder with the given format:
Cell_or_Clock | Exposure | Beta | Lower_CI | Upper_CI |
---|---|---|---|---|
PCHorvath1_resids | Smoking | -0.397 | -0.488 | -0.306 |
PCPhenoAge_resids | Smoking | -0.372 | -0.463 | -0.281 |
PCGrimAge_resids | Smoking | -0.802 | -0.887 | -0.717 |
DunedinPACE_resids | Smoking | -0.322 | -0.414 | -0.230 |
CD4nv_resids | Smoking | 0.601 | 0.513 | 0.690 |
Bas_resids | Smoking | 0.097 | 0.004 | 0.190 |
... | ... | ... | ... | ... |
In this case, the exposure provided in categorical_variables
provided was "Smoking".
This function calculates associations between categorical exposures and age-adjusted clock measures, both with and without adjusting for cell types. It optionally controls for additional control covariates (e.g. Batch, Array, Plate, Row, Column etc.). The output provides the beta coefficients and Wald confidence intervals.
Usage:
result <- exposure_clock_association_with_without_cells(
data,
study,
cell_types,
all_clocks,
age_column,
categorical_variables,
control_covariates = NULL,
output_dir = NULL,
save_results = FALSE
)
If save_results = TRUE
, a table will be saved in the specified output_dir
in the cellclockR/Tables/
folder with the given format:
Clock_Variable | Exposure | Beta_no_cells | Lower_CI_no_cells | Upper_CI_no_cells | Beta_cells | Lower_CI_cells | Upper_CI_cells |
---|---|---|---|---|---|---|---|
PCHorvath1_resids | Female | -0.397 | -0.488 | -0.306 | -0.304 | -0.393 | -0.215 |
PCPhenoAge_resids | Female | -0.372 | -0.463 | -0.281 | -0.172 | -0.253 | -0.091 |
PCGrimAge_resids | Female | -0.802 | -0.887 | -0.717 | -0.623 | -0.705 | -0.541 |
DunedinPACE_resids | Female | -0.322 | -0.414 | -0.230 | -0.152 | -0.247 | -0.056 |
... | ... | ... | ... | ... | ... | ... | ... |
In this case, the exposure provided in categorical_variables
provided was "Female".
For questions or inquiries regarding the package, please contact Calen Ryan (cpr2139@cumc.columbia.edu) or Will Marella (wm2530@cumc.columbia.edu).
You can also open an issue in the cellclockr GitHub repository