broadinstitute / chronos

Modeling of time series data for CRISPR KO experiments
BSD 3-Clause "New" or "Revised" License
28 stars 9 forks source link

Chronos: an algorithm for inferring gene fitness effects from CRISPR knockout experiments.

A full description and benchmarking of Chronos 1 are available in a publication: https://doi.org/10.1186/s13059-021-02540-7

An additional preprint describing the changes made to Chronos 2 will be released when the underlying data is public, expected 2024.

When to use it

Chronos is well suited for any CRISPR KO experiment where:

Chronos may not work well for:

We strongly recommend having at least two sgRNAs per gene. This is true regardless of the algorithm you use.

Chronos is competitive with or superior to the other CRISPR algorithms we tested given readcounts from only one late time point, but it will perform even better with multiple late time points if your experiment has them.

Installation

Note on Mac M1 chips

As of 09/01/2023, pip install tensorflow should work on Macs with arm64.

Installing Chronos

If you have pip installed, you can install Chronos from pyPI with

$ pip install crispr_chronos

However, we recommend downloading this repository as well to run the vignette and download the DepMap trained Chronos parameters.

Chronos model requires python 3 with the packages tensorflow 2.x, numpy, pandas,h5py. However, additional modules require additional packages which will be installed by default if missing: patsy, statsmodels, scipy, matplotlib, seaborn, adjust_text, scikit-learn, umap, reportlab.

Getting Started

If you have jupyter notebook, you should run through Vignette.ipynb. This will both verify that you have a working installation and demonstrate a typical workflow for Chronos. Chronos is meant to be run in a python environment.

To run Chronos, you need a minimum of three Pandas dataframes:

  1. readcounts: A matrix of raw readcounts, where the columns are targeting sgRNAs, the rows are pDNA sequencing samples or replicate samples, and the entries are the number of reads of the given sgRNA in the given sample. Notice that in Chronos matrices, GUIDES and GENES are always COLUMNS and SAMPLES are always ROWS. Readcounts can have null values as long as no column or row is entirely null.

  2. _sequencemap: A table with at least four columns, sequence_ID, cell_line_name, pDNA_batch, and days, mapping sequencing samples to cell lines and pDNA measurements. sequence_ID should match the row names of the raw readcounts. days is the number of days between infection and when the sample was collected, should be integer or float. It will be ignored for pDNA samples. cell_line_name MUST be "pDNA" for pDNA samples. if, instead of pDNA, you are sequencing your cells at a very early time point to get initial library abundance, treat these as pDNA samples. If you don't have either, Chronos may not be the right algorithm for your experiment. pDNA_batch is needed when your experiment combines samples that have different pDNA references (within the same library). This is the case for Achilles because the PCR primer strategy has changed several times during the course of the experiment. pDNA samples belonging to the same batch will be combined into a single reference. If you don't have pDNA batches, just fill this column some value, such as "batch1".

  3. _guide_genemap: A table with at least two columns, sgrna and gene, mapping the sgRNAs to genes. Chronos will not accept sgRNAs that map to more than one gene. This is intentional. sgrna entries should match the columns in raw readcounts. gene can be in any format.

To benefit from improved normalization and allow Chronos to infer the overdispersion of screens, supplying a list or array of negative_control_sgrnas is also necessary. These are simply the sgRNAs which you believe should have no viability effect in any of your screens. It is much better to use cutting than noncutting controls, and as many as possible.

We've found that a small number of clones in CRISPR cell lines will exhibit dramatic outgrowth that seems unrelated to the intended CRISPR perturbation. We recommend you remove these in place by running

import chronos
chronos.nan_outgrowths(readcounts, sequence_map, guide_gene_map)

You can then initialize the Chronos model

model = chronos.Chronos(
    readcounts={'my_library': readcounts},
    sequence_map={'my_library': sequence_map},
    guide_gene_map={'my_library': guide_gene_map},
    negative_control_sgrnas={'my_library': negative_control_sgrnas}
)

This odd syntax is used because it allows you to process results from different libraries at the same time. If you have libraries 1 and 2, and readcounts, sequence maps, guide maps, and negative control sgRNAs for them, you would initialize Chronos as such:

model = chronos.Chronos(
    readcounts={'my_library1': readcounts1, 'my_library2': readcounts2},
    sequence_map={'my_library': sequence_map, 'my_library2': sequence_map2},
    guide_gene_map={'my_library': guide_gene_map, 'my_library2': guide_gene_map2},
    negative_control_sgrnas={'my_library1': negative_control_sgrnas1, 'my_library2': negative_control_sgrnas2}
)

Either way, you can then train Chronos by calling

model.train()

Once the model is trained, you can save all the parameters by calling

model.save("my_save_directory")

You can also directly access model parameters, for example:

gene_effect = model.gene_effect
guide_efficacy = model.guide_efficacy

gene_effect is the primary attribute you will be interested in in 99% of use cases. It is a numerical matrix indexed on rows by cell_line_name and on columns by gene, with values indicating the relative change in growth rate caused by successful knockout of the gene. 0 indicates no change, negative values a loss of viability, and positive values a gain of viability. NaNs in this matrix can occur because no sgRNAs targeting the gene

Note some parameters will be dictionaries or tables, because they are learned separately per library.

If you have labeled gene_level copy number data, Chronos has an option to correct the gene effect matrix. We recommend first globally normalizing the gene effect matrix so the median of all common essential gene scores is -1 and the median of all nonessential genes is 0. Unlike CERES outputs, we do NOT recommend normalizing per cell line. Chronos includes parameters like cell_line_growth_rate and cell_line_efficacy along with other regularization terms that help align data between cell lines.

gene_effect -= gene_effect.reindex(columns=my_nonessential_gene_list).median(axix=1).median()
gene_effect /= gene_effect.reindex(columns=my_essential_gene_list).median(axis=1).abs().median()
gene_effect_corrected, shifts = chronos.alternate_cn(gene_effect, copy_number)
chronos.write_hdf5(gene_effect_corrected, "my_save_directory/gene_effect.hdf5")

The copy number matrix needs to be aligned to the gene_effect_matrix. Additionally, we assume that it is in the current CCLE format: log2(relative CN + 1), where CN 1 means the relative CN matches the reference. This may still work fine with CN with different units, but has not been tested.

New functionality in Chronos 2.x includes two types of quality control reports, one you can run on your raw data, the other on the trained Chronos results, and the ability to load DepMap public Chronos runs and use the trained parameters for processing your own screens (if they are in a public DepMap library, currently just Avana and KY). See the vignette for details on how to do this.

Expected run times

The full Achilles dataset takes 3-4 hours to run a gcloud VM with 52 GB of memory. Training the vignette in this package should take around 2 minutes on a typical laptop.

Other Chronos Options

The Chronos model has a large number of hyperparameters which are described in the model code. Generally we advise against changing these. We've tested them in a wide variety of experimental settings and found the defaults work well. However, a few may be worth tweaking if you want to try and maximize performance. If you do choose to tune the hyperparameters, make sure you evaluate the results with a metric that captures what you really want to get out of the data. We decribe the hyperparameters that might be worth changing here.

Tools that are useful outside of Chronos:

Preprocessing tools:

Postprocessing tools:

QC reports (requires the matplotlib, seaborn, and reportlab packages):

Generally useful functions: