Robust and rapid alignment of single-cell and spatial transcriptomes

CytoSPACE is a novel computational tool for assigning single-cell transcriptomes to in situ spatial transcriptomics (ST) data. Our method solves single cell/spot assignment by minimizing a correlation-based cost function through a shortest augmenting path optimization routine.

The key innovations of our method are:

Unlike conventional methods which calculate cell type decompositions by spot, CytoSPACE yields a reconstructed tissue specimen with both high gene coverage and spatially-resolved scRNA-seq data suitable for downstream analysis.
CytoSPACE is highly robust to noise and returns globally optimal cell-to-spot assignments. (See the paper for full details.)
Unlike other methods which generally operate on pre-selected marker genes or on a shared embedding space (the latter of which can erase true biological variation), CytoSPACE uses the full transcriptome without the need for batch correction, helping it retain sensitivity to subtle cell states.
CytoSPACE is quick and simple to execute. It runs in minutes even with a single CPU on a personal laptop and requires no hyperparameter tuning or gene/feature selection.

CytoSPACE is available through a web interface at cytospace.stanford.edu, which enables users to run CytoSPACE with default settings without downloading the source code.

Installation instructions (5-10 minutes)

Expand section

1. Install Miniconda if not already available. 2. Clone this repository: ```bash git clone https://github.com/digitalcytometry/cytospace ``` 3. Navigate to `cytospace` directory: ```bash cd cytospace ``` 4. (5-10 minutes) Create a conda environment with the required dependencies: ```bash conda env create -f environment.yml ``` 5. Activate the environment you just created: ```bash conda activate cytospace_v1.1.0 ``` `cytospace_v1.1.0` is the environment name in the latest version of the file. If you have a different version, which you can check on the first line of the `environment.yml` file, replace `cytospace_v1.1.0` with your version name in the `conda activate` command. 6. (~30 seconds) Install CytoSPACE by executing: ```bash pip install . ``` 7. (Recommended, ~1 minute) Install package `lapjv` by executing: ```bash pip install lapjv==1.3.14 ``` We highly recommend you install this package, which provides a fast implementation of the default core optimization algorithm within CytoSPACE. __Step 7 may result in an error depending on your system__, as the package requires CPU support for AVX2 instructions that not all systems support. To determine if your system supports this package, it is generally easiest to simply attempt to install it as above. If it installs without problems, your system will support it! If you run into an error, it is likely that your system does not support it, and you can simply use one of the other options we have provided by specifying the `--solver-method` (`-sm`) parameter in your call to CytoSPACE. See [__Running CytoSPACE__](#running-cytospace) - __Choosing a solver__ for details. Please note that, if the package successfully installs but you receive an "illegal instruction" error while running CytoSPACE, you may be able to build the package instead with the following command: ```bash pip3 install git+https://github.com/src-d/lapjv ``` For more information, see the lapjv documentation page.

Input files

Expand section

By default, CytoSPACE requires 4 files as input. All files should be provided in a tab-delimited tabular input format (.txt) with no double quotations. Further formatting details for each input file are specified below. At the end of this section, we also provide instructions on using scripts to generate input files from Seurat objects. 1. __A scRNA-seq gene expression file:__ - The matrix must be genes (rows) by cells (columns). - The first row must contain the single cell IDs and the first column must contain the gene names. - The first column (gene names) must have a header. - The gene expression data should be represented as non-normalized counts. - All instances of duplicate gene names will be dropped at runtime.

2. __A cell type label file:__ - Cell type labels corresponding to the single cell IDs in the scRNA-seq gene expression matrix. - Cell type label strings should not have special characters. - The table should contain two columns, where column 1 contains the single cell IDs corresponding to the columns of the scRNA-seq matrix and column 2 contains the corresponding cell type labels. - The columns must have a header.

3. __A spatial transcriptomics (ST) gene expression file:__ - The matrix must be genes (rows) by ST spots (columns). - The first row must contain the ST spot IDs and the first column must contain the gene names. - The first column (gene names) must have a header. - The gene expression data should be represented as non-normalized counts. - All instances of duplicate gene names will be dropped at runtime.

4. __A spatial transcriptomics coordinates file:__ - A table consisting of 3 columns, where the first column contains the ST spot IDs corresponding to the columns of the ST gene expression matrix and columns 2 and 3 contain the row and column indices of each ST spot, respectively. - The columns must have a header.

### From Space Ranger outputs If the users are starting from Space Ranger outputs, they can provide the ST input files as a single tar.gz, __in place of__ both (3) gene expression and (4) coordinates. If a Space Ranger output is specified, CytoSPACE will automatically attempt to unzip the provided tarball and load the correponding ST expression and coordinates data.
If there are duplicate gene names in the gene expression file, they will be made unique as is done in Seurat. The tarball should only include the following: - A single H5 file (extension .h5) containing the ST gene expression - A single subdirectory containing the image data With the above items inside a directory named `spaceranger_input`, a tarball can be generated using the following command: ```bash tar -cvzf sr_input.tar.gz spaceranger_input ``` Or more generally: ```bash tar -cvzf [name_of_tarball] [name_of_directory] ``` An example file tree for an unzipped tarball is shown below on the left. If downloading from the public 10X Visium data, users can download the files shown below on the right.

Starting with CytoSPACE v1.1.0, the tool supports Space Ranger v2.0.0+ outputs. This is done by detecting files that have updated naming conventions (namely, the `tissue_positions.csv` file), then saving a copy of those files to the same directory with a naming convention that is compatible with scipy's functionality to read Visium datasets. ### Using sparse matrices for gene expression Starting with CytoSPACE v1.0.5, users may also provide the scRNA or ST gene expression as sparse matrices instead of tab- or comma-delimited files.
This will require a very specific set of file names in hopes of avoiding issues with parsing, and we recommend that users use the R helper scripts under `Prepare_input_files` to generate these input. Please see the subsections below for further information about these helper scripts. If providing input as sparse matrices, you will need three files under the same directory to represent one expression matrix: `[expression].mtx`, `genes.tsv`, and `cells.tsv`. The `.mtx` file will list the numeric values in sparse matrix format, and can also be in a compressed `[expression].mtx.gz` format. The filename inside the brackets (`expression`) may vary. The `genes.tsv` and `cells.tsv` will be the corresponding gene names and cell (or spot/barcode) names for the matrix, located in the same directory as the matrix file. Starting CytoSPACE v1.1.0, the tool also supports alternative naming conventions (`features.tsv` and `barcodes.tsv`), that are according to common Space Ranger and Cell Ranger sparse matrix outputs. These files support `.csv` extension, and compressed `.gz` formats as well. The files should have one entry on each line (no headers). If there are multiple columns in any of the files, CytoSPACE will take the first column as the gene/cell names. Please note that for common Cell Ranger and Space Ranger sparse matrix inputs, the first column of the gene file usually has the Ensembl IDs, while the gene symbols are in the second column, so you would need to change the order of the columns if your other input modality contains gene symbols for the gene names. The `[expression].mtx` file should be supplied as the argument to `--st-path` or `--scRNA-path`, in which case CytoSPACE will automatically try to locate the corresponding gene and cell files from the same directory. Please ensure the gene file includes gene names with the same naming convention (e.g., Ensembl gene ID, HGNC gene symbol, etc.) as those in the provided ST data.

Preparing input files from Seurat objects

If you have data in the form of Seurat objects, you can generate files formatted for CytoSPACE input via helper functions we have provided in the `R` script `generate_cytospace_from_seurat_object.R` in `cytospace/Prepare_input_files`. To use these helper functions, first import them from `generate_cytospace_from_seurat_object.R` by including ```bash source('/path/to/generate_cytospace_from_seurat_object.R') ``` in your R script. ### From scRNA-seq Seurat object For producing CytoSPACE inputs from scRNA Seurat objects, we provide the function `generate_cytospace_from_scRNA_seurat_object` which may be called as ```bash generate_cytospace_from_scRNA_seurat_object(scRNA_Seurat_Object, dir_out='.', fout_prefix='', write_sparse=FALSE, rna_assay='RNA') ``` within your R script.
`scRNA_Seurat_Object` (required) : input Seurat object
`dir_out` (optional, default is working directory) : the path to the output directory to store the results
`fout_prefix` (optional, default is none) : a prefix to add to output file names, which otherwise are generated as `scRNA_data.txt` and `cell_type_labels.txt`
`write_sparse` (optional, default is FALSE) : whether to save the expression data in sparse matrix format
`rna_assay` (optional, default is `RNA`) : which assay to take the count matrix from
Please note that `Idents(scRNA_Seurat_Object)` must be set to include cell types. ### From Spatial Seurat object For producing CytoSPACE inputs from ST Seurat objects, we provide the function `generate_cytospace_from_ST_seurat_object` which may be called as ```bash generate_cytospace_from_ST_seurat_object(ST_Seurat_Object, dir_out='.', fout_prefix='', write_sparse=FALSE, slice='slice1') ``` within your R script.
`ST_Seurat_Object` (required) : input Seurat object
`dir_out` (optional, default is working directory) : the path to the output directory to store the results
`fout_prefix` (optional, default is none) : a prefix to add to output file names, which otherwise are generated as `ST_data.txt` and `Coordinates.txt`
`write_sparse` (optional, default is FALSE) : whether to save the expression data in sparse matrix format
`slice` (optional, default is `slice1`) provides the name of your slice as stored in your Seurat object.

Running CytoSPACE

Expand section

After activating the `cytospace` conda environment via `conda activate cytospace`, CytoSPACE can be called from the command line from any folder using `cytospace`. More examples on how to run CytoSPACE are provided in the [__Example dataset for running CytoSPACE__](#example-dataset-for-running-cytospace) section below. A typical CytoSPACE run with default settings would look like this: ```bash cytospace \ --scRNA-path /path/to/scRNA_geneexpression \ --cell-type-path /path/to/scRNA_celllabels \ --st-path /path/to/ST_geneexpression \ --coordinates-path /path/to/ST_coordinates ``` Or with more condensed parameter names: ```bash cytospace \ -sp /path/to/scRNA_geneexpression \ -ctp /path/to/scRNA_celllabels \ -stp /path/to/ST_geneexpression \ -cp /path/to/ST_coordinates ``` Alternatively, if starting from a Space Ranger output, the command may look like this: ```bash cytospace \ --scRNA-path /path/to/scRNA_geneexpression \ --cell-type-path /path/to/scRNA_celllabels \ --spaceranger-path /path/to/spaceranger_output.tar.gz ``` ```bash cytospace -sp /path/to/scRNA_geneexpression \ -ctp /path/to/scRNA_celllabels \ -srp /path/to/spaceranger_output.tar.gz ``` For full usage details with additional options, see the [__Extended usage details__](#extended-usage-details) section below. ### Choosing a solver CytoSPACE provides three solver options. In short, we recommend using the default option `lapjv` if your system supports AVX2 (i.e., if you were able to successfully install it with `pip install lapjv==1.3.14` on Step 7 of the installation) and `lap_CSPR` otherwise. No options are required to use the default solver `lapjv`. To use `lap_CSPR` instead, pass the argument `-sm lap_CSPR` to your `cytospace` call. For full solver details, see the [__CytoSPACE solver options__](#cytospace-solver-options) section below.

Other ways CytoSPACE can be run

- You can import methods or functions from `CytoSPACE` in python and modify/create your own pipeline. For example: ```python from cytospace import cytospace for mean_cell_numbers in [5, 10, 20]: cytospace.main_cytospace(..., mean_cell_numbers=mean_cell_numbers) ```

CytoSPACE outputs

Expand section

CytoSPACE will produce six output files by default. 1. ```cell_type_assignments_by_spot.pdf```
Heatmaps of cell type assignments within the ST sample. Along with a plot showing the total number of cells mapped to each spot, these show the spatial distribution of cell type assignments. Color bars indicate the number of cells of the respective cell type inferred per spot. 2. ```cell_type_assignments_by_spot_jitter.pdf```
A single scatterplot showing all assigned cells by their spot location. Each cell is colored based on its cell type. 3. ```assigned_locations.csv```
This file will provide the assigned locations of each single cell mapped to ST spots. As some cells may be mapped to multiple locations depending on the size of the input scRNA-seq set, new cell IDs (`UniqueCID`) are assigned to each cell and given in the first column. The second column includes original cell IDs (`OriginalCID`); the third column includes corresponding cell types (`CellType`); the fourth column includes assigned spot IDs (`SpotID`); and the fifth and sixth columns respectively include `row` and `column` indices, or xy-coordinates such as `X` and `Y` if provided in the initial coordinates file, of the corresponding spots. 4. ```assigned_expression```, a directory with `barcodes.tsv`, `genes.tsv`, and `matrix.mtx` (CytoSPACE v1.0.6+)
This represents the gene expression of the resulting assignments, with rows as genes and columns as cells (`UniqueCID` of `assigned_locations.csv`), as recovered from the original input scRNA matrix. The expression data can be read in for downstream analyses using functions such as Seurat `Read10X()` or SciPy `io.mmread()`. As with `assigned_locations.csv`, please note that there may be cells that are assigned to multiple locations and therefore appear multiple times in the expression matrix (under different `UniqueCID`s).
For compatibility with the default parameters of Seurat `Read10X()`, `genes.tsv` lists the gene names in its second column, with the first column filled with `NA`s. 5. ```cell_type_assignments_by_spot.csv```
This file gives the raw number of cells of each cell type per spot by `SpotID` as well as the total number of cells assigned to that spot. 6. ```fractional_abundances_by_spot.csv```
This file gives the fractional abundance of cell types assigned to each spot by `SpotID`. 7. ```unassigned_locations.csv```
This file contains the list of spots (locations) where no cells have been assigned by the algorithm. The columns include the spot IDs (rownames), the row (`row`) and column (`col`) indices of the spots, and the total number of cells in each spot (`Number of cells`), which is 0 for all the spots in this file. 8. ```log.txt```
This file contains a log of CytoSPACE run parameters and running time.

Frequently asked questions (FAQ)

Expand section

1. My ST dataset comes from a platform other than 10x Visium. Which additional parameters should I specify?
Please refer to further instructions in the corresponding sections below:
[__Running CytoSPACE on legacy ST data__](#running-cytospace-on-legacy-st-data) and [__Running CytoSPACE on single-cell ST data__](#running-cytospace-on-single-cell-st-data). 2. My scRNA-seq dataset comes in a format other than a UMI count matrix.
You may want to make the following two changes to the default CytoSPACE workflow.
(1) CytoSPACE's internal algorithm for estimating cell type fractions was generally written with a UMI count matrix in mind. As an alternative to using the internal algorithm, you may provide a `--cell-type-fraction-estimation-path` file instead. Please see [__Advanced Options__](#advanced-options) - __User-provided fractional composition of each cell type__ for more information.
(2) CytoSPACE (v1.0.4+) downsamples scRNA-seq datasets to a certain number of transcripts per cell (1500 by default) prior to assignment so that the assignment is not dependent on the total transcript count of each cell. You can turn this feature off by appending a `--downsample-off` flag to the CytoSPACE call, and may instead choose to provide previously downsampled scRNA-seq data as input. 3. During runtime, I get a `Killed` / `Terminated` / `Segmentation fault` / `concurrent.futures.process.BrokenProcessPool` error.
While these errors could arise from a variety of reasons, it is likely that the CytoSPACE run needs more memory than what is available.
We provide a subsampling routine where the ST datasets are partitioned into smaller subsets and evaluated one subset at a time, which reduces memory requirements. Please see [__Advanced Options__](#advanced-options) - __Spot subsampling for parallelization__ for more information.
If you are experiencing this error with a single-cell ST dataset, it will be helpful to reduce the `-noss` parameter instead. 4. My input data are very sparse. Is there a way to provide sparse matrices as input?
As of CytoSPACE v1.1.0, CytoSPACE supports sparse matrices as input, for both single cell and spatial datasets. This applies both to the main cytospace function and to the get_cellfracs_seuratv3 R script for estimation of the fractional composition of each cell type. While the input still needs to follow a fixed logic of file naming, we broadened compatibility to include the most common naming conventions and extensions for different versions of Cell Ranger and Space Ranger sparse matrix outputs. Please refer to the [__Input Files__](#input-files) - __Using sparse matrices for gene expression__ section for details. 5. Providing the output from Space Ranger (v2.0.0+) results in an error.
As of CytoSPACE v1.1.0, CytoSPACE supports the output from Space Ranger (v2.0.0+). Please refer to the [__Input Files__](#input-files) - __From Space Ranger outputs__ section for details.

Example dataset for running CytoSPACE

Expand section

For users to test CytoSPACE, we have included files for an example run: - A HER2+ breast cancer scRNA-seq atlas by Wu et al. (Nature Genetics, 2021) and a HER2+ breast cancer FFPE specimen profiled by the Visium platform (10x Genomics). Default parameters were selected with Visium samples in mind and are appropriate here. ### Download example datasets A zip file containing the example dataset can be downloaded from the following link: - Breast cancer ### Command for running example analysis: Once the example files are downloaded and unzipped, the commands below can be run from inside the unzipped directory: ```bash cytospace -sp brca_scRNA_GEP.txt -ctp brca_scRNA_celllabels.txt -stp brca_STdata_GEP.txt -cp brca_STdata_coordinates.txt -o cytospace_results_brca -sm lap_CSPR ``` Please note that here we use the `lap_CSPR` solver for compatibility. If your system supports AVX2 intrinsics, you can run the same commands without the final argument to use the `lapjv` solver instead. __The CytoSPACE run should take around 5 minutes.__ ### CytoSPACE output files for example breast cancer data The main output from a CytoSPACE run is the file named `assigned_locations.csv`, which provides the ST spots to which the single cells have been assigned.

The CytoSPACE results are visualized in heatmaps saved as `cell_type_assignments_by_spot.pdf` showing the distribution of single cells across ST spots for each cell type. Color bars indicate the number of cells of the respective cell type inferred per spot. Below are the heatmaps produced for the example BRCA data.

For comparison, consider the pathologist annotations of this ST sample as provided by 10x:

CytoSPACE also provides a scatterplot showing cells of all types at once near their spot location, saved as `cell_type_assignments_by_spot_jitter.pdf`. Each cell is colored by its cell type. Below is the plot produced for the example BRCA data.

The number of cells per spot by cell type as well as in total are provided in the file `cell_type_assignments_by_spot.csv`. Fractional abundances of each cell type are returned in the file `fractional_abundances_by_spot.csv`. A log file recording CytoSPACE inputs and running times is output in the file `log.txt`. A zip file of the expected CytoSPACE outputs (with CytoSPACE v1.0.0 using the `lap_CSPR` solver) are available to download at the following link: - Breast cancer results

Simulated datasets

In addition to the example dataset above, the simulated datasets that we have generated for evaluating robustness of CytoSPACE across different conditions are available for download below. They were generated using annotated Slide-seq datasets of mouse cerebellum and hippocampus sections from Rodriques et al. (Science, 2019). Each simulated dataset contains subdirectories for data generated using different spot resolutions (5, 15, and 30 cells per spot), as well as an `scRNA` subdirectory containing reference single-cell datasets with perturbations in a defined percentage of genes. For more information, please see the Methods section of the [paper](https://www.nature.com/articles/s41587-023-01697-9). 1. Cerebellum 2. Hippocampus

Running CytoSPACE on legacy ST data

Expand section

By default, the CytoSPACE parameters have been optimized for standard 10x Visium spatial slides. Datasets generated by the legacy ST platform can be run with similar commands, but we recommend that the following parameters be adjusted: 1. `--mean_cell_numbers`, or `-mcn`, should be set to `20`. The legacy ST platform has larger spot sizes, so we recommend mapping an average of 20 cells per spot. 2. `--geometry`, or `-g` should be set to `square`. This will allow the plot function to shape each spot as a square rather than a hexagon. Similar to the example breast cancer dataset above, we provide an example dataset below: - A melanoma scRNA-seq atlas by Tirosh et al. (Science, 2016), and a melanoma specimen profiled by the legacy ST platform (Thrane et al., Cancer Research, 2018). The zip file containing the dataset can be downloaded here. Running CytoSPACE with the command below generates the results shown here . The format of the output will be the same as the breast cancer dataset above. Please note that here we specify the `-ctfep` parameter instead of using CytoSPACE's internal algorithm for estimating cell fractions (see [__Advanced options__](#advanced-options) - __User-provided fractional composition of each cell type__) as the scRNA-seq atlas used as reference was generated using Smart-seq2. If using CytoSPACE v1.0.4+, the `--downsample-off` flag should additionally be specified. ```bash cytospace -sp melanoma_scRNA_GEP.txt -ctp melanoma_scRNA_celllabels.txt -stp melanoma_STdata_slide1_GEP.txt -cp melanoma_STdata_slide1_coordinates.txt -ctfep melanoma_cell_fraction_estimates.txt -o cytospace_results_melanoma -mcn 20 -g square -sm lap_CSPR ```

Running CytoSPACE on single-cell ST data

Expand section

While designed for Visium-type data in which most spots contain RNA from multiple cells, CytoSPACE can also be used with single-cell resolution spatial data such as Vizgen's MERSCOPE platform. We expect this extension to be useful for reducing noise and expanding transcriptome coverage of each cell in the ST data, which in turn could allow for identifying spatially-dependent changes across genes more diverse than what a typical single-cell resolution ST platform alone can provide. For the single-cell resolution mode, CytoSPACE partitions the ST data into smaller subsets and utilizes multiple CPU cores to assign downsampled versions of the reference scRNA-seq data to these regions. To run CytoSPACE in single-cell mode, please add the following parameters to your command: 1. `--single-cell` (`-sc`), which tells CytoSPACE to run in single-cell mode. 2. Cell types for the ST dataset. Please note that for the single-cell mode, CytoSPACE does not support the internal estimation of cell type fraction, and the users are expected to specify one of the two options below.
(1) `--st-cell-type-path` (`-stctp`)
__If the cell types for individual spots are available, we recommend using this option.__ This file will list the cell type labels for each spot, in the same format as the scRNA-seq cell type labels specified under `--cell-type-path`. All of the cell types present in `--st-cell-type-path` must also be present in `--cell-type-path`.
(2) `--cell-type-fraction-estimation-path` (`-ctfep`)
However, if the user does not have access to the cell types for each individual spot, they can instead use this option. See the [__Advaced Options__](#advanced-options) - __User-provided fractional composition of each cell type__ section regarding how this file should be formatted. 3. `--number-of-processors` (`-nop`), which denotes the number of cores to use. 4. `--number-of-selected-spots` (`-noss`), which denotes the number of ST spots in each subset. We generally recommend `-noss 10000`. To run CytoSPACE with single-cell resolution spatial data: ```bash cytospace --single-cell \ --scRNA-path /path/to/scRNA_geneexpression \ --cell-type-path /path/to/scRNA_celllabels \ --st-path /path/to/ST_geneexpression \ --coordinates-path /path/to/ST_coordinates \ --st-cell-type-path /path/to/ST_celllabels \ --number-of-processors NUMBER_OF_PROCESSORS \ --number-of-selected-spots NUMBER_OF_SELECTED_SPOTS ``` Or with more condensed parameter names: ```bash cytospace -sc \ -sp /path/to/scRNA_geneexpression \ -ctp /path/to/scRNA_celllabels \ -stp /path/to/ST_geneexpression \ -cp /path/to/ST_coordinates \ -stctp /path/to/ST_celllabels \ -nop NUMBER_OF_PROCESSORS \ -noss NUMBER_OF_SELECTED_SPOTS ``` A zip file of example single cell inputs is available to download from Google Drive here. To run CytoSPACE with this example dataset, run the following command from the location of the unzipped inputs and with your CytoSPACE conda environment active: ```bash cytospace \ -sp HumanColonCancerPatient2_scRNA_expressions_cytospace.tsv \ -ctp HumanColonCancerPatient2_scRNA_annotations_cytospace.tsv \ -stp HumanColonCancerPatient2_ST_expressions_cytospace.tsv \ -cp HumanColonCancerPatient2_ST_coordinates_cytospace.tsv \ -stctp HumanColonCancerPatient2_ST_celltypes_cytospace.tsv \ -o cytospace_results_crc \ -sm lap_CSPR \ -sc -noss 10000 -nop 2 ``` Running CytoSPACE in the `--single-cell` mode will output the assignments `assigned_locations.csv`, the plot `cell_type_assignments_by_spot_single_cell.pdf`, and the log file `log.txt`. The plot generated will be a scatterplot of the cells colored by cell type, as shown below for the example dataset. The full results for the example dataset using the above command is available for download here.

Advanced options

While the default options are recommended for most use cases, we do provide additional advanced options.

User-provided fractional composition of each cell type (-ctfep)

To account for the disparity between scRNA-seq and ST data in the number of cells per cell type, CytoSPACE requires the fractional composition of each cell type in the ST tissue. By default, CytoSPACE will generate this information by internally calling the `get_cellfracs_seuratv3.R` script using the input files. This script uses `Seurat v3`, which is installed as part of the CytoSPACE environment. We highly recommend using `Seurat v3` over `Seurat v4` for the purposes of cell type fraction estimation. While our provided script uses Spatial Seurat, there is a diverse set of other approaches available such as cell2location, SPOTlight, or CIBERSORTx. Users may choose to skip CytoSPACE's internal algorithm and instead provide their own file for the estimated cell type composition of the ST dataset, specified with the `--cell-type-fraction-estimation-path` (`-ctfep`) flag. In particular, we recommend that a separate `-ctfep` file be provided if the reference scRNA-seq dataset comes from technologies that are not based on UMI counts, such as Smart-seq. The provided file must be a table consisting of 2 rows with row names, where the first row contains the cell types and the second row contains the cell fractions of each cell type represented as proportions between 0 and 1. __Please make sure that the cell type labels in the first row match the labels present in the cell type label file, and that the cell type fractions sum to one. Row names must be present for both rows.__

User-provided estimates of number of cells per spot (-ncpsp)

Rather than using the internal mechanism of CytoSPACE for estimating the number of cells per spot, users can provide their own estimates (from image segmentation, for example) in a two-column file with header, in which the first column contains spot IDs and the second contains the number of cells predicted per spot:

To run CytoSPACE with this option, pass the flag `-ncpsp` or `--n-cells-per-spot-path` followed by the file location.

Spot subsampling for parallelization (-sss)

The memory and runtime required for running CytoSPACE may vary based on the number of spots. To allow for CytoSPACE to run under different conditions, we provide an option to partition the estimated number of cells in the ST sample into smaller chunks, where similarly downsampled reference scRNA-seq data are then assigned using multiple CPU cores. Please note that this option is only for non-single-cell ST datasets (10x Visium and legacy ST); users running CytoSPACE on single-cell ST data should instead modify the `-noss` parameter to achieve the same effect. The users can use this option by specifying the `--sampling-sub-spots` (`-sss`) flag, along with the desired number of subsampled cells per partition (`--number-of-selected-sub-spots`, or `-nosss`) and the number of cores to be used (`--number-of-processors`, or `-nop`). Please note that the right `-nosss`/`-nop` combination will be highly dependent on the user's system configurations, and one may need to try different combinations to see which one allows for a successful run. Ideally, using the highest possible `-nosss` without going out of memory would be the most effective. We noticed that `-sss -nosss 3000 -nop 2` works well for the example BRCA dataset on an environment with 8GB RAM. The following example command will run CytoSPACE on the example breast cancer dataset, assigning scRNA-seq data to 5000 cells at a time using 2 cores: ```bash cytospace \ -sp brca_scRNA_GEP.txt \ -ctp brca_scRNA_celllabels.txt \ -stp brca_STdata_GEP.txt \ -cp brca_STdata_coordinates.txt \ -o cytospace_results_brca \ -sm lap_CSPR \ -sss -nosss 5000 -nop 2 ```

Alternative distance metric (-dm)

By default, CytoSPACE uses Pearson correlation to compare cell and spot transcriptomes. Users can choose to use Spearman correlation or Euclidean distance instead by passing `-dm Spearman_correlation` or `-dm Euclidean` respectively with the function call.

Setting a new random seed (-se)

While the CytoSPACE algorithm is mostly deterministic, the initial step of sampling cells to be mapped is done at random. To provide an alternative random seed resulting in a different random sampling of cells, users can pass `-se` followed by the desired (integer) seed with the function call. The default random seed for CytoSPACE is 1.

Alternative handling of sampling (-sam)

CytoSPACE starts by creating a pool of cells that matches what is expected within the ST data. By default, this is done by resampling single cells to achieve the overall cell type fractions and total cell numbers estimated in the tissue. We recommend that CytoSPACE be run with this default setting for all real data analyses. However, we provide an additional option to generate new "place-holder" cells by sampling from the distribution of gene counts within each cell type instead, and used this option for ensuring uniqueness of mapped cells for benchmarking on simulated data. To run CytoSPACE with this alternative mode, users can pass `-sam place_holders` with the function call. When running in place-holder mode, the `assigned_expression` directory will not be generated; instead, the gene expression of the newly generated "place-holder" cells will be saved as part of the output under `new_scRNA.csv`.

Method extension: mapping quality

While CytoSPACE's formulation as a linear assignment problem guarantees an optimal solution given its cost function, there is no underlying probabilistic framework for estimating mapping uncertainty. One possibility is to determine whether a given cell type belongs to a given spot after mapping - that is, whether a spot contains at least one cell of the same cell type. Notably, this does not distinguish between cells of the same cell type for quality of fit. As such a protocol provides some measure of mapping quality, albeit incomplete, we provide a helper script that implements this via a support vector machine that produces and trains on pseudo-bulks generated from the input scRNA-seq data. This script, `uncertainty_quantification.R`, takes as input the path to the ST dataset count matrix file, the scRNA-seq count matrix file, and the CytoSPACE output file `assigned_locations.csv`, and returns an appended output file with confidence scores in `assigned_locationswConfidenceScores.csv`. The command to run this script following a completed CytoSPACE run is as follows: ```bash Rscript uncertainty_quantification.R /path/to/ST_geneexpression /path/to/scRNA_geneexpression /path/to/assigned_locations.csv ``` For interpreting confidence scores, we recommend a cutoff of 0.1, with higher scores indicating increased confidence that a spot contains at least one cell of the same cell type. Please note that `uncertainty_quantification.R` requires separate dependencies from those included in the provided `environment.yml` file for the `cytospace` conda environment. This script should be run in a separate environment with the following R packages installed: `Seurat` (must be v4; tested with v4.0.1), `data.table` (tested with v1.14.0), and `e1071` (tested with v1.7.8).

Extended usage details

Expand section

``` usage: cytospace [-h] -sp SCRNA_PATH -ctp CELL_TYPE_PATH [-stp ST_PATH] [-cp COORDINATES_PATH] [-srp SPACERANGER_PATH] [-stctp ST_CELL_TYPE_PATH] [-ctfep CELL_TYPE_FRACTION_ESTIMATION_PATH] [-ncpsp N_CELLS_PER_SPOT_PATH] [-o OUTPUT_FOLDER] [-op OUTPUT_PREFIX] [-mcn MEAN_CELL_NUMBERS] [--downsample-off] [-smtpc SCRNA_MAX_TRANSCRIPTS_PER_CELL] [-sc] [-noss NUMBER_OF_SELECTED_SPOTS] [-sss] [-nosss NUMBER_OF_SELECTED_SUB_SPOTS] [-nop NUMBER_OF_PROCESSORS] [-sm {lapjv,lapjv_compat,lap_CSPR}] [-dm {Pearson_correlation,Spearman_correlation,Euclidean}] [-sam {duplicates,place_holders}] [-se SEED] [-p] [-g GEOMETRY] [-nc NUM_COLUMN] [-mp MAX_NUM_CELLS_PLOT] CytoSPACE is a computational strategy for assigning single-cell transcriptomes to in situ spatial transcriptomics (ST) data. Our method solves single cell/spot assignment by minimizing a correlation-based cost function through a linear programming-based optimization routine. optional arguments: -h, --help show this help message and exit -stp ST_PATH, --st-path ST_PATH Path to spatial transcriptomics data (expressions) -cp COORDINATES_PATH, --coordinates-path COORDINATES_PATH Path to transcriptomics data (coordinates) -srp SPACERANGER_PATH, --spaceranger-path SPACERANGER_PATH Path to SpaceRanger tar.gz data file -stctp ST_CELL_TYPE_PATH, --st-cell-type-path ST_CELL_TYPE_PATH Path to ST cell type file (recommended for single-cell ST) -ctfep CELL_TYPE_FRACTION_ESTIMATION_PATH, --cell-type-fraction-estimation-path CELL_TYPE_FRACTION_ESTIMATION_PATH Path to ST cell type fraction file (recommended for bulk ST) -ncpsp N_CELLS_PER_SPOT_PATH, --n-cells-per-spot-path N_CELLS_PER_SPOT_PATH Path to number of cells per ST spot file -o OUTPUT_FOLDER, --output-folder OUTPUT_FOLDER Relative path to the output folder -op OUTPUT_PREFIX, --output-prefix OUTPUT_PREFIX Prefix of results stored in the 'output_folder' -mcn MEAN_CELL_NUMBERS, --mean-cell-numbers MEAN_CELL_NUMBERS Mean number of cells per spot, default 5 (appropriate for Visium). If analyzing legacy spatial transcriptomics data, set to 20 --downsample-off Turn off downsampling for scRNA-seq data -smtpc SCRNA_MAX_TRANSCRIPTS_PER_CELL, --scRNA_max_transcripts_per_cell SCRNA_MAX_TRANSCRIPTS_PER_CELL Number of transcripts per cell to downsample scRNA-seq dataset to. This allows for assignments that are not dependent on the overall expression level -sc, --single-cell Use single-cell spatial approach if specified -noss NUMBER_OF_SELECTED_SPOTS, --number-of-selected-spots NUMBER_OF_SELECTED_SPOTS Number of selected spots from ST data used in each iteration -sss, --sampling-sub-spots Sample subspots to limit the number of mapped cells if specified -nosss NUMBER_OF_SELECTED_SUB_SPOTS, --number-of-selected-sub-spots NUMBER_OF_SELECTED_SUB_SPOTS Number of selected subspots from ST data to limit the number of mapped cells -nop NUMBER_OF_PROCESSORS, --number-of-processors NUMBER_OF_PROCESSORS Number of processors used for the analysis -sm {lapjv,lapjv_compat,lap_CSPR}, --solver-method {lapjv,lapjv_compat,lap_CSPR} Which solver to use for the linear assignment problem, default 'lapjv' -dm {Pearson_correlation,Spearman_correlation,Euclidean}, --distance-metric {Pearson_correlation,Spearman_correlation,Euclidean} Which distance metric to use for the cost matrix, default 'Pearson_correlation' -sam {duplicates,place_holders}, --sampling-method {duplicates,place_holders} Which underlying method to use for dealing with duplicated cells, default 'duplicates' -se SEED, --seed SEED Set seed for random generators, default 1 -p, --plot-off Turn create plots on/off -g GEOMETRY, --geometry GEOMETRY ST geometry, either 'honeycomb' or 'square' accepted -nc NUM_COLUMN, --num-column NUM_COLUMN Number of columns in figure -mp MAX_NUM_CELLS_PLOT, --max-num-cells-plot MAX_NUM_CELLS_PLOT Maximum number of cells to plot in single-cell visualization Required arguments: -sp SCRNA_PATH, --scRNA-path SCRNA_PATH Path to scRNA-Seq data -ctp CELL_TYPE_PATH, --cell-type-path CELL_TYPE_PATH Path to cell type labels ``` You can see this list of variables and default values for running CytoSPACE from the commmand line as well at any time by calling `cytospace` along with the `-h` or `--help` flag, i.e., `cytospace -h`.

CytoSPACE solver options

Expand section

1. `lapjv` __(Recommended for most systems)__ By default, CytoSPACE calls the `lapjv` solver from package `lapjv`. This solver is a fast implementation of the Jonker-Volgenant shortest augmenting path assignment algorithm and returns a globally optimal solution given the objective function as defined in our [paper](https://www.nature.com/articles/s41587-023-01697-9). As noted above, however, this package is not supported on all systems as it achieves its speedup through use of AVX2 instructions. This solver will be selected by default and can be specified explicitly by passing arguments `--solver-method lapjv` or `-sm lapjv` to `cytospace`. 2. `lap_CSPR` __(Recommended for systems not supporting `lapjv`)__ A second solver option is the `linear_assignment` method from the `ortools` package. This solver uses a different method than the first and third options, an assignment algorithm called the cost scaling push relabel method. This algorithm approximates assignment costs to integer values and loses some numerical precision in doing so. Therefore, while it returns a globally optimal solution __after approximation__ given the objective function defined in the paper, it will return similar but generally not identical results to the first two methods. This solver has a similar running time to the first option and is a good option for systems not supporting the `lapjv` package. This solver can be selected by passing arguments `--solver-method lap_CSPR` or `-sm lap_CSPR` to `cytospace`. 3. `lapjv_compat` A third solver option implements the `lapjv` solver from package `lap`. Like the first option `lapjv`, this solver also implements the Jonker-Volgenant shortest augmenting path assignment algorithm to return the same globally optimal solution given the objective function defined in the paper. Furthermore, it is broadly supported and should work on all standard operating systems. However, it takes 3-4 times as long to run as the first solver option (the `lapjv` solver from the `lapjv` package), so we only recommend it for systems that do not support the first option. This solver can be selected by passing arguments `--solver-method lapjv_compat` or `-sm lapjv_compat` to `cytospace`.

Updating local installations

Expand section

To update your local installation of CytoSPACE following updates of this GitHub repository, navigate to your `cytospace` directory and execute the following commands: ```bash git pull conda env update --name cytospace --file environment.yml conda activate cytospace pip install . ``` If you have made local updates to your version of the CytoSPACE source code, you should execute ```bash pip install . ``` before running.

Authors

CytoSPACE was developed in the Newman Lab by

Milad R. Vahid (miladrv)
Erin L. Brown (erinlbrown)
Chloé B. Steen (cbsteen)
Wubing Zhang (WubingZhang)
Hyun Soo Jeon (hsjeon-k)
Aaron M. Newman (aaronmnewman)

Contact

If you have any questions, please contact the CytoSPACE team at cytospaceteam@gmail.com.

License

Please see the LICENSE file.

Citation

If you use CytoSPACE, please cite:

High-resolution alignment of single-cell and spatial transcriptomes with CytoSPACE (Nature Biotechnology 2023)
Milad R. Vahid, Erin L. Brown, Chloé B. Steen*, Wubing Zhang, Hyun Soo Jeon, Minji Kang, Andrew J. Gentles, Aaron M. Newman.
https://www.nature.com/articles/s41587-023-01697-9

digitalcytometry / cytospace

readme

Robust and rapid alignment of single-cell and spatial transcriptomes

Installation instructions (5-10 minutes)

Input files

Running CytoSPACE

CytoSPACE outputs

Frequently asked questions (FAQ)

Example dataset for running CytoSPACE

Running CytoSPACE on legacy ST data

Running CytoSPACE on single-cell ST data

Advanced options

Extended usage details

CytoSPACE solver options

Updating local installations

Authors

Contact

License

Citation