For more details of our method, please refer to: https://doi.org/10.1038/s41467-023-44560-w
Recent advances in subcellular imaging transcriptomics platforms have enabled spatial mapping of the expression of hundreds of genes at subcellular resolution and provide topographic context to the data. This has created a new data analytics challenge: correctly identifying cells and accurately assigning transcripts, so that all available data can be utilised. To this end, we introduce BIDCell, a self-supervised deep learning-based framework that incorporates cell type and morphology information via novel biologically-informed loss functions. We also introduce CellSPA, a comprehensive evaluation framework consisting of metrics in five complementary categories for cell segmentation performance. We demonstrate that BIDCell outperforms other state-of-the-art methods according to many CellSPA metrics across a variety of tissue types and technology platforms, including 10x Genomics Xenium. Taken together, we find that BIDCell can facilitate single-cell spatial expression analyses, including cell-cell interactions, with great potential for biological discovery.
Note: A GPU with at least 12GB VRAM is strongly recommended for the deep learning component, and 32GB RAM for data processing. We ran BIDCell on a Linux system with a 12GB NVIDIA GTX Titan V GPU, Intel(R) Core(TM) i9-9900K CPU @ 3.60GHz with 16 threads, and 64GB RAM.
conda create --name bidcell python=3.10
conda activate bidcell
python -m pip install bidcell
Installation of dependencies typically requires a few minutes.
Note: We are actively finding and fixing issues. If you encounter [xcb] Unknown sequence number while processing queue, try running without a GUI, e.g. through PuTTY. Please let us know of any other issues you find. Thank you.
Please download the BIDCell GitHub repository first. A small subset of Xenium breast cancer data is provided as a demo. The example .yaml file can be found in bidcell/example_params/small_example.yaml. Use the following to run all the steps and verify installation:
python example_small.py
Or:
from bidcell import BIDCellModel
BIDCellModel.get_example_data()
model = BIDCellModel("params_small_example.yaml")
model.run_pipeline()
Parameters are defined in .yaml files. Examples are provided for four major platforms: Xenium, CosMx, MERSCOPE, and Stereo-seq. BIDCell may also be applied to data from other technologies such as MERFISH.
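As a rough illustration, the parameters referenced throughout this guide (cpus, the files section, model_params.name) sit in a structure like the following. The key names for the transcript and DAPI paths are hypothetical here, so always start from a vendor example obtained via get_example_config() rather than writing a config from scratch:

```yaml
# Illustrative sketch only -- exact key names may differ from the shipped examples
cpus: 8                                # worker processes; lower this if your system freezes

files:
  fp_transcripts: transcripts.csv.gz   # hypothetical key for the transcripts file
  fp_dapi: morphology_mip.ome.tif      # hypothetical key for the DAPI image

model_params:
  name: unet3plus                      # hypothetical value; UNet3+ is the default backbone
```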
Note: Please modify cpus to suit your system. Higher cpus values allow faster runtimes but may freeze your system.
Run the following to obtain examples:
from bidcell import BIDCellModel
BIDCellModel.get_example_config("xenium")
BIDCellModel.get_example_config("cosmx")
BIDCellModel.get_example_config("merscope")
BIDCellModel.get_example_config("stereoseq")
This will copy the .yaml for the respective vendor into your working directory, for example xenium_example_config.yaml.
The full dataset (Xenium Output Bundle In Situ Replicate 1) may be downloaded from https://www.10xgenomics.com/products/xenium-in-situ/preview-dataset-human-breast. The breast cancer reference data are provided with this package under data/sc_references, or ./example_data/sc_references if you have run example_small.py. Please ensure the correct paths are provided for the parameters under files in xenium_example_config.yaml, in particular, the paths for the transcripts (transcripts.csv.gz) and DAPI (morphology_mip.ome.tif) files.
To run the entire pipeline (data processing, training, prediction, and extracting the cell-gene matrix):
from bidcell import BIDCellModel
model = BIDCellModel("xenium_example_config.yaml")
model.run_pipeline()
Alternatively, the pipeline can be broken down into 3 main stages:
from bidcell import BIDCellModel
model = BIDCellModel("xenium_example_config.yaml")
model.preprocess()
model.train()
model.predict()
Or, functions in preprocess can be called individually:
from bidcell import BIDCellModel
model = BIDCellModel("xenium_example_config.yaml")
# model.stitch_nuclei() # for when nuclei images are separated into FOVs (e.g., CosMx)
model.segment_nuclei()
model.generate_expression_maps()
model.generate_patches()
model.make_cell_gene_mat(is_cell=False)
model.preannotate()
model.train()
model.predict()
If your machine/server has multiple GPUs, you may select which GPU to use via the CUDA_VISIBLE_DEVICES environment variable, e.g. for GPU ID 3:
CUDA_VISIBLE_DEVICES=3 python example_small.py
BIDCell uses single-cell reference data for improved results. These can be downloaded from public repositories such as TISCH2, Allen Brain Map, and the Human Cell Atlas.
Please see the provided breast cancer single-cell reference and positive/negative marker files (sc_breast.csv, sc_breast_markers_pos.csv, and sc_breast_markers_neg.csv) as a template.
The reference csv file contains the average expression of every gene in the spatial transcriptomic dataset for each cell type. You may choose an appropriate list of cell types to include for your data.
The positive and negative marker files contain the respective marker genes for each cell type. Positive and negative markers were defined as the genes with expression in the top and bottom 10th percentile, respectively, for each cell type of a tissue sample. We found that removing positive markers common to at least a third of the cell types in a dataset worked well across various datasets. Using a larger number of positive markers tends to increase the size of predicted cells. Manual curation or alternative approaches to determining the marker genes can also be used.
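As a sketch of how such marker files could be derived from a reference table, assuming hypothetical gene and cell-type names (the percentile and one-third thresholds follow the rules described above, but this is not the package's own implementation):

```python
import numpy as np
import pandas as pd
from collections import Counter

# Hypothetical reference of average expression: cell types (rows) x genes (columns).
# The real sc_breast.csv follows this idea, but check its actual layout.
rng = np.random.default_rng(0)
genes = [f"gene_{i}" for i in range(20)]
cell_types = ["T", "B", "NK", "Epithelial", "Fibroblast", "Endothelial"]
ref = pd.DataFrame(rng.random((len(cell_types), len(genes))),
                   index=cell_types, columns=genes)

# Positive/negative markers: top and bottom 10th percentile of each cell type's profile
pos = {ct: set(row[row >= row.quantile(0.9)].index) for ct, row in ref.iterrows()}
neg = {ct: set(row[row <= row.quantile(0.1)].index) for ct, row in ref.iterrows()}

# Remove positive markers shared by at least a third of cell types
counts = Counter(g for s in pos.values() for g in s)
thresh = len(cell_types) / 3  # = 2 cell types here
pos = {ct: {g for g in s if counts[g] < thresh} for ct, s in pos.items()}
```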
Fewer than 1,000 genes are needed to perform segmentation. Specify a selection of genes in a file (see the Stereo-seq example).
Some example reference data may be found here: https://github.com/SydneyBioX/scClassify?tab=readme-ov-file#pretrained-models
The default architecture is UNet3+ (https://arxiv.org/abs/2004.08790), which we have found to perform well across different technologies and tissue types. To use a different architecture, set model_params.name in the .yaml file to an encoder from https://segmentation-modelspytorch.readthedocs.io/en/latest/index.html, or define your own in the SegmentationModel class in model.py.
If you receive the error pickle.UnpicklingError: pickle data was truncated, try reducing cpus.
Performing segmentation at a higher resolution requires a larger patch size, thus more GPU memory.
Expected outputs:
- Cell segmentation saved as a .tif ending in _connected.tif, e.g.:
dataset_xenium_breast1_small/model_outputs/2023_09_06_11_55_24/test_output/epoch_{test_epoch}_step_{test_step}_connected.tif
- expr_mat.csv containing the gene expressions of segmented cells, e.g.:
dataset_xenium_breast1_small/cell_gene_matrices/2023_09_06_11_55_24/expr_mat.csv
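As a quick sanity check, the cell-gene matrix can be loaded with pandas. The column names below are a hypothetical stand-in, so inspect the header of your own expr_mat.csv for the actual schema:

```python
import io
import pandas as pd

# Miniature stand-in for expr_mat.csv; cell_id and gene columns are hypothetical.
csv_text = """cell_id,GeneA,GeneB,GeneC
1,5,0,2
2,0,3,1
"""
expr = pd.read_csv(io.StringIO(csv_text), index_col="cell_id")
total_counts = expr.sum(axis=1)  # total transcripts per segmented cell
```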
Expected runtime (based on our system for the Xenium breast cancer dataset):
The BIDCell output .tif segmentation can be used with Xenium Ranger and then viewed in Xenium Explorer. The .tif file first needs to be resized to the same dimensions as the DAPI image (morphology_mip.ome.tif):
import cv2
import numpy as np
# w_dapi, h_dapi: width and height of the DAPI image
cells = cv2.resize(cells.astype('float32'), (w_dapi, h_dapi), interpolation=cv2.INTER_NEAREST)
cells = cells.astype(np.uint32)
The resized segmentation can then be used as the input file to the --cells argument of xeniumranger import-segmentation. The same applies to the nuclei from BIDCell.
If you have any enquiries, especially about using BIDCell to segment cells in your data, please contact xiaohang.fu@sydney.edu.au. We are also happy to receive any suggestions and comments.
If BIDCell has assisted you with your work, please kindly cite our paper:
Fu, X., Lin, Y., Lin, D., Mechtersheimer, D., Wang, C., Ameen, F., Ghazanfar, S., Patrick, E., Kim, J., & Yang, J. Y. H. BIDCell: Biologically-informed self-supervised learning for segmentation of subcellular spatial transcriptomics data. Nat Commun 15, 509 (2024). https://doi.org/10.1038/s41467-023-44560-w