Version 3.0 of this repository contains the code to reproduce the analysis presented in our Genome Biology paper on UMI data normalization (Lause, Berens & Kobak, 2021) and the corresponding preprint (v3). The code used for versions v1 and v2 of the paper is available under the tags 1.0 and 2.0 in this repository.
To start, follow these steps:

- In tools.py, adapt the three import paths as needed

Then, you can step through our full analysis by following the sequence of the notebooks. If you want to reproduce only parts of our analysis, there are six independent analysis pipelines that you can run individually:

- Notebooks 01 & 02 (producing Figure 1 from our paper)
- Notebooks 01 & 03 (producing Figure S1)
- Notebooks 01, 041, 042, 05 (producing Figures 2, S2, S4, S5, and additional figures)
- Notebooks 06, 07, 081 (producing Figures 3, S3, and additional figures)
- Notebooks 101, 102 (producing Figures 5 and S7)
- Notebook 091 (producing Figures 4 and S6, and additional figures)
- Notebooks 06, 07, 081 and 082 for the retina datasets, 091 and 092 for the organogenesis dataset, and 101 and 103 for the benchmarking FACS-sorted PBMCs (producing additional figures). These pipelines will require you to run Sanity from the command line; see notebooks 082 and 092 for instructions.

Note that 041 and 101 are R notebooks; the remaining ones are Python notebooks.
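
The three import paths in tools.py are not shown here. As a rough illustration only, assuming they point to local clones of the packages that are obtained from GitHub (FIt-SNE, glmpca-py and rna-seq-tsne), directories of this kind can be made importable as sketched below. The directory locations and the helper name `add_import_paths` are hypothetical and not taken from tools.py.

```python
import sys
from pathlib import Path

# Hypothetical clone locations (assumption); replace them with the
# directories where the packages actually live on your machine.
LOCAL_PACKAGE_DIRS = [
    Path.home() / "software" / "FIt-SNE",
    Path.home() / "software" / "glmpca-py",
    Path.home() / "software" / "rna-seq-tsne",
]


def add_import_paths(dirs):
    """Append each directory to sys.path once and return the newly added ones."""
    added = []
    for directory in dirs:
        directory = str(directory)
        if directory not in sys.path:
            sys.path.append(directory)
            added.append(directory)
    return added


add_import_paths(LOCAL_PACKAGE_DIRS)
```

Appending instead of prepending keeps installed packages ahead of the local clones if the same module name exists in both places.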
Each of the analyses first preprocesses and filters the datasets. Next, computationally expensive tasks are run (NB regression fits, GLM-PCA, t-SNE, simulations of negative control data, etc.) and the results are saved as files. For some analyses, this is done in separate notebooks. Finally, the result files are loaded for plotting (again in separate notebooks for some analyses).
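
The compute-once, save, then load-for-plotting pattern described above can be sketched as follows. This is a generic illustration, not the code used in the notebooks: the helper name `load_or_compute` and the use of pickle files are assumptions.

```python
import pickle
from pathlib import Path


def load_or_compute(path, compute):
    """Return the result stored at `path`; compute and save it first if missing.

    `compute` is any zero-argument function that runs the expensive step.
    """
    path = Path(path)
    if path.exists():
        # Result file already present: skip the expensive computation.
        with path.open("rb") as f:
            return pickle.load(f)
    result = compute()
    path.parent.mkdir(parents=True, exist_ok=True)
    with path.open("wb") as f:
        pickle.dump(result, f)
    return result
```

A plotting notebook would then call `load_or_compute` with the same path and only trigger a recomputation if the result file has been deleted.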
We recommend running the code on a powerful machine with at least 250 GB RAM.
For questions or feedback, feel free to use the issue system or email us.
We used the following software environments:

Python:
- 3.8.0
- 7.21.0
- numpy 1.20.1
- pandas 1.2.0
- scipy 1.6.0
- seaborn 0.11.1
- matplotlib 3.3.3
- statsmodels 0.12.2
- sklearn 0.24.0
- anndata 0.7.5 from https://anndata.readthedocs.io
- scanpy 1.7.1 from https://scanpy.readthedocs.io
- FIt-SNE by George C. Linderman from https://github.com/KlugerLab/FIt-SNE, using this version: https://github.com/KlugerLab/FIt-SNE/tree/47ff14f1defc1ff3a8065c8b7baaf45c33e7e0b2
- FFTW 3.3.8 from http://www.fftw.org
- rpy2 3.4.2 from https://rpy2.github.io/
- glmpca-py by Will Townes from https://github.com/willtownes/glmpca-py/, using this version: https://github.com/willtownes/glmpca-py/tree/a6fc417b08ab5bc21d8ac9e197f4f5518d093385
- rna-seq-tsne by Dmitry Kobak from https://github.com/berenslab/rna-seq-tsne, using this version: https://github.com/berenslab/rna-seq-tsne/tree/21e3601782d37dd3f0c8e02ed9f239b005c4100f

R:
- 3.6.3 (2020-02-29)
- glmpca 0.2.0 by Will Townes from https://github.com/willtownes/glmpca, https://github.com/willtownes/glmpca/releases/tag/v0.2.0. If it is not present in the local R installation, our code will try to install the most recent version from CRAN.
- SingleCellExperiment 1.8.0
- MASS 7.3-53.1
- sctransform 0.3.2 by Christoph Hafemeister (https://github.com/ChristophH/sctransform)

The full R environment used was:
attached base packages:
parallel stats4 stats graphics grDevices utils datasets methods base
other attached packages:
MASS_7.3-53.1 sctransform_0.3.2 SingleCellExperiment_1.8.0 SummarizedExperiment_1.16.1
DelayedArray_0.12.3 BiocParallel_1.20.1 matrixStats_0.58.0 Biobase_2.46.0
GenomicRanges_1.38.0 GenomeInfoDb_1.22.1 IRanges_2.20.2 S4Vectors_0.24.4
BiocGenerics_0.32.0 glmpca_0.2.0
loaded via a namespace (and not attached):
tidyselect_1.1.0 listenv_0.8.0 purrr_0.3.4 reshape2_1.4.4 lattice_0.20-41 colorspace_2.0-0
vctrs_0.3.7 generics_0.1.0 utf8_1.2.1 rlang_0.4.10 pillar_1.6.0 glue_1.4.2
DBI_1.1.1 GenomeInfoDbData_1.2.2 lifecycle_1.0.0 plyr_1.8.6 stringr_1.4.0 zlibbioc_1.32.0
munsell_0.5.0 gtable_0.3.0 future_1.21.0 codetools_0.2-18 fansi_0.4.2 Rcpp_1.0.6
scales_1.1.1 XVector_0.26.0 parallelly_1.24.0 gridExtra_2.3 ggplot2_3.3.3 digest_0.6.27
stringi_1.5.3 dplyr_1.0.5 grid_3.6.3 tools_3.6.3 bitops_1.0-6 magrittr_2.0.1
RCurl_1.98-1.3 tibble_3.1.1 crayon_1.4.1 future.apply_1.7.0 pkgconfig_2.0.3 ellipsis_0.3.1
Matrix_1.3-2 assertthat_0.2.1 R6_2.5.0 globals_0.14.0 compiler_3.6.3
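
To compare a local Python environment with the Python package versions listed above, a small check along these lines may help. It is a sketch under assumptions: the helper name `compare_versions` is hypothetical, only packages that can be queried through `importlib.metadata` are included, and sklearn is looked up under its distribution name scikit-learn.

```python
from importlib.metadata import PackageNotFoundError, version

# Versions copied from the list above. "scikit-learn" is the distribution
# name under which sklearn is installed.
EXPECTED = {
    "numpy": "1.20.1",
    "pandas": "1.2.0",
    "scipy": "1.6.0",
    "seaborn": "0.11.1",
    "matplotlib": "3.3.3",
    "statsmodels": "0.12.2",
    "scikit-learn": "0.24.0",
    "anndata": "0.7.5",
    "scanpy": "1.7.1",
    "rpy2": "3.4.2",
}


def compare_versions(expected):
    """Map each package to (expected, installed); installed is None if missing."""
    report = {}
    for name, wanted in expected.items():
        try:
            installed = version(name)
        except PackageNotFoundError:
            installed = None
        report[name] = (wanted, installed)
    return report


if __name__ == "__main__":
    for name, (wanted, installed) in compare_versions(EXPECTED).items():
        status = "ok" if installed == wanted else "differs"
        print(f"{name}: expected {wanted}, found {installed} ({status})")
```

A differing version does not necessarily break the analysis; the report only shows where a local setup deviates from the one we used.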
All accession numbers can also be found in Table S2 of our paper.
- Save genes.tsv and matrix.mtx to umi-normalization/datasets/33k_pbmc/
- Save analysis to umi-normalization/datasets/33k_pbmc/ as well
- Save svensson_chromium_control.h5ad (18.82 MB) to umi-normalization/datasets/10x/

GSE65525:
- Download the *.csv.bz2 file for the sample GSM1599501 (human K562 pure RNA control, 953 samples, 5.1 MB)
- Extract GSM1599501_K562_pure_RNA.csv to umi-normalization/datasets/indrop/

GSE108097:
- Find sample GSM2906413 and download GSM2906413_EmbryonicStemCell_dge.txt.gz (EmbryonicStemCell.E14, 7.9 MB)
- Save it to umi-normalization/datasets/microwellseq/

GSE63472:
- Download GSE63472_P14Retina_merged_digital_expression.txt.gz (50.7 MB)
- Extract GSE63472_P14Retina_merged_digital_expression.txt to umi-normalization/datasets/retina/macosko_all
- umi-normalization/datasets/retina/macosko_all/

GSE81904:
- Download GSE81904_BipolarUMICounts_Cell2016.txt.gz (42.9 MB) to umi-normalization/datasets/retina/shekhar_bipolar/
- Download clust_retinal_bipolar.txt (1.5 MB) to umi-normalization/datasets/retina/shekhar_bipolar/

GSE133382:
- Download GSE133382_AtlasRGCs_CountMatrix.csv.gz (129.3 MB)
- Extract GSE133382_AtlasRGCs_CountMatrix.csv to umi-normalization/datasets/retina/tran_ganglion/
- Download RGC_Atlas.csv (1.05 GB) and RGC_Atlas_coordinates.txt (927 KB) to umi-normalization/datasets/retina/tran_ganglion/
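
Several of the downloads above are single-file .bz2 or .gz archives that need to be extracted. If no command-line tool is at hand, this can be done from Python as sketched below. The helper name `decompress` is hypothetical, and the archive file name in the usage comment is an assumption (the instructions only say `*.csv.bz2`).

```python
import bz2
import gzip
import shutil
from pathlib import Path

# Map the final file suffix to the matching standard-library opener.
OPENERS = {".bz2": bz2.open, ".gz": gzip.open}


def decompress(archive, target_dir):
    """Extract a single-file .bz2 or .gz archive into `target_dir`.

    Returns the path of the extracted file. Raises KeyError for any
    other suffix.
    """
    archive = Path(archive)
    opener = OPENERS[archive.suffix]
    target_dir = Path(target_dir)
    target_dir.mkdir(parents=True, exist_ok=True)
    # Path.stem drops only the last suffix: "x.csv.bz2" becomes "x.csv".
    out_path = target_dir / archive.stem
    with opener(archive, "rb") as src, out_path.open("wb") as dst:
        shutil.copyfileobj(src, dst)
    return out_path


# Usage (archive name assumed):
# decompress("GSM1599501_K562_pure_RNA.csv.bz2",
#            "umi-normalization/datasets/indrop/")
```

This helper is not suitable for .tar.gz archives, which bundle several files and need a tar-aware tool such as the `tarfile` module.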

- Download gene_count.txt (18 GB), gene_annotation.csv (1.1 MB) and cell_annotation.csv (828 MB) to umi-normalization/datasets/cao

- Make sure the R package SingleCellExperiment is installed
- Download DuoClustering2018.tar.gz (4.93 GB) and extract it to umi-normalization/datasets
- Check that sce_full_Zhengmix8eq.rds exists at umi-normalization/datasets/DuoClustering2018/sce_full/
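
Before starting the notebooks, it can be useful to check that the downloaded files ended up where the instructions above put them. The following is a minimal sketch: the helper name `missing_files` is hypothetical, and the file list is an illustrative subset that assumes each file sits directly in the directory named above.

```python
from pathlib import Path

# Illustrative subset of the files named above, relative to the
# repository root (umi-normalization). Extend as needed.
EXPECTED_FILES = [
    "datasets/33k_pbmc/genes.tsv",
    "datasets/33k_pbmc/matrix.mtx",
    "datasets/indrop/GSM1599501_K562_pure_RNA.csv",
    "datasets/cao/gene_count.txt",
    "datasets/cao/gene_annotation.csv",
    "datasets/cao/cell_annotation.csv",
    "datasets/DuoClustering2018/sce_full/sce_full_Zhengmix8eq.rds",
]


def missing_files(root, expected=EXPECTED_FILES):
    """Return the entries of `expected` that do not exist as files below `root`."""
    root = Path(root)
    return [rel for rel in expected if not (root / rel).is_file()]


# Usage, run from the directory that contains the repository:
# for rel in missing_files("umi-normalization"):
#     print("missing:", rel)
```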