Replicate results from the paper with data uploaded to GEO

JackieMium commented 1 year ago

Hi! I am really interested in the paper Comprehensive single-cell transcriptional profiling defines shared and unique epithelial injury responses during kidney fibrosis . So I downloaded all files at GSE190887 and also [meta_cell_type_sample.zip] from this repo at meta_cell_type_sample.zip .

Here is the R code I used to read all raw data and tried to construct a Seurat Object:

suppressMessages({
    library("Seurat")
    library("Matrix")
  }
)

setwd("/path/to/projects/")
wdir <- "GSE190887_CellMet_UIRI_UUO"
odir <- paste0(wdir, "/res_data/")
idir <- paste0(wdir, "/raw_data/")  # where all the files mentioned above reside

seurat_obj <- read.csv(paste0(idir, "GSM5733424_count.MM.txt"), header = FALSE)
cat("\nSaving raw count data ... \n")

cell <- read.csv(paste0(idir, "GSE190887_cell_annotate.txt.gz"), header = FALSE)
ct <- read.csv(paste0(idir, "GSE190887_meta_cell_type_sample.csv.gz"), row.names = 1, header = TRUE)
gene <- read.csv(paste0(idir, "GSE190887_gene_name_annotate.txt.gz"), header = FALSE)

cat("\nMaking a MM from count table:\n")
mm0 <- sparseMatrix(i = seurat_obj[1:nrow(gene), 1],
                                   j = seurat_obj[1:nrow(gene), 2],
                                   x = seurat_obj[1:nrow(gene), 3],
                                  dims = c(nrow(gene), nrow(cell)))

# combine each odd (intron counts) and even (exon counts) row pair into one
# so that counts for each gene represent the sum of intron + exon 
mm <- mm0[seq(1, nrow(gene), 2) ] + mm0[seq(2, nrow(gene), 2), ]

colnames(mm) <- cell$V1  # cell barcodes
mm <- mm[, colnames(mm) %in% rownames(ct)] # subset cells to all cells in GSE190887_meta_cell_type_sample.csv

gene <- gene[seq(1, nrow(gene), 2), ]  # drop xxx_intro
dup <- duplicated(gene$V4)   # I want gene symbol for rownames so I dropped all duplicated (132/48785) gene symbols
cat("\nHow many dup gene symbols: \n")
mm <- mm[!dup, ]
rownames(mm) <- gene$V4[!dup]

seurat_obj <- CreateSeuratObject(counts = mm)  # no filter thresholds applied

ct <- ct[match(colnames(mm), rownames(ct)), ]
seurat_obj@meta.data$orig.ident <- ct$sample
seurat_obj@meta.data$celltype <- ct$celltype0421

seurat_obj <- PercentageFeatureSet(seurat_obj, "^mt-", col.name = "percent_mito")
seurat_obj <- PercentageFeatureSet(seurat_obj, "^Rl[sl]", col.name = "percent_ribo")
seurat_obj <- PercentageFeatureSet(seurat_obj, "^Hb[ab]", col.name = "percent_hemo")
seurat_obj$log10GenesPerUMI <- log10(seurat_obj$nFeature_RNA) / log10(seurat_obj$nCount_RNA)

saveRDS(seurat_obj, paste0(odir, "010-99.SO_RAW.rds"))

But when I check the resulting seurat_obj:

# total counts detected in a cell, same as nCount_RNA in meta.data
csum <- colSums(seurat_obj@assays$RNA@counts)
# total counts of each gene across all cells
rsum <- rowSums(seurat_obj@assays$RNA@counts)

table(rsum == 0)
#FALSE  TRUE 
#   8820 39843 
table(csum == 526)
# FALSE   TRUE 
#      258 309408

it seems like there are only 18% (8820 / 48663) genes detected across all cells, and there are 99.91% (309408 / 309666) cells with a total read counts of 526. Given the extremely low detection rates of genes and almost all cells with a count of exactly 526, I am suspecting there must be something wrong. Any hint will be much appreciated!

HaikuoLi commented 1 year ago

Hi,

Thanks for your interest in our work!

The orginal sci-RNA-seq3 paper (Cao et al) actually released their code preprocessing the type of data: https://github.com/JunyueC/sci-RNA-seq3_pipeline/blob/master/gene_count_processing_sciRNAseq.R

It should also work on the data files we released on GEO. Could you have a try?

Best, Haikuo

JackieMium commented 1 year ago

Hi,

Thanks for your interest in our work!

The orginal sci-RNA-seq3 paper (Cao et al) actually released their code preprocessing the type of data: https://github.com/JunyueC/sci-RNA-seq3_pipeline/blob/master/gene_count_processing_sciRNAseq.R

It should also work on the data files we released on GEO. Could you have a try?

Best, Haikuo

Thanks for the heads up. I've carefully reviewed my code and spotted some mistakes. Now I am getting a decent SeuratObject for downstream analysis. The code above was corrected as:

suppressMessages({
    library("Seurat")
    library("Matrix")
  }
)

setwd("/path/to/projects/")
wdir <- "GSE190887_CellMet_UIRI_UUO"
odir <- paste0(wdir, "/res_data/")
idir <- paste0(wdir, "/raw_data/")  # where all the files mentioned above reside

seurat_obj <- read.csv(paste0(idir, "GSM5733424_count.MM.txt"), header = FALSE)
cat("\nSaving raw count data ... \n")

cell <- read.csv(paste0(idir, "GSE190887_cell_annotate.txt.gz"), header = FALSE)
ct <- read.csv(paste0(idir, "GSE190887_meta_cell_type_sample.csv.gz"), row.names = 1, header = TRUE)
gene <- read.csv(paste0(idir, "GSE190887_gene_name_annotate.txt.gz"), header = FALSE)

cat("\nMaking a MM from count table:\n")
mm0 <- sparseMatrix(i = seurat_obj$V1,
                    j = seurat_obj$V2,
                    x = seurat_obj$V3,
                    dims = c(nrow(gene), nrow(cell)))

# combine each odd (intron counts) and even (exon counts) row pair into one
# so that counts for each gene represent the sum of intron + exon 
mm <- mm0[seq(1, nrow(gene), 2), ] + mm0[seq(2, nrow(gene), 2), ]

colnames(mm) <- cell$V1  # cell barcodes
mm <- mm[, colnames(mm) %in% rownames(ct)] # subset cells to all cells in GSE190887_meta_cell_type_sample.csv

gene <- gene[seq(1, nrow(gene), 2), ]  # drop xxx_intro
dup <- duplicated(gene$V4)   # I want gene symbol for rownames so I dropped all duplicated (132/48785) gene symbols
cat("\nHow many dup gene symbols: \n")
mm <- mm[!dup, ]
rownames(mm) <- gene$V4[!dup]

seurat_obj <- CreateSeuratObject(counts = mm)  # no filter thresholds applied

ct <- ct[match(colnames(mm), rownames(ct)), ]
seurat_obj@meta.data$orig.ident <- ct$sample
seurat_obj@meta.data$celltype <- ct$celltype0421

seurat_obj <- PercentageFeatureSet(seurat_obj, "^mt-", col.name = "percent_mito")
seurat_obj <- PercentageFeatureSet(seurat_obj, "^Rp[sl]", col.name = "percent_ribo")
seurat_obj <- PercentageFeatureSet(seurat_obj, "^Hb[ab]", col.name = "percent_hemo")
seurat_obj$log10GenesPerUMI <- log10(seurat_obj$nFeature_RNA) / log10(seurat_obj$nCount_RNA)

saveRDS(seurat_obj, paste0(odir, "010-99.SO_RAW.rds"))

TheHumphreysLab / sci-RNA-seq-kidney

Replicate results from the paper with data uploaded to GEO #2