Open JackieMium opened 1 year ago
Hi,
Thanks for your interest in our work!
The orginal sci-RNA-seq3 paper (Cao et al) actually released their code preprocessing the type of data: https://github.com/JunyueC/sci-RNA-seq3_pipeline/blob/master/gene_count_processing_sciRNAseq.R
It should also work on the data files we released on GEO. Could you have a try?
Best, Haikuo
Hi,
Thanks for your interest in our work!
The orginal sci-RNA-seq3 paper (Cao et al) actually released their code preprocessing the type of data: https://github.com/JunyueC/sci-RNA-seq3_pipeline/blob/master/gene_count_processing_sciRNAseq.R
It should also work on the data files we released on GEO. Could you have a try?
Best, Haikuo
Thanks for the heads up. I've carefully reviewed my code and spotted some mistakes. Now I am getting a decent SeuratObject for downstream analysis. The code above was corrected as:
suppressMessages({
library("Seurat")
library("Matrix")
}
)
setwd("/path/to/projects/")
wdir <- "GSE190887_CellMet_UIRI_UUO"
odir <- paste0(wdir, "/res_data/")
idir <- paste0(wdir, "/raw_data/") # where all the files mentioned above reside
seurat_obj <- read.csv(paste0(idir, "GSM5733424_count.MM.txt"), header = FALSE)
cat("\nSaving raw count data ... \n")
cell <- read.csv(paste0(idir, "GSE190887_cell_annotate.txt.gz"), header = FALSE)
ct <- read.csv(paste0(idir, "GSE190887_meta_cell_type_sample.csv.gz"), row.names = 1, header = TRUE)
gene <- read.csv(paste0(idir, "GSE190887_gene_name_annotate.txt.gz"), header = FALSE)
cat("\nMaking a MM from count table:\n")
mm0 <- sparseMatrix(i = seurat_obj$V1,
j = seurat_obj$V2,
x = seurat_obj$V3,
dims = c(nrow(gene), nrow(cell)))
# combine each odd (intron counts) and even (exon counts) row pair into one
# so that counts for each gene represent the sum of intron + exon
mm <- mm0[seq(1, nrow(gene), 2), ] + mm0[seq(2, nrow(gene), 2), ]
colnames(mm) <- cell$V1 # cell barcodes
mm <- mm[, colnames(mm) %in% rownames(ct)] # subset cells to all cells in GSE190887_meta_cell_type_sample.csv
gene <- gene[seq(1, nrow(gene), 2), ] # drop xxx_intro
dup <- duplicated(gene$V4) # I want gene symbol for rownames so I dropped all duplicated (132/48785) gene symbols
cat("\nHow many dup gene symbols: \n")
mm <- mm[!dup, ]
rownames(mm) <- gene$V4[!dup]
seurat_obj <- CreateSeuratObject(counts = mm) # no filter thresholds applied
ct <- ct[match(colnames(mm), rownames(ct)), ]
seurat_obj@meta.data$orig.ident <- ct$sample
seurat_obj@meta.data$celltype <- ct$celltype0421
seurat_obj <- PercentageFeatureSet(seurat_obj, "^mt-", col.name = "percent_mito")
seurat_obj <- PercentageFeatureSet(seurat_obj, "^Rp[sl]", col.name = "percent_ribo")
seurat_obj <- PercentageFeatureSet(seurat_obj, "^Hb[ab]", col.name = "percent_hemo")
seurat_obj$log10GenesPerUMI <- log10(seurat_obj$nFeature_RNA) / log10(seurat_obj$nCount_RNA)
saveRDS(seurat_obj, paste0(odir, "010-99.SO_RAW.rds"))
Hi! I am really interested in the paper Comprehensive single-cell transcriptional profiling defines shared and unique epithelial injury responses during kidney fibrosis . So I downloaded all files at GSE190887 and also [meta_cell_type_sample.zip] from this repo at meta_cell_type_sample.zip .
Here is the R code I used to read all raw data and tried to construct a Seurat Object:
But when I check the resulting
seurat_obj
:it seems like there are only 18% (8820 / 48663) genes detected across all cells, and there are 99.91% (309408 / 309666) cells with a total read counts of 526. Given the extremely low detection rates of genes and almost all cells with a count of exactly 526, I am suspecting there must be something wrong. Any hint will be much appreciated!