Difficulty replicating data preprocessing for Zeisel data

Hi there,

This is a pretty minor issue, but I thought I would document it here in case anybody else runs into it, and for my own records in future. Thanks for providing a comprehensive repository!

I tried to replicate the preprocessing for the Zeisel data and I ran into a couple of difficulties with the metadata. When I run the code using the files on the Linarsson site, I run into some issues with the metadata, which messes up the selection of cells, ie:

https://github.com/MarioniLab/RegressionBASiCS2017/blob/fb1d833614e0469db51ec1677cabf66433f5e19e/Preprocessing/Data_preparation.R#L70

dim(Zeisel.bio[, Zeisel.meta[2, ] == 5])
# [1] 10687     0

I can't reconcile this part of the code with the structure I get from the files, which seems for some reason to have been read in as some sort of nested list, and with some rows apparently being different in value. I have no idea if this is due to a change in R, a change in the file hosted on the lab website (modification date is July 17 2016 despite the filename suggesting 17 August 2014), a difference in operating system, or different system/R options (or a combination of these).

In case it's useful for anybody, the code below is self-contained and should work with the current data from the Linarsson lab site. Be aware, it creates and writes to the data directory in the current working directory by default. Cheers

## https://github.com/MarioniLab/RegressionBASiCS2017/blob/fb1d833614e0469db51ec1677cabf66433f5e19e/Preprocessing/Data_preparation.R#L42

data.dir <- "data"
dir.create(data.dir, showWarnings = FALSE)
countsfile <- file.path(data.dir, "zeisel_counts.txt")
spikesfile <- file.path(data.dir, "zeisel_spikes.txt")
system(
  paste(
    "wget https://storage.googleapis.com/linnarsson-lab-www-blobs/blobs/cortex/expression_mRNA_17-Aug-2014.txt",
    "-O", countsfile
  )
)
system(
  paste(
    "wget https://storage.googleapis.com/linnarsson-lab-www-blobs/blobs/cortex/expression_spikes_17-Aug-2014.txt",
    "-O", spikesfile
  )
)

Zeisel <- read.table(
  countsfile,
  header = FALSE,
  sep = "\t",
  stringsAsFactors = FALSE,
  fill = TRUE
)

# Meta information
Zeisel.meta <- Zeisel[1:11, ]
Zeisel.meta <- Zeisel.meta[, -c(1, 2)]

# Collect biological counts
Zeisel.bio <- Zeisel[12:nrow(Zeisel),]
rownames(Zeisel.bio) <- as.character(Zeisel.bio[, 1])
Zeisel.bio <- Zeisel.bio[, -c(1, 2)]
colnames(Zeisel.bio) <- as.character(Zeisel.meta[8, ])
Zeisel.bio <- data.matrix(Zeisel.bio[rowMeans(data.matrix(Zeisel.bio)) > 0.1, ])

# Spike-in counts
Zeisel.ERCC <- read.table(
  spikesfile,
  header = FALSE,
  sep = "\t",
  stringsAsFactors = FALSE,
  fill = TRUE
)

Zeisel.ERCC <- Zeisel.ERCC[12:nrow(Zeisel.ERCC), ]
rownames(Zeisel.ERCC) <- as.character(Zeisel.ERCC[, 1])
Zeisel.ERCC <- Zeisel.ERCC[, -c(1, 2)]
colnames(Zeisel.ERCC) <- as.character(Zeisel.meta[8, ])
Zeisel.ERCC[] <- sapply(Zeisel.ERCC, as.numeric)

# Look at sizes of cell groups
table(as.character(Zeisel.meta[9, ]))

# Filter out one cell type for model testing - microglia
microglia.ind <- vapply(
  Zeisel.meta[9, ],
  function(x) x == "microglia",
  logical(1)
)
microglia <- Zeisel.bio[, microglia.ind]
ERCC <- Zeisel.ERCC[, colnames(microglia)]
ERCC <- ERCC[rowSums(ERCC) > 0, ]

write.table(rbind(microglia, ERCC), file.path(data.dir, "microglia_Zeisel.txt"), sep = "\t")

# Another cell type for downsampling - CA1
CA1.ind <- vapply(
  Zeisel.meta[9, ],
  function(x) x == "pyramidal CA1",
  logical(1)
)
CA1 <- Zeisel.bio[, CA1.ind]
ERCC <- Zeisel.ERCC[, colnames(CA1)]

write.table(rbind(CA1, ERCC), file.path(data.dir, "CA1_Zeisel.txt"), sep = "\t")

MarioniLab / RegressionBASiCS2017

Difficulty replicating data preprocessing for Zeisel data #1