Gartner-Lab / deMULTIplex2

deMULTIplex2: robust sample demultiplexing for scRNA-seq
Creative Commons Attribution 4.0 International
10 stars 0 forks source link

Apply deMULTIplex2 for sci-Plex? #89

Closed YijiaChi closed 6 months ago

YijiaChi commented 6 months ago

Hi, For sci-Plex, the cell barcode position is not a fixed value. R1: { 9-10 bp hairpin barcode }-{CAGAGC}-{ 8bp UMI }--{10 bp RT barcode}

How could I generate the readTable for this read structure?

Thanks for your excellent R package! It is such a great help for bioinformatic beginner like me!! Any reply would be really appreciated!

dannyconrad commented 6 months ago

Hi Yijia, we're glad you like our tool! I personally don't have experience with sci-Plex and the readTags() function was not designed to handle such a different barcode structure, but it may be possible. My understanding is that the RT and hairpin barcode together make up the "cell barcode". Your "sample barcode", i.e. the hashing oligo used in the experiment, should be in R2 then? If so, try running readTags() like this:

readTable <- readTags(
  barcode.type = "MULTIseq", # this is to make sure it extracts both the cell barcode and UMI from R1 - a quirk of how I wrote it
  assay = NULL,
  filter.cells = NULL,
  fastq.A = "/path/to/R1.fastq.gz",
  fastq.B = "/path/to/R2.fastq.gz",
  pos.cbc = c(1,34),
  pos.sbc = c(start,end), # input the appropriate indices of your hash barcode in R2 here
  pos.umi = c(1,34)
)

If this succeeds (it will probably throw an error that barcode is not correct) and your readTable is populated with actual sequences (Cell and UMI should be identical full length reads), then I would follow this up with a combination of gsub() commands to extract the correct sequence for the cell barcodes and UMIs:

readTable$Cell <- gsub("CAGAGC.{8}", "", readTable$Cell) # this should remove the UMI and constant region CAGAGC and leave the concatenated hairpin and 10bp RT barcode
readTable$UMI <- gsub(".*?CAGAGC(.{8}).*", "\\1", readTable$UMI) # this should keep just the UMI

I assume your R1 reads will not be variable length (i.e. all 34 bases) even if the content you wish to extract within is variable (33-34 bases). This means there will be an extra base appended to the cell barcodes that used a 9-base hairpin but I don't think this should cause any problems as long as you're aware of it.

As an aside, since you'll likely want to pair your sample demultiplexing results with gene expression analysis of your sci-RNA-seq library, you'll probably need to confirm how the cell barcodes are generated in that analysis pipeline. For example, if the RT and hairpin sequences are concatenated in the opposite order or the 9-base hairpins have that extra base trimmed, you'll need to make adjustments so that once you have your deMULTIplex2 results, your sample assignments can be transferred over to your other analysis.

Let me know if that helps!

YijiaChi commented 6 months ago

Thank you very much for your detailed explanation and prompt response!! Your patient guidance has been incredibly helpful, and I truly appreciate the time and effort you've taken to assist me!!!

Attached is the script customized for sci-Plex, building upon your function:

readTags <- function (fastq.A = NA, fastq.B = NA, 
                      pos.cbc,  pos.sbc, pos.umi,
                      filter.cells = NULL, ...) 
{
  t0 <- Sys.time()

  if (length(fastq.A) < 1 | length(fastq.B) < 1) {
    return(message("Error - one or more FASTQ files not found"))
  }
  else if (length(fastq.A) > 1 | length(fastq.B) > 1) {
    return(message("Error - too many files with match names found"))
  }

  cat("### Building Read Table ###", fill = T)

  cat("Loading first set of reads...", fill = T)
  r <- readFastq(fastq.A)
  gc(verbose = F)

  read_table <- data.frame(Cell = subseq(sread(r),pos.cbc[1], pos.cbc[2]), 
                           UMI = subseq(sread(r), pos.umi[1], pos.umi[2]))

  read_table$Cell <- gsub("CAGAGC.{8}", "_", read_table$Cell) # this should remove the UMI and constant region CAGAGC and leave the concatenated hairpin and 10bp RT barcode
  read_table$UMI <- gsub(".*?CAGAGC(.{8}).*", "\\1", read_table$UMI) # this should keep just the UMI

  hp_check <- substr(x = read_table$Cell,start = 10,stop=10)

  # 34bp :9{6+8}10+T ;10{6+8}10
  # to be consistant with star_solo_output cell_id :[9 or 10 bp] hp_rtbc
  # if hairpin=9bp ,remove the last nucleutide of cell bc

  tmp <- read_table$Cell[hp_check== "_"]
  read_table$Cell[hp_check== "_"]<- substr(x = tmp,
                                           start = 1,
                                           stop = nchar(tmp)-1)
  tmp<-NULL

  cat("Finished processing first set of reads; unloading from memory", 
      fill = T)
  r <- NULL
  gc(verbose = F)
  cat("Loading second set of reads...", fill = T)
  r <- readFastq(fastq.B)
  gc(verbose = F)

  read_table$Sample <- as.character(subseq(sread(r), pos.sbc[1],pos.sbc[2]))

  cat("Finished processing second set of reads; unloading from memory", 
      fill = T)
  r <- NULL
  gc(verbose = F)
  cat("Finished parsing ", nrow(read_table), " read pairs", 
      sep = "", fill = T)
  if (is.null(filter.cells)) {
    cat("Keeping all reads because filter.cells = NULL", 
        fill = T)
  }
  else if (!is.null(filter.cells)) {
    cat("Filtering for ", length(filter.cells), " provided cell barcodes...", 
        sep = "", fill = T)
    ind <- which(read_table$Cell %in% filter.cells)
    read_table <- read_table[ind, ]
  }
  read_table <- read_table[, c("Cell", "Sample", 
                               "UMI")]
  cat("Finished building read table", fill = T)
  cat("\n")
  cat("Finished in", round(difftime(Sys.time(), t0, units = "mins")[[1]], 
                           1), "minutes", sep = " ")
  cat("\n")
  return(read_table)
}

}
readTable <- readTags(
  #filter.cells = NULL,
  fastq.A = R1,
  fastq.B = R2,
  pos.cbc = c(1,34),
  pos.sbc = c(1,10), # input the appropriate indices of your hash barcode in R2 here
  pos.umi = c(1,34)
)

dim(all_read_table)
keep <- (nchar(all_read_table$Cell) %in%  c(23,24))
sum(keep)
all_read_table <- all_read_table[sum(keep) ,]
#dir.create(exp)
write.table(all_read_table,file = paste0(exp,"/",exp,"_01_read_seq_table.txt"),
            quote = F,sep = "\t",row.names = F)

## Step2 : Aign the readtable to the reference bc =====

tag_mtx <- alignTags(all_read_table, ref_sg_seq)

tag_mtx[1:4,1:4]  # cell_bc * sg

summary(rowSums(tag_mtx))
summary(colSums(tag_mtx))

# filter low quality BC
cell_ids <- Matrix::rowSums(tag_mtx) >= 5  # umi_detected_in_a_cell: filter cells
tag_used <- Matrix::colSums(tag_mtx) >= 20   # umi_detected_in_a_sg  {expected cell/num of sg: 100}

sum(cell_ids)  # {~ expected cell number}
sum(tag_used)

tag_mtx <- tag_mtx[cell_ids,tag_used]

tag_df<- tag_mtx%>%as.matrix()%>%as.data.frame()%>%
  mutate(CBC=rownames(.))%>%select(CBC,everything())

write.table(tag_df,file = paste0(exp,"/",exp,"_02_sgRNA_CBC_mtx.txt"),
            quote = F,sep = "\t",row.names = F)

tag_df


# Step3 : Assign the sample identity to cell  [ho -> cbc] ====

res <- demultiplexTags(tag_mtx,
                       plot.path = paste0(exp),
                       plot.name = exp,
                       plot.diagnostics = T)
table(res$final_assign)

assign <- as.data.frame(res$assign_table)%>%
  mutate(CBC=rownames(.))

write.table(assign,file = paste0(exp,"/",exp,"_03_final_assign.txt"),
            quote = F,sep = "\t",row.names = F)

tagHist(tag_mtx = tag_mtx, 
        minUMI = 10, 
        plotnUMI = F)
ggsave(paste0(exp,"/","tagHist.pdf"))

pdf(paste0(exp,"/","tagCallHeatmap.pdf"))
tagCallHeatmap(tag_mtx = tag_mtx, 
               calls = res$final_assign,
               log=F)

dev.off()

# Error occur im the situation that 
        # one of the calltypes has only 1 cell
        # the demultiplexed hash oligo is less than the refenrence hash oligo

# calltypes <- unique(calls[duplicated(calls)])
# tmp <- tmp[, order]

 corrected  function TagHist() ( with bad my exp)

tagCallHeatmap

qinzhu commented 6 months ago

Thanks Yijia for sharing your code with us! We have never benchmarked deMULTIplex2 on sci-Plex data and we do not know how well the model fits the data. So please check the diagnostic plots output by deMULTIplex2. Let us know if anything weird shows up (does the distribution plot look similar to that in the paper? does the model fitting fail for some of your samples?).