tutorial processing multi-seq raw fastqs?

chris-mcginnis-ucsf / MULTI-seq

R implementation of MULTI-seq sample classification workflow


tutorial processing multi-seq raw fastqs? #8

Closed adiamb closed 4 years ago

adiamb commented 4 years ago

Hello Chris, we just used MULTI-seq with 10x for scRNA-seq on 32 samples. I received 16 file triplets (R1, R2, and I) back from the UCSF core, and I have a list of barcodes 1 through 32. Is there a guide that starts from this point? I apologize if this is not the right channel to ask for help; I am just starting out with 10x analysis and MULTI-seq. Many thanks for this work. Aditya

RM-SCB commented 4 years ago

When you generate the FASTQs (using cellranger mkfastq, for example), you need to split the libraries into two groups based on the indexes used: one is the cDNA library that you feed into cellranger count, and the other is your MULTI-seq library. From the cDNA library you obtain the cell barcode IDs to use in the MULTI-seq pipeline.

The procedure is explained here for multiseq https://github.com/chris-mcginnis-ucsf/MULTI-seq and here for cellranger: https://support.10xgenomics.com/single-cell-gene-expression/software/pipelines/latest/using/tutorial_ov

chris-mcginnis-ucsf commented 4 years ago

Thanks @r-mvl. You first need to pre-process the expression FASTQs in order to define the list of cell IDs that you want to demultiplex. You'll then run the 'MULTIseq.preProcess' command, specifying the cell IDs, reference barcodes, and MULTI-seq FASTQs as input arguments (as described in the README; see the section called "Tutorial: 96-plex HMEC sample multiplexed scRNA-seq").
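One way to obtain that list of cell IDs, as a minimal sketch: Cell Ranger's count output includes a gzipped barcodes.tsv.gz file (one cell barcode per line) under outs/filtered_feature_bc_matrix/. The helper name read_cell_ids and the example path below are illustrative assumptions, not part of the MULTI-seq package.

```r
# Hypothetical helper (not part of the MULTI-seq package): read the cell
# barcodes written by Cell Ranger and strip the trailing "-<number>" suffix.
read_cell_ids <- function(barcodes_path) {
  con <- gzfile(barcodes_path, open = "rt")
  on.exit(close(con))
  barcodes <- readLines(con)
  # "$" anchors the pattern so that only a trailing suffix such as "-1" is removed.
  sub("-[0-9]+$", "", barcodes)
}

# Example usage (the path is an assumption about your directory layout):
# cellIDs <- read_cell_ids("LIB1/outs/filtered_feature_bc_matrix/barcodes.tsv.gz")
```

The resulting character vector is what would be passed as the cellIDs argument of MULTIseq.preProcess.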

adiamb commented 4 years ago

Thank you. I aggregated all our libraries (n=16) after processing them through Cell Ranger, and I ended up with cell barcodes that carry a numerical suffix, e.g. 'TTTGTTGTCTCTCGAC-1' to 'TTTGTTGTCTCTCGAC-16'. Should I remove these suffixes before running the pre-processing step? Also, the MULTI-seq FASTQs are demultiplexed based on the 10x index barcodes; should I concatenate all of the MULTI-seq FASTQs before passing them to the pre-processing step? Thank you so much.

chris-mcginnis-ucsf commented 4 years ago

Hey Aditya,

Sounds like a pretty epic dataset!

You'll want to pre-process each of the sets of MULTI-seq FASTQs independently (so no concatenation). This will ensure that there are no ambiguous sample classifications due to overlapping cellIDs between the libraries. You'll also want to remove the library flags included from Cell Ranger.

So, for example, if I was processing the first library, I would do the following in R:

cellIDs_1 <- grep("-1$", colnames(data), value=TRUE)  # "$" anchors the match so that "-10" through "-16" are not picked up
cellIDs_1 <- sub("-1$", "", cellIDs_1)  # strip the Cell Ranger library suffix
readTable_1 <- MULTIseq.preProcess(R1 = 'PATH/TO/R1_1.fastq.gz', R2 = 'PATH/TO/R2_1.fastq.gz', cellIDs = cellIDs_1, cell=c(1,16), umi=c(17,28), tag=c(1,8))

This code assumes that your aggregated gene expression data is stored as an R object called 'data', and that you used 10x V3 reagents. If you used V2 reagents, change the umi argument to umi=c(17,26).

Chris
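The per-library steps above can be generalized to all 16 libraries. The following is a minimal sketch rather than the author's code: split_cell_ids_by_library and umi_range are hypothetical helper names, and the barcodes in the example are made up. It groups aggregated barcodes by their Cell Ranger suffix and strips the suffix with an end-anchored pattern, so that '-1' is never confused with '-10' through '-16'.

```r
# Hypothetical helper: split aggregated Cell Ranger barcodes of the form
# "<cell barcode>-<library number>" into one vector of bare cell IDs per library.
split_cell_ids_by_library <- function(barcodes) {
  library_id <- sub("^.*-([0-9]+)$", "\\1", barcodes)
  cell_id <- sub("-[0-9]+$", "", barcodes)
  ids <- split(cell_id, library_id)
  # split() sorts names as strings ("1", "10", "2", ...); reorder numerically.
  ids[order(as.integer(names(ids)))]
}

# Hypothetical helper: UMI read positions for the two chemistries
# mentioned in the thread (V3: 17-28, V2: 17-26).
umi_range <- function(chemistry) {
  switch(chemistry,
         V3 = c(17, 28),
         V2 = c(17, 26),
         stop("unknown chemistry: ", chemistry))
}

# Made-up barcodes; the same cell barcode occurs in libraries 1 and 16,
# which is why each library is pre-processed on its own.
barcodes <- c("TTTGTTGTCTCTCGAC-1", "AAACCTGAGAAACCAT-1",
              "TTTGTTGTCTCTCGAC-16", "AAACCTGAGCGTAGTG-10")
cell_ids <- split_cell_ids_by_library(barcodes)

# Each element would then go to MULTIseq.preProcess together with that
# library's own FASTQs (placeholder paths, following the example in the thread):
# readTables <- list()
# for (lib in names(cell_ids)) {
#   readTables[[lib]] <- MULTIseq.preProcess(
#     R1 = sprintf("PATH/TO/R1_%s.fastq.gz", lib),
#     R2 = sprintf("PATH/TO/R2_%s.fastq.gz", lib),
#     cellIDs = cell_ids[[lib]],
#     cell = c(1, 16), umi = umi_range("V3"), tag = c(1, 8))
# }
```

Keeping the results in a list keyed by library number makes it straightforward to classify each library separately and only then re-attach the suffix when merging back with the aggregated expression data.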