Thanks Daniel. The pipeline is built with parallelisation in mind, so all input fastqs (i.e. plates) can be run in parallel through SLURM. The final output will be a SingleCellExperiment object once the pipeline is complete. I will work on adding documentation soon.
Just a reminder about some annoying corner cases that mean it's not always 1 pair of FASTQ files = 1 plate, e.g., multiple pairs of FASTQs per plate, mini-bulk samples, and strip tubes where a subset of samples is pooled.
The current design of the pipeline doesn't assume that each fastq is one plate per se; rather, all fastq files are processed in parallel, with merging/consolidation occurring in the final step(s). The thinking behind this design is to maximise our ability to run in parallel, then consolidate once the data is in its most compact form.
Multiple pairs of fastqs per plate shouldn't pose a problem, although I'm not sure about the other scenarios. Are there any differences in how we process mini-bulk data and single-cell data? Does the strip tube case only affect the BCL demultiplexing? If you can point me to data we have for these edge cases, that would also be great.
> Are there any differences in how we process mini-bulk data and single-cell data?
Main differences I can think of:
> Does the strip tube case only affect the BCL demultiplexing?
Yeah, I think so. It's not strip-tube-specific per se but occurs whenever the lab team decide they want to pool a certain subset of samples together to adjust their relative contribution to the final library (e.g., if some samples have much more RNA, they may want to put less of those samples into the final library).
> If you can point me to data we have for these edge cases, that would also be great.
I've tried to tag all our repos (https://github.com/orgs/WEHISCORE/repositories) by what sort of data they have, so you can search for e.g., mini-bulk-cel-seq2
https://github.com/search?q=topic%3Amini-bulk-cel-seq2+org%3AWEHISCORE&type=Repositories
If you look at the 'Key variables' section of the `code/scPipe.R` script you might find some examples.
E.g., a straightforward project with 1 plate and 1 pair of FASTQs is C086_Kinkel, where I basically loop over a `plates` variable (although it's redundant there with just 1 plate). A more complicated example is C122_Clucas, where I've instead had to construct an `ids` variable to loop over, which is a combination of `sequencing_run` and `RPI`.
The logic in the various `code/scPipe.R` scripts is not very principled :) Basically, I initially assumed each project would have 1 plate = 1 pair of FASTQs and so looped over a `plate` variable. Then reality hit and kludges ensued :)
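To illustrate the two looping patterns, here's a rough sketch only; the variable values and loop bodies are placeholders, not the actual project code:

```r
# Simple case (e.g., C086_Kinkel): 1 plate = 1 pair of FASTQs.
plates <- c("plate1")  # hypothetical plate name
for (plate in plates) {
  # demultiplex/align/count this plate's FASTQ pair
}

# Complicated case (e.g., C122_Clucas): loop over every
# (sequencing_run, RPI) combination instead.
sequencing_runs <- c("run1", "run2")  # hypothetical values
RPIs <- c("RPI1", "RPI2")
ids <- as.vector(outer(sequencing_runs, RPIs, paste, sep = "."))
for (id in ids) {
  # process the FASTQ pair for this sequencing_run/RPI combination
}
```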
Thanks Pete. I'll have a bit more of a think about how to address complex plate designs in the pipeline.
To address the mini-bulk scenario, I can put in a flag that will cause the pipeline to output non-deduplicated UMI counts (in addition to the default deduplicated counts). I'll also test the QC reports.
My current metadata assumptions are very basic: the pipeline takes a user-formatted sample sheet in Illumina's format to process the bcl > fastq step (we can also run directly from fastqs, in which case this isn't needed). The second metadata file maps each cell ID to its corresponding barcode. Once we get the sample sheet spreadsheets into a more standardised format, I'm hoping to integrate that into the pipeline.
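For illustration, the second file follows scPipe's barcode annotation layout, one row per cell mapping a cell ID to a barcode sequence (the file name and example values below are made up):

```r
# A minimal cell-ID-to-barcode annotation file, in the two-column layout
# scPipe's demultiplexing step expects (one row per well/cell).
barcode_anno <- data.frame(
  cell_id = c("plate1_A01", "plate1_A02", "plate1_B01"),
  barcode = c("ACGTACGT", "TGCATGCA", "GATCGATC")
)
write.csv(barcode_anno, "metadata/plate1_barcode_anno.csv", row.names = FALSE)
```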
> Thanks Pete. I'll have a bit more of a think about how to address complex plate designs in the pipeline.
The main constraint (and something I check in `code/scPipe.R`) is that there are no duplicated well barcodes within a 'unit' (be it `plate`, `RPI`, or some more complex combination of things).
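A minimal version of that kind of check (the file path and column name are assumptions):

```r
# Fail fast if any well barcode is duplicated within a unit (here, a plate).
bc_anno <- read.csv("metadata/plate1_barcode_anno.csv")
stopifnot(anyDuplicated(bc_anno$barcode) == 0)
```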
It's great to see this progressing!
I'm thinking that the most straightforward first-pass design that handles these edge cases is to allow the user to specify a barcode file per pair of fastq files. Where no barcodes are duplicated, the user can input a single barcode file (containing all the cell barcodes across plates) and the pipeline will collate everything at the end. Where barcodes are duplicated, specifying a barcode file per pair avoids these conflicts. This puts the onus on the user to set up the barcode files, but in the future I'll write a step that splits the barcode files accordingly from a standardised input sample sheet to automate this. Does this sound reasonable @PeteHaitch?
Also, a few questions:
- Regarding reports, do you always want one report per pair of fastqs, or would you want this further collated?
- `scPipe.R` mentions creating the report separately (not from `create_sce_by_dir`) for more fine-grained control. What extra info does this give you?
- For the final collated SCE object, do we need any extra info apart from gene names, cell names and the count matrix? All the extra info from scPipe will be available in each fastq pair's SCE, but wondering whether it's worth collating all the extra info from these objects.

Sorry, I haven't had time to properly consider this, Marek. I'll do so next week.
> Regarding reports, do you always want one report per pair of fastqs, or would you want this further collated?
I think scPipe can only produce a report when there are no duplicated barcodes, so creating a report from collated files would require some additional work.
> `scPipe.R` mentions creating the report separately (not from `create_sce_by_dir`) for more fine-grained control. What extra info does this give you?
It's actually to produce less info. I modify the report template to inject some code that terminates the report early, skipping over some steps that were/are prone to failure (and where I couldn't be bothered trying to figure out or fix the source of the error).
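Something along these lines (a sketch only; the template path and section marker are assumptions, and `knitr::knit_exit()` is the standard way to stop an R Markdown render early):

```r
# Inject an early-exit chunk into a copy of the report template so that
# knitting stops before the failure-prone downstream sections.
template <- readLines("report_template.Rmd")
cutoff <- grep("Section that tends to fail", template)[1]  # assumed marker
exit_chunk <- c("```{r early_exit, echo=FALSE}", "knitr::knit_exit()", "```")
writeLines(append(template, exit_chunk, after = cutoff - 1),
           "report_template_truncated.Rmd")
```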
> For the final collated SCE object, do we need any extra info apart from gene names, cell names and the count matrix? All the extra info from scPipe will be available in each fastq pair's SCE, but wondering whether it's worth collating all the extra info from these objects.
There are some QC metrics that scPipe records in the SingleCellExperiment object returned by `scPipe::create_sce_by_dir()`; see `colData(sce)` and `metadata(sce)` on an object returned by this function.
You might be able to figure out how these fields are populated and create a smaller function to replicate this part of the process.
Historically, I have included the (cleaned) sample sheet as the colData of the returned SingleCellExperiment by supplying it to the `pheno_data` argument of `scPipe::create_sce_by_dir()`.
There might be an argument for this addition of sample metadata to be a distinct part of the pipeline, but its inclusion in the SingleCellExperiment object is ultimately essential.
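Roughly like this (a sketch under assumptions: the directory and file paths are made up, and the organism/gene ID values are the usual scPipe examples, not this project's settings):

```r
library(scPipe)
library(SingleCellExperiment)

# Cleaned sample sheet; rows must line up with the cell IDs used in the
# barcode annotation so it can serve as per-cell colData.
sample_sheet <- read.csv("metadata/C086_sample_sheet_cleaned.csv",
                         row.names = 1)

sce <- create_sce_by_dir(
  datadir      = "output/scPipe/plate1",  # assumed per-plate output dir
  organism     = "mmusculus_gene_ensembl",
  gene_id_type = "ensembl_gene_id",
  pheno_data   = sample_sheet
)

colData(sce)   # per-cell QC metrics plus the supplied sample metadata
metadata(sce)  # run-level information recorded by scPipe
```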
After a few trial runs, here's my solution for running plates in parallel: use the `--until` flag (`--until multiQC` in this case), then run from fastqs separately.

I've added a better way (511f3fd) to run multiple plates (the pipeline calls them samples), each with their own barcode file. You can now specify the `{sample}` wildcard in the `sample_sheet` and `barcode_file` parameters in the `config.yaml` (note: you have to do this for both parameters, otherwise it won't work). This is a much easier way to run in parallel where each plate requires its own metadata and barcode info.
Note that your sample sheets must exist per sample and match the sample names of your fastq files (maybe I can automate splitting sample sheets at some point). Your barcode files don't have to exist (they will be generated by the pipeline); you just need a name template.
Example:

```yaml
sample_sheet: metadata/scpipe_{sample}_sample_sheet.csv
barcode_file: metadata/{sample}_barcode_anno.csv
```
Great to see this project. I would like it so that multiple plates can be processed in parallel. The output will be a single count matrix or SingleCellExperiment object.