Thanks Daniel. The pipeline is built with parallelisation in mind, so all input fastqs (i.e. plates) can be run in parallel through SLURM. The final output will be a SingleCellExperiment object once the pipeline is complete. I will work on adding documentation soon.
Just a reminder about some annoying corner cases that mean it's not always 1 pair of FASTQ files = 1 plate, e.g., multiple pairs of FASTQs per plate, mini-bulk samples, and strip tubes where a subset of samples is pooled.
The current design of the pipeline doesn't assume that each fastq is one plate per se; rather, all fastq files are processed in parallel, with merging/consolidation occurring in the final step(s). The thinking behind this design is to maximise our ability to run in parallel, then consolidate once the data is in its most compact form.
Multiple pairs of fastqs per plate shouldn't pose a problem, although I'm not sure about the other scenarios. Are there any differences in how we process mini-bulk data and single-cell data? Does the strip tube case only affect the BCL demultiplexing? If you can point me to data we have for these edge cases, that would also be great.
> Are there any differences in how we process mini-bulk data and single-cell data?
Main differences I can think of:
> Does the strip tube case only affect the BCL demultiplexing?
Yeah, I think so. It's not strip-tube-specific per se but occurs whenever the lab team decide they want to pool a certain subset of samples together to adjust their relative contribution to the final library (e.g., if some samples have much more RNA, they may want to put less of those samples into the final library).
> If you can point me to data we have for these edge cases, that would also be great.
I've tried to tag all our repos (https://github.com/orgs/WEHISCORE/repositories) by what sort of data they have, so you can search for e.g., mini-bulk-cel-seq2
https://github.com/search?q=topic%3Amini-bulk-cel-seq2+org%3AWEHISCORE&type=Repositories
If you look at the 'Key variables' section of the `code/scPipe.R` script you might find some examples.
E.g., a straightforward project with 1 plate and 1 pair of FASTQs is C086_Kinkel, where I basically loop over a `plates` variable (although it's redundant there with just 1 plate). A more complicated example is C122_Clucas, where I've instead had to construct an `ids` variable to loop over, which is a combination of `sequencing_run` and `RPI`.
The logic in the various `code/scPipe.R` scripts is not very principled :) Basically, I initially assumed each project would have 1 plate = 1 pair of FASTQs and so looped over a `plate` variable. Then reality hit and kludges ensued :)
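To illustrate the two looping patterns, here's a rough sketch only; the variable values and loop bodies are placeholders, not the actual project code:

```r
# Simple case (e.g., C086_Kinkel): 1 plate = 1 pair of FASTQs.
plates <- c("plate1")  # hypothetical plate name
for (plate in plates) {
  # demultiplex/align/count this plate's FASTQ pair
}

# Complicated case (e.g., C122_Clucas): loop over every
# (sequencing_run, RPI) combination instead.
sequencing_runs <- c("run1", "run2")  # hypothetical values
RPIs <- c("RPI1", "RPI2")
ids <- as.vector(outer(sequencing_runs, RPIs, paste, sep = "."))
for (id in ids) {
  # process the FASTQ pair for this sequencing_run/RPI combination
}
```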
Thanks Pete. I'll have a bit more of a think about how to address complex plate designs in the pipeline.
To address the mini-bulk scenario, I can put in a flag that will cause the pipeline to output non-deduplicated UMI counts (in addition to the default deduplicated counts). I'll also test the QC reports.
My current metadata assumptions are very basic: the pipeline takes a user-formatted sample sheet in Illumina's format to process the bcl > fastq step (we can also run directly from fastqs, in which case this isn't needed). The second metadata file maps each cell ID to its corresponding barcode. Once we get the sample sheet spreadsheets into a more standardised format, I'm hoping to integrate that into the pipeline.
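For illustration, the second file follows scPipe's barcode annotation layout, one row per cell mapping a cell ID to a barcode sequence (the file name and example values below are made up):

```r
# A minimal cell-ID-to-barcode annotation file, in the two-column layout
# scPipe's demultiplexing step expects (one row per well/cell).
barcode_anno <- data.frame(
  cell_id = c("plate1_A01", "plate1_A02", "plate1_B01"),
  barcode = c("ACGTACGT", "TGCATGCA", "GATCGATC")
)
write.csv(barcode_anno, "metadata/plate1_barcode_anno.csv", row.names = FALSE)
```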
> Thanks Pete. I'll have a bit more of a think about how to address complex plate designs in the pipeline.
The main constraint (and something I check in `code/scPipe.R`) is that there are no duplicated well barcodes within a 'unit' (be it `plate`, `RPI`, or some more complex combination of things).
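A minimal version of that kind of check (the file path and column name are assumptions):

```r
# Fail fast if any well barcode is duplicated within a unit (here, a plate).
bc_anno <- read.csv("metadata/plate1_barcode_anno.csv")
stopifnot(anyDuplicated(bc_anno$barcode) == 0)
```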
It's great to see this progressing!
I'm thinking that the most straightforward first-pass design that handles these edge cases is to allow the user to specify a barcode file per pair of fastq files. Where no barcodes are duplicated, the user can input a single barcode file (containing all the cell barcodes across plates) and the pipeline will collate everything at the end. Where barcodes are duplicated, specifying a barcode file per pair avoids these conflicts. This puts the onus on the user to set up the barcode files, but in the future I'll write a step that splits the barcode files accordingly from a standardised input sample sheet to automate this. Does this sound reasonable @PeteHaitch?
Also, a few questions:
- Regarding reports, do you always want one report per pair of fastqs, or would you want this further collated?
- `scPipe.R` mentions creating the report separately (not from `create_sce_by_dir`) for more fine-grained control. What extra info does this give you?
- For the final collated SCE object, do we need any extra info apart from gene names, cell names and the count matrix? All the extra info from scPipe will be available in each fastq pair's SCE, but wondering whether it's worth collating all the extra info from these objects.

Sorry, I haven't had time to properly consider this, Marek. I'll do so next week.
> Regarding reports, do you always want one report per pair of fastqs, or would you want this further collated?
I think scPipe can only produce a report when there are no duplicated barcodes, so creating a report from collated files would require some additional work.
> `scPipe.R` mentions creating the report separately (not from `create_sce_by_dir`) for more fine-grained control. What extra info does this give you?
It's actually to produce less info. I modify the report template to inject some code that terminates the report early, skipping over some steps that were/are prone to failure (and where I couldn't be bothered trying to figure out or fix the source of the error).
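Something along these lines (a sketch only; the template path and section marker are assumptions, and `knitr::knit_exit()` is the standard way to stop an R Markdown render early):

```r
# Inject an early-exit chunk into a copy of the report template so that
# knitting stops before the failure-prone downstream sections.
template <- readLines("report_template.Rmd")
cutoff <- grep("Section that tends to fail", template)[1]  # assumed marker
exit_chunk <- c("```{r early_exit, echo=FALSE}", "knitr::knit_exit()", "```")
writeLines(append(template, exit_chunk, after = cutoff - 1),
           "report_template_truncated.Rmd")
```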
> For the final collated SCE object, do we need any extra info apart from gene names, cell names and the count matrix? All the extra info from scPipe will be available in each fastq pair's SCE, but wondering whether it's worth collating all the extra info from these objects.
There are some QC metrics that scPipe records in the SingleCellExperiment object returned by `scPipe::create_sce_by_dir()`; see `colData(sce)` and `metadata(sce)` on an object returned by this function.
You might be able to figure out how these fields are populated and create a smaller function to replicate this part of the process.
Historically, I have included the (cleaned) sample sheet as the colData of the returned SingleCellExperiment by supplying it to the `pheno_data` argument of `scPipe::create_sce_by_dir()`.
There might be an argument for this addition of sample metadata to be a distinct part of the pipeline, but its inclusion in the SingleCellExperiment object is ultimately essential.
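Roughly like this (a sketch under assumptions: the directory and file paths are made up, and the organism/gene ID values are the usual scPipe examples, not this project's settings):

```r
library(scPipe)
library(SingleCellExperiment)

# Cleaned sample sheet; rows must line up with the cell IDs used in the
# barcode annotation so it can serve as per-cell colData.
sample_sheet <- read.csv("metadata/C086_sample_sheet_cleaned.csv",
                         row.names = 1)

sce <- create_sce_by_dir(
  datadir      = "output/scPipe/plate1",  # assumed per-plate output dir
  organism     = "mmusculus_gene_ensembl",
  gene_id_type = "ensembl_gene_id",
  pheno_data   = sample_sheet
)

colData(sce)   # per-cell QC metrics plus the supplied sample metadata
metadata(sce)  # run-level information recorded by scPipe
```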
After a few trial runs, here's my solution for running plates in parallel: use the `--until` flag (`--until multiQC` in this case), then run from fastqs separately.

I've added a better way (511f3fd) to run multiple plates (the pipeline calls them samples), each with their own barcode file. You can now specify the `{sample}` wildcard in the `sample_sheet` and `barcode_file` parameters in the `config.yaml` (note: you have to do this for both parameters, otherwise it won't work). This is a much easier way to run in parallel where each plate requires its own metadata and barcode info.
Note that your sample sheets must exist per sample and match the sample names of your fastq files (maybe I can automate splitting sample sheets at some point). Your barcode files don't have to exist (they will be generated by the pipeline); you just need a name template.
Example:

```yaml
sample_sheet: metadata/scpipe_{sample}_sample_sheet.csv
barcode_file: metadata/{sample}_barcode_anno.csv
```
Great to see this project. I would like it so that multiple plates can be processed in parallel. The output will be a single count matrix or SingleCellExperiment object.