Best practice for pipelines

DavidsonGroup / flexiplex

The Flexible Demultiplexer

https://davidsongroup.github.io/flexiplex/

MIT License


Best practice for pipelines #48

Open ljwharbers opened 3 days ago

ljwharbers commented 3 days ago

Hi all,

I am working on an (nf-core) pipeline for single-cell long-read DNA barcoding, either as a separate pipeline or as an addition to the current https://nf-co.re/scnanoseq/1.0.0/ so that it supports different types of barcoding and both DNA and RNA.

I was wondering what you would consider to be the go-to method of using flexiplex. Currently I run it twice, with flexiplex-filter in between:

1. Run flexiplex in discovery mode, without a whitelist.
2. Run flexiplex-filter to get a list of 'known' barcodes (with or without a user-provided whitelist).
3. Run flexiplex with -k ${known_barcodes} to get the actual barcodes (and potentially UMIs).
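
For readers following along, the three steps above might be sketched as the shell function below. It only prints the commands rather than running them. Only -k and the -d/-x/-b/-f layout options come from this thread; the file names (reads.fastq, known_barcodes.txt, assigned_reads.fastq), the counts file name flexiplex_barcodes_counts.txt, and the flexiplex-filter options --whitelist and --outfile are assumptions that should be checked against `flexiplex -h` and `flexiplex-filter -h` for the installed version.

```shell
# Sketch only: prints the three commands instead of running them.
# Assumptions not taken from the thread: all file names, the name of the
# counts file written in discovery mode, and the flexiplex-filter options.
flexiplex_plan() {
    reads="$1"          # input FASTQ
    layout="$2"         # barcode layout arguments, e.g. "-d 10x3v2"
    whitelist="${3:-}"  # optional user-provided whitelist

    # Step 1: discovery mode, i.e. no -k list of known barcodes.
    printf 'flexiplex %s %s\n' "$layout" "$reads"

    # Step 2: reduce the discovered barcodes to a list of 'known' ones.
    filter_opts="--outfile known_barcodes.txt"
    if [ -n "$whitelist" ]; then
        filter_opts="--whitelist $whitelist $filter_opts"
    fi
    printf 'flexiplex-filter %s flexiplex_barcodes_counts.txt\n' "$filter_opts"

    # Step 3: assign reads to the known barcodes with -k.
    printf 'flexiplex %s -k known_barcodes.txt %s > assigned_reads.fastq\n' "$layout" "$reads"
}

# Example: print the plan for the 10x 3' v2 preset, without a whitelist.
flexiplex_plan reads.fastq "-d 10x3v2"
```

Printing the plan first makes it easy to inspect the exact arguments before wiring each line into its own pipeline process.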

I'm more than happy to further discuss what you would consider to be the go-to way of doing this in a workflow.

Cheers, Luuk

nadiadavidson commented 3 days ago

Hi Luuk,

Yes, that's how we run it. If you already know the barcodes, e.g. the non-empty drops from short-read data, you could skip to step 3. But if you only have a large whitelist of possible barcodes, like the 3 million from 10x (or nothing at all), run steps 1 and 2 first.

Other options might need to be added to specify the barcode structure and flanking sequence if they differ from the default (RNA, 10x v3 3'). We'd be happy to help work this out with you if needed.

Please let us know if you have any more questions.

Cheers, Nadia.

ljwharbers commented 2 days ago

Hi Nadia,

Thanks for the response!

Good to know, then I'll go ahead with this setup. Regarding the barcodes, I currently have it as a switch statement in the Nextflow module configuration. Not sure if you're familiar with nf-core modules and what their configuration looks like, but it will look like the following:

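// FLEXIPLEX
process {
    withName: '.*RUN_FLEXIPLEX:FLEXIPLEX_DISCOVERY' {
        // Closure, so params.barcode_format is resolved when the task runs
        ext.args = {
            def barcodeArg = ""
            switch (params.barcode_format) {
                case "10x_atac":
                    // Custom layout: left flank, 16 bp barcode, right flank, max. flank edit distance 8
                    barcodeArg = '-x "ACCGAGATCTACAC" -b "????????????????" -x "CGCGTCTGTCGTCGGCAGCGTCAGATGTGTATAAGAGACAG" -f 8'
                    break
                case "10x3v2":
                    // Built-in flexiplex preset
                    barcodeArg = "-d 10x3v2"
                    break
            }
            // Any other barcode_format leaves the arguments empty
            barcodeArg.trim()
        }
    }
}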

This way it should be really easy for me to add any barcoding strategy to the pipeline without having to specify all the different flanking regions as separate parameters each time.

Thanks again and I'll keep you updated if we have it implemented somewhere. Meanwhile, feel free to close this :)

Cheers, Luuk

nadiadavidson commented 2 days ago

Hi Luuk,

Nice to see how it fits into the pipeline, and the ATAC flanking sequence. If you find that few reads are being assigned a barcode region, you can try increasing -f (max. flank edit distance): this flanking sequence looks a little longer than our RNA default, which also uses -f 8.
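
One way to explore this might be a small sweep over -f values on a subsample of reads, as in the sketch below. It only prints one command per candidate value. The ATAC layout is copied from the configuration above; the file name subsample.fastq, the candidate values 10 and 12, and the use of -n as an output file prefix (so that runs do not overwrite each other) are assumptions to verify with `flexiplex -h`.

```shell
# Sketch only: prints one flexiplex command per candidate -f value.
# Assumption not taken from the thread: -n sets the output file prefix.
flank_sweep_plan() {
    reads="$1"    # ideally a small subsample of the reads
    layout="$2"   # layout arguments WITHOUT any -f option
    shift 2       # remaining arguments are the -f values to try
    for f in "$@"; do
        printf 'flexiplex %s -f %s -n flank_f%s %s\n' "$layout" "$f" "$f" "$reads"
    done
}

# Example: ATAC layout from the config above, trying -f 8, 10 and 12.
flank_sweep_plan subsample.fastq \
    '-x "ACCGAGATCTACAC" -b "????????????????" -x "CGCGTCTGTCGTCGGCAGCGTCAGATGTGTATAAGAGACAG"' \
    8 10 12
```

Comparing how many reads yield a barcode region in each run would then indicate whether a larger -f helps for the longer flank.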

I think it would also be really nice for us to add ATAC to our defaults (e.g. -d 10xATAC), so look out for this in future versions. If there are other barcoding schemes, we are also happy to add these to our list of defaults, just let us know!

Cheers, Nadia.

ljwharbers commented 2 days ago

Hi Nadia,

I will definitely optimize this a bit and see which -f value and flanking sequence length work best.

I'll let you know once I've done so; it would indeed be great to have it as a default.

Cheers, Luuk