output files separately for each sample

istvankleijn commented 1 year ago

Is your feature related to a problem?

I am re-basecalling some of our previous experiments using super-accurate models, on the same GridION that the experiments were originally performed with. The files are organised by sample, i.e. one directory per flow cell, and I would like to generate alignment files for many (but not all) samples. In essence:

/data/
  |-- expA/
    |-- sample1/*/*/*.fast5
    |-- sample2/...
  |-- expB/
    |-- sample3/...
    |-- sample4/...
  |-- expC/
    |-- sample5/...

And I'd like to get, say, expA.sample1.{pass,fail}.bam, expA.sample2.{pass,fail}.bam, and expC.sample5.{pass,fail}.bam, with all the basecalled reads for those samples. (I have many more directories than this.)

Describe the solution you'd like

I wonder if there is a way to iterate over the directory structure in some intelligent way, and output basecalled files separately for each sample. Or, alternatively, to set multiple instances of wf-basecalling to run sequentially.

Describe alternatives you've considered

What I am doing now is running wf-basecalling in batches of two. I tried starting more than two instances, but the third one ran into errors due to insufficient memory. That means the machine will idle when I do not start new batches in time, and I have to keep interacting with it.

Another alternative I can think of is running wf-basecalling on the top-level directory, which would give me two large alignment files, one for all passes and one for all fails. Then once it all finishes, split those two files back into the separate samples. But that would leave me waiting ages to finally get all the results in one go rather than getting a steady trickle of samples coming through, and I am guessing the duplicated effort of merging and splitting will take a while longer as well.

Additional context

Perhaps I should attempt to run this on a cluster instead, but it would be nice to make use of all that built-in GPU power sitting right there...

SamStudio8 commented 1 year ago

wf-basecalling does not currently support handling more than one sample in this way, but we are looking at ways to make the "ingress" part of the pipeline more flexible. I appreciate that won't help you now, but in the meantime it would be possible to write a bash script that iterates over your samples and calls the wf-basecalling workflow from the command line once per sample. The command nextflow run epi2me-labs/wf-basecalling --help will give an example of the command line invocation.
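
For illustration, the per-sample loop suggested above could look roughly like the sketch below. It is written in Python rather than bash, and it is only a sketch under assumptions, not an official recipe: the helper names (sample_jobs, build_command, run_all) are hypothetical, and the option names --input and --out_dir are assumed and should be checked against the --help output. Any other options your current invocation uses (model, reference, and so on) are passed through unchanged via extra_args.

```python
import subprocess
from pathlib import Path


def sample_jobs(data_root, wanted, out_root):
    """Yield (label, input_dir, out_dir) for each requested (experiment, sample) pair."""
    for experiment, sample in wanted:
        label = f"{experiment}.{sample}"
        yield label, Path(data_root) / experiment / sample, Path(out_root) / label


def build_command(input_dir, out_dir, extra_args=()):
    """Build one workflow invocation as an argument list (no shell quoting needed)."""
    # ASSUMPTION: --input and --out_dir are the workflow's option names;
    # confirm with `nextflow run epi2me-labs/wf-basecalling --help`.
    return [
        "nextflow", "run", "epi2me-labs/wf-basecalling",
        "--input", str(input_dir),
        "--out_dir", str(out_dir),
        *extra_args,  # options copied from your existing invocation
    ]


def run_all(jobs, extra_args=(), runner=subprocess.run):
    """Run one workflow per sample, strictly one after another.

    Returns the labels of samples that were missing or whose run failed,
    so the remaining samples still get processed.
    """
    failed = []
    for label, input_dir, out_dir in jobs:
        if not input_dir.is_dir():
            failed.append(label)
            continue
        result = runner(build_command(input_dir, out_dir, extra_args))
        if result.returncode != 0:
            failed.append(label)
    return failed


# Example call (paths and sample names taken from the layout in the question):
# wanted = [("expA", "sample1"), ("expA", "sample2"), ("expC", "sample5")]
# failed = run_all(sample_jobs("/data", wanted, "/data/rebasecalled"))
```

Because each run starts only after the previous one has finished, results arrive one sample at a time and the machine does not sit idle between batches. Each sample gets its own output directory named experiment.sample; the file names inside it are whatever the workflow itself produces.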

mantczakaus commented 7 months ago

Hi, our lab would also appreciate the possibility of running the pipeline on multiple samples, e.g. using a sample sheet. I was wondering if there is a concrete plan to implement that and, if so, when we could expect this enhancement. Best wishes, Magdalena Antczak
