epi2me-labs / wf-clone-validation


How to increase throughput and process samples in parallel #56

Closed joekitsmith closed 1 month ago

joekitsmith commented 1 month ago

I have been running wf-clone-validation in Google Cloud Batch, but I've been observing a linear relationship between the number of samples and the total runtime. My assumption was that running in GCP would allow full horizontal scaling, and I want to confirm whether this is possible.

I'm launching the pipeline from a Google Cloud Run Job with the following resources:

Running a single sample takes around an hour whereas running 43 samples takes > 5 hours (I haven't run it long enough to get an accurate duration). There is clearly some scalability but given these samples are independent, I would expect the ability to process them concurrently.
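For anyone hitting the same wall: wf-clone-validation is a Nextflow workflow, so sample-level concurrency is governed by Nextflow's executor settings rather than the workflow's own flags. A minimal, hypothetical `nextflow.config` override is sketched below — the executor name is the real `google-batch` executor, but the project/location values are placeholders and the `queueSize` figure is illustrative:

```groovy
// Hypothetical override: with the google-batch executor, each Nextflow task
// becomes its own Batch job, and queueSize caps how many run concurrently.
process {
    executor = 'google-batch'
}
executor {
    queueSize = 100              // illustrative: allow many tasks in flight
}
google {
    project  = 'my-project'      // placeholder
    location = 'us-central1'     // placeholder
}
```

If `queueSize` is left at a small value, or if the workflow is executed with the local executor inside a single Cloud Run Job, tasks queue up behind the job's own vCPU limit, which would produce exactly the linear scaling described above.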

I tried to implement my own parallel processing using Python's subprocess module, but the pipeline detected that another process held the lock file, so some samples ended up not being processed.
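On the lock-file collision specifically: Nextflow keeps per-run state (including the lock) in the launch directory's `.nextflow` folder, so concurrent runs launched from the same directory will clash. A minimal sketch of the subprocess approach with each run isolated in its own launch and work directory is below — the `--sample_sheet` parameter name and the one-sheet-per-batch layout are assumptions, not confirmed against this workflow's CLI:

```python
import subprocess
from pathlib import Path

def build_cmd(sample_sheet: Path, run_dir: Path) -> list[str]:
    # Hypothetical invocation: one wf-clone-validation run per batch of
    # samples, each with its own work directory so runs don't share state.
    return [
        "nextflow", "run", "epi2me-labs/wf-clone-validation",
        "--sample_sheet", str(sample_sheet),   # assumed parameter name
        "-work-dir", str(run_dir / "work"),
    ]

def launch_runs(sample_sheets: list[str], base_dir: Path) -> list[int]:
    procs = []
    for i, sheet in enumerate(sample_sheets):
        run_dir = base_dir / f"run_{i}"
        run_dir.mkdir(parents=True, exist_ok=True)
        # cwd=run_dir gives each run a private .nextflow directory (and
        # therefore a private lock), avoiding the collision seen when all
        # subprocesses were launched from one directory.
        procs.append(subprocess.Popen(build_cmd(Path(sheet), run_dir),
                                      cwd=run_dir))
    return [p.wait() for p in procs]
```

This only sidesteps the lock; whether it improves throughput still depends on the executor having headroom to run the extra tasks.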

I also tried the --threads option, setting it to 8 to match my vCPU count, but this didn't improve throughput.

Is my ability to scale limited by the resource constraints of my Cloud Run Job? Or is there something else I'm missing that would let me fully horizontally scale this workflow and process 43 samples in roughly the same time as 1?

N.B.: I'm also trying to achieve the same thing with wf-amplicon.

SamStudio8 commented 1 month ago

Hi @joekitsmith, we don't support execution of workflows in GCP. If you have questions on running workflows in GCP generally, I'd direct your questions to the Nextflow community at https://community.seqera.io/.