epi2me-labs / wf-clone-validation


How to increase throughput and process samples in parallel #56

Closed joekitsmith closed 1 month ago

joekitsmith commented 1 month ago

I have been running wf-clone-validation in Google Cloud Batch, but I've been observing a linear relationship between the number of samples and the total runtime. My assumption was that running in GCP would allow full horizontal scaling, and I want to confirm whether this is possible.

I'm launching the pipeline from a Google Cloud Run Job with the following resources:

Running a single sample takes around an hour whereas running 43 samples takes > 5 hours (I haven't run it long enough to get an accurate duration). There is clearly some scalability but given these samples are independent, I would expect the ability to process them concurrently.
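For anyone hitting the same wall: wf-clone-validation is a Nextflow workflow, so sample-level concurrency is governed by Nextflow's executor settings rather than the workflow's own flags. A minimal, hypothetical `nextflow.config` override is sketched below — the executor name is the real `google-batch` executor, but the project/location values are placeholders and the `queueSize` figure is illustrative:

```groovy
// Hypothetical override: with the google-batch executor, each Nextflow task
// becomes its own Batch job, and queueSize caps how many run concurrently.
process {
    executor = 'google-batch'
}
executor {
    queueSize = 100              // illustrative: allow many tasks in flight
}
google {
    project  = 'my-project'      // placeholder
    location = 'us-central1'     // placeholder
}
```

If `queueSize` is left at a small value, or if the workflow is executed with the local executor inside a single Cloud Run Job, tasks queue up behind the job's own vCPU limit, which would produce exactly the linear scaling described above.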

I tried to implement my own parallel processing using Python's subprocess module, but the pipeline detected that another process held the lock file, so some samples ended up not being processed.
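On the lock-file collision specifically: Nextflow keeps per-run state (including the lock) in the launch directory's `.nextflow` folder, so concurrent runs launched from the same directory will clash. A minimal sketch of the subprocess approach with each run isolated in its own launch and work directory is below — the `--sample_sheet` parameter name and the one-sheet-per-batch layout are assumptions, not confirmed against this workflow's CLI:

```python
import subprocess
from pathlib import Path

def build_cmd(sample_sheet: Path, run_dir: Path) -> list[str]:
    # Hypothetical invocation: one wf-clone-validation run per batch of
    # samples, each with its own work directory so runs don't share state.
    return [
        "nextflow", "run", "epi2me-labs/wf-clone-validation",
        "--sample_sheet", str(sample_sheet),   # assumed parameter name
        "-work-dir", str(run_dir / "work"),
    ]

def launch_runs(sample_sheets: list[str], base_dir: Path) -> list[int]:
    procs = []
    for i, sheet in enumerate(sample_sheets):
        run_dir = base_dir / f"run_{i}"
        run_dir.mkdir(parents=True, exist_ok=True)
        # cwd=run_dir gives each run a private .nextflow directory (and
        # therefore a private lock), avoiding the collision seen when all
        # subprocesses were launched from one directory.
        procs.append(subprocess.Popen(build_cmd(Path(sheet), run_dir),
                                      cwd=run_dir))
    return [p.wait() for p in procs]
```

This only sidesteps the lock; whether it improves throughput still depends on the executor having headroom to run the extra tasks.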

I also tried the --threads option, setting it to 8 to match my vCPU count, but this didn't improve throughput.

Is my ability to scale limited by the resource constraints of my Cloud Run Job? Or is there something else I'm missing that would let me fully horizontally scale this workflow and process 43 samples in roughly the same time as 1?

N.B.: I'm also trying to achieve the same thing with wf-amplicon.

SamStudio8 commented 1 month ago

Hi @joekitsmith, we don't support execution of workflows in GCP. If you have questions on running workflows in GCP generally, I'd direct your questions to the Nextflow community at https://community.seqera.io/.