BD2KGenomics / toil-scripts

Toil workflows for common genomic pipelines
Apache License 2.0

ADAM Recompute #130

Open fnothaft opened 8 years ago

fnothaft commented 8 years ago

Related to #113, but all the samples, not just 10, and just ADAM, no GATK.

hannes-ucsc commented 8 years ago

In lieu of BD2KGenomics/toil#644, BD2KGenomics/toil#703 and BD2KGenomics/toil#706, we are pondering the alternative solution of running a single, separate, large Spark cluster, provisioned by cgcloud, that is scaled up on demand. Thoughts:

  1. We can run 260 alignments and GATKs in parallel but we can't run 260 ADAM jobs in parallel. IOW, we need a throttle on the number of ADAM jobs running concurrently.
    • I'm currently thinking of just emulating a semaphore S on SimpleDB. The semaphore S is initialized to the max number of concurrent ADAM jobs, i.e. jobs of the type that @fnothaft implements in #129; let's call it N. So initially S = N. We insert jobs before and after the ADAM job in the DAG that acquire (S--) and release (S++) the semaphore, respectively. This should work well with Toil retries. The überscript scales the Spark cluster up to M = (N - S) * P, with P being the number of Spark nodes allocated per ADAM job.
  2. The Spark cluster never gets scaled down.
  3. Do cgcloud's Spark clusters support automatic upscaling?
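The semaphore scheme in point 1 can be sketched as follows. This is a minimal, hypothetical illustration of the acquire/release logic: an in-memory dict with compare-and-swap stands in for SimpleDB's conditional puts, and the names `Semaphore`, `try_acquire`, `release`, and `spark_nodes_needed` are made up for the sketch. A real implementation would issue conditional writes against SimpleDB and retry with backoff on contention.

```python
import threading


class Semaphore(object):
    """Counting semaphore emulated with compare-and-swap writes.

    Stands in for a SimpleDB item holding S: each update succeeds only
    if the stored value still matches the value we read, mimicking a
    conditional put.
    """

    def __init__(self, max_concurrent_adam_jobs):
        self._store = {'S': max_concurrent_adam_jobs}  # initially S = N
        self._lock = threading.Lock()  # models the store's per-write atomicity

    def _conditional_put(self, expected, new):
        with self._lock:
            if self._store['S'] == expected:
                self._store['S'] = new
                return True
            return False  # someone else updated S; caller retries

    def try_acquire(self):
        """The job inserted before the ADAM job in the DAG (S--)."""
        s = self._store['S']
        return s > 0 and self._conditional_put(s, s - 1)

    def release(self):
        """The job inserted after the ADAM job in the DAG (S++)."""
        while True:
            s = self._store['S']
            if self._conditional_put(s, s + 1):
                return


def spark_nodes_needed(n, s, p):
    """Cluster size M = (N - S) * P: Spark nodes per ADAM job times the
    number of ADAM jobs currently holding the semaphore."""
    return (n - s) * p
```

For example, with N = 4 and P = 8, two running ADAM jobs (S = 2) would have the überscript scale the cluster to M = (4 - 2) * 8 = 16 Spark nodes. Because acquire and release are idempotent retries of a conditional write, a Toil-level retry of either job re-runs safely.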
fnothaft commented 8 years ago

Thanks for writing this up @hannes-ucsc; I was stuck in meetings from our scrum until 5PM and just got home, etc.

As an aside, are you/@benedictpaten working on https://github.com/BD2KGenomics/toil/issues/703?

fnothaft commented 8 years ago

> @fnothaft #129

See #136.

> @fnothaft #137

See #138.

hannes-ucsc commented 8 years ago

> As an aside, are you/@benedictpaten working on BD2KGenomics/toil#703?

Benedict will be. But he wants to do things right, which will take longer than would fit in the time frame of this paper.

I am now leaning towards the single cluster solution and removing the dependence on BD2KGenomics/toil#706 and BD2KGenomics/toil#703 from this issue. If you agree, I will remove them from the task list. I will work on testing spot instances with cgcloud Spark clusters.

fnothaft commented 8 years ago

> As an aside, are you/@benedictpaten working on BD2KGenomics/toil#703?
>
> Benedict will be. But he wants to do things right, which will take longer than would fit in the time frame of this paper.
>
> I am now leaning towards the single cluster solution and removing the dependence on BD2KGenomics/toil#706 and BD2KGenomics/toil#703 from this issue. If you agree, I will remove them from the task list. I will work on testing spot instances with cgcloud Spark clusters.

Let's discuss this on scrum today. Can @benedictpaten join us for scrum?

fnothaft commented 8 years ago

@hannes-ucsc how much work is involved in changing the Spark version of cgcloud.spark?

fnothaft commented 8 years ago

> @hannes-ucsc how much work is involved in changing the Spark version of cgcloud.spark?

Looks pretty straightforward, actually.