broadinstitute / gatk

Official code repository for GATK versions 4 and up
https://software.broadinstitute.org/gatk

Document how multi-threading support works in GATK4 #2345

Open magicDGS opened 7 years ago

magicDGS commented 7 years ago

In the classic GATK, walkers had the option to be multi-threaded in two different ways (the -nt data-threads option and the -nct CPU-threads-per-data-thread option).

Because the new framework's walkers have only one apply() function, the previous design may not be applicable. Nevertheless, it would be useful to have a way that allows a tool to apply that function in a multi-threaded way. Is there any plan to implement something similar in GATK4?

droazen commented 7 years ago

In GATK4, the way to make a tool multithreaded is to implement it as a Spark tool. All Spark tools can be trivially parallelized across multiple threads using the local runner, and across a cluster using spark-submit or gcloud.

We wanted to avoid the complexities of implementing our own map/reduce framework, as was done in previous versions of the GATK, and instead rely on a standard, third-party framework to keep the GATK4 engine as simple as possible.

magicDGS commented 7 years ago

I don't know too much about Spark, so maybe this is a stupid question: how can a Spark tool be run with multiple threads on a single computer? I mean, that requires some setup of the local computer, doesn't it?

magicDGS commented 7 years ago

And thank you very much for the quick answer, @droazen!

lbergelson commented 7 years ago

@magicDGS You can run a Spark tool on a single computer with N threads by specifying --sparkMaster 'local[N]', where N is the number of threads you want to use. If that number is big (>8 or so) you might want to consider setting up a Spark master and using YARN, which is a bit more complicated but not very difficult. If that's the case, let me know and I can point you to some resources. For use on the average laptop it makes sense to just run with --sparkMaster local.
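
Roughly speaking, local[N] just tells Spark to run the whole job inside the current JVM with N worker threads instead of on a cluster. Here's a minimal sketch of what that master URL means, using the plain Spark Java API directly rather than a GATK tool (the app name and toy data are made up for illustration); as far as I understand, --sparkMaster hands this same string to Spark:

```java
import java.util.Arrays;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;

public class LocalSparkDemo {
    public static void main(String[] args) {
        // "local[4]" = run Spark inside this JVM with 4 worker threads;
        // this is the kind of master URL that --sparkMaster 'local[N]' selects.
        SparkConf conf = new SparkConf()
                .setAppName("local-thread-demo")   // hypothetical app name
                .setMaster("local[4]");
        try (JavaSparkContext ctx = new JavaSparkContext(conf)) {
            // toy workload split into 4 partitions, processed in parallel
            // by the 4 local threads
            long n = ctx.parallelize(Arrays.asList(1, 2, 3, 4, 5, 6, 7, 8), 4)
                        .map(x -> x * x)
                        .count();
            System.out.println("processed " + n + " elements");
        }
    }
}
```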

magicDGS commented 7 years ago

Thanks @lbergelson. But does that require the gatk-launch script, or just the shadow jar?

lbergelson commented 7 years ago

Ah, yes, that's kind of confusing actually...
The shadow jar includes a copy of Spark and all of its dependencies, so if you want to run Spark tools locally you can use the shadow jar. If you want to use an existing Spark cluster, which may have slightly different versions of Spark or its dependencies, then you need to use the spark jar. The spark jar doesn't include its own copy of Spark and expects that the cluster will provide the necessary dependencies. This avoids conflicts between different dependency versions.

You don't ever need to use gatk-launch, but it can make things easier if you want to run your code in different environments. It knows about 3 different ways to invoke Spark: 1) running in local mode with --sparkMaster local, 2) running on a cluster using spark-submit, and 3) running on a Google Dataproc cluster using gcloud. gatk-launch knows which environment needs which jar and will prompt you to create one if you don't have it.

gatk-launch also applies some default arguments when running on Spark; you may have to supply them yourself if you're not using it.

magicDGS commented 7 years ago

Thanks a lot for this explanation. For the moment I'm more interested in the toolkit that I'm implementing, which does not have a master script for running. It is good to know that if I implement a Spark tool it can run locally with the shadow jar. I will study Spark a bit more in the near future.

So just one last question, and then I will close this: why do some tools have both a Spark version and a normal version in GATK4? If the Spark version can run locally, is there any performance issue related to running it instead of the non-Spark one?

Thanks a lot for all your answers, it is very informative :)

lbergelson commented 7 years ago

Well, you're welcome to use gatk-launch as a launch script if you'd like (and feel free to rename it to whatever you like...).

There are a few reasons we have spark and non-spark versions of the tools.

  1. We wanted to port and validate certain tools as quickly as possible, and doing a direct port from GATK3 -> GATK4 was easier than Sparkifying them at the same time.

  2. There's a tradeoff in using Spark where you end up spending more total CPU hours in order to finish a job faster. Ideally this would be 1:1: double the number of cores and you halve the time to finish a job. It never scales perfectly, though; there's always some overhead for being parallel. Our production pipelines are extremely sensitive to cost and not very sensitive to runtime, so they prefer that we have a version optimized to use the fewest CPU hours even if that means a longer runtime. Other users prefer to be able to finish a job quickly and are willing to pay slightly more to do so, so we also have a Spark version.

  3. Some tools are complicated to make work well with Spark. Spark works best when you can divide the input data into independent shards and then process them separately. This is complicated for things like the AssemblyRegion walkers, where you need context around each location of interest. We had to add extra overlapping padding around each shard to avoid boundary issues where there are shard divisions (a toy sketch of the idea is below).
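
To make that concrete, here's a toy, self-contained sketch of the overlapping-padding idea (made-up numbers, not GATK code): each fixed-size shard is extended by a padding margin on both sides, so a position near a shard edge still has its surrounding context available in the shard that owns it.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

public class PaddedShardSketch {
    public static void main(String[] args) {
        final int contigLength = 10_000; // hypothetical contig length
        final int shardSize = 4_000;     // bases "owned" by each shard
        final int padding = 500;         // extra context on each side

        List<String> shards = new ArrayList<>();
        for (int start = 1; start <= contigLength; start += shardSize) {
            int end = Math.min(start + shardSize - 1, contigLength);
            int paddedStart = Math.max(1, start - padding);
            int paddedEnd = Math.min(contigLength, end + padding);
            shards.add(String.format(Locale.ROOT,
                    "shard %d-%d (padded to %d-%d)", start, end, paddedStart, paddedEnd));
        }
        // neighbouring shards overlap by up to 2*padding bases, so results for a
        // boundary-adjacent position are computed with full context and reported
        // only by the shard that owns that position
        shards.forEach(System.out::println);
    }
}
```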

We don't yet fully understand Spark performance and its caveats; we're looking into that actively now. We hope that we'll be able to optimize our tools so that a Spark pipeline of several tools in series is faster than running the individual non-Spark versions, since it lets us avoid things like loading the bam file multiple times from disk. Whether or not we can achieve this is still an open question, though.

magicDGS commented 7 years ago

Thank you very much for the detailed answer, @lbergelson. I understand points 1 and 3, but regarding 2: is there also a cost to running a Spark tool with 1 thread? I would love to use the framework to "sparkify" my tools, but I would like to be sure that there is no cost for running them locally without multi-threading...

droazen commented 7 years ago

We've found that, generally speaking, you do pay a penalty on single-core performance when a tool becomes a Spark tool, but you gain the ability to easily scale to multiple cores and get the job done quickly. This is why we've been maintaining both Spark and non-Spark versions of important tools. Whether this will be the case for your tools as well can only be determined by profiling.

If you extract the logic of your tool into a separate class, it's usually possible to call that shared code from both the Spark and walker frameworks without much or any code duplication. See BaseRecalibrator and BaseRecalibratorSpark for an example of this.
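
For what it's worth, that pattern can look roughly like the hypothetical sketch below (made-up names and a trivial "count bases" engine, not GATK's actual BaseRecalibrator code): the per-read logic lives in one Serializable engine class, the walker-style tool feeds it one read per apply() call, and the Spark tool creates one engine per partition and combines the partial results.

```java
import java.io.Serializable;
import java.util.Collections;

import org.apache.spark.api.java.JavaRDD;

/** Shared engine holding the actual per-read logic (hypothetical example). */
class CoverageEngine implements Serializable {
    private long basesSeen = 0;
    public void processRead(String readBases) { basesSeen += readBases.length(); }
    public long result() { return basesSeen; }
}

/** Walker-style tool: the framework calls apply() once per read. */
class CountBasesWalker {
    private final CoverageEngine engine = new CoverageEngine();
    public void apply(String readBases) { engine.processRead(readBases); }
    public long onTraversalSuccess() { return engine.result(); }
}

/** Spark tool: same engine, one instance per partition, partial results summed. */
class CountBasesSpark {
    public long run(JavaRDD<String> reads) {
        return reads.mapPartitions(iter -> {
            CoverageEngine engine = new CoverageEngine();
            iter.forEachRemaining(engine::processRead);
            return Collections.singletonList(engine.result()).iterator();
        }).reduce(Long::sum);
    }
}
```

The Spark version would be driven from a JavaSparkContext exactly like the local[N] example earlier in this thread, so the only duplicated code is the thin wiring around the shared engine.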

magicDGS commented 7 years ago

Thanks a lot for all your feedback about this, @lbergelson and @droazen. From my side this could be closed now, although it may be useful to have some of this information in the Wiki to avoid confusion.

Thank you very much again!

vdauwera commented 7 years ago

👍 to archiving this content on the wiki -- plenty of great information in here (I've been lurking)

droazen commented 7 years ago

Changing this into a documentation ticket.