broadinstitute / gatk

Official code repository for GATK versions 4 and up
https://software.broadinstitute.org/gatk

Profile SV tools #2458

Closed: droazen closed this issue 7 years ago

droazen commented 7 years ago

This is a place for @tomwhite to put his profiling results on the SV tools

tomwhite commented 7 years ago

I ran FindBreakpointEvidenceSpark and did some high-level checks to see if there are any opportunities for performance improvements. (cc @tedsharpe @cwhelan)

This is the command line I ran. (Earlier I had run more executors with smaller memory settings, but the job didn't complete then.)

./gatk-launch FindBreakpointEvidenceSpark \
  -I hdfs:///user/$USER/broad-svdev-test-data/data/NA12878_PCR-_30X.bam \
  -O hdfs:///user/$USER/broad-svdev-test-data/assembly \
  --exclusionIntervals hdfs:///user/$USER/broad-svdev-test-data/reference/GRCh37.kill.intervals \
  --kmersToIgnore hdfs:///user/$USER/broad-svdev-test-data/reference/Homo_sapiens_assembly38.dups \
  -- \
  --sparkRunner SPARK --sparkMaster yarn-client --sparkSubmitCommand spark2-submit \
  --driver-memory 16G \
  --num-executors 5 \
  --executor-cores 7 \
  --executor-memory 25G

What does FindBreakpointEvidenceSpark do, from the perspective of Spark?

A few observations:

Overall, it looks like it’s performing pretty well. Very little data is shuffled relative to the input (roughly 6 GB shuffled against 133 GB of input), so the data structures involved in the shuffle aren’t worth optimizing.

The input data is read multiple times, so it might be worth caching it in Spark to avoid repeated reads from disk. This only pays off if the cluster has enough memory to hold the input (which is larger in memory than on disk) plus headroom for the processing itself, which, as we saw, is already quite memory hungry.
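For illustration, a minimal sketch of what caching might look like, assuming the reads end up in an ordinary JavaRDD; the method and variable names here are placeholders, not the tool's actual code:

```java
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.storage.StorageLevel;

// Cache the reads once so later jobs reuse them instead of re-reading the
// BAM from HDFS. Serialized storage trades some CPU on access for a much
// smaller heap footprint, and spills to disk if the cache doesn't fit.
static void runWithCachedReads(JavaRDD<String> reads) {
    reads.persist(StorageLevel.MEMORY_AND_DISK_SER());
    long total = reads.count();                                // first action populates the cache
    long kept = reads.filter(r -> r.length() >= 100).count();  // later jobs hit the cache
    reads.unpersist();                                         // free the blocks when done
}
```

MEMORY_AND_DISK_SER keeps the serialized form, which is usually the only realistic option when the deserialized input would not fit in cluster memory.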

There might be some CPU efficiencies to pursue, especially if some code paths are creating a lot of objects that need garbage collecting (as Jobs 4 and 5 seem to be).

Jobs 4 and 5 seem to have some skew, judging from the task time distribution in the UI. You could investigate by measuring how much data each task processes (by logging it, by emitting it as a separate diagnostic output, or via a Spark accumulator) and then looking for a way to make the distribution more uniform; a sketch of one approach follows.
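One low-tech way to get per-task numbers, sketched here on a plain JavaRDD of strings (a diagnostic pass I'm suggesting, not anything in the tool today):

```java
import java.util.Collections;
import org.apache.spark.api.java.JavaRDD;

// Emit one "partitionIndex<TAB>recordCount" line per task, which is easier
// to aggregate and sort than the UI's task-time histogram.
static JavaRDD<String> partitionSizes(JavaRDD<String> rdd) {
    return rdd.mapPartitionsWithIndex((idx, it) -> {
        long n = 0;
        while (it.hasNext()) { it.next(); n++; }
        return Collections.singletonList(idx + "\t" + n).iterator();
    }, false);
}
```

Saving that output and sorting by count shows immediately whether a handful of partitions carry most of the records.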

tedsharpe commented 7 years ago

Thanks very much for your analysis.

Job 4 does create a lot of garbage, but that appears to be inevitable whenever you deal with a PairRDD: you have to use a Tuple2 to represent key and value rather than a more memory-conservative custom data object, so you end up with a gazillion tiny objects that survive only for the duration of the shuffle. Too bad they didn't base PairRDD on an interface like Map.Entry. Also too bad that you can't force a shuffle on a plain (non-Pair) RDD: why not treat it as a key-only structure and allow repartitioning? I mention this not merely to whine, but also in the faint hope that you've developed some helpful workarounds.
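For concreteness, here is one partial mitigation of the Tuple2 churn (a sketch under assumed names, not our actual code): pre-aggregating inside each partition means only one Tuple2 per distinct key leaves the partition, instead of one per record.

```java
import java.util.HashMap;
import java.util.Map;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import scala.Tuple2;

// The plain pattern, kmers.mapToPair(k -> new Tuple2<>(k, 1)).reduceByKey(...),
// allocates one Tuple2 per record. Combining locally first cuts that to one
// Tuple2 per distinct key per partition.
static JavaPairRDD<String, Integer> countWithLocalCombine(JavaRDD<String> kmers) {
    return kmers.mapPartitionsToPair(it -> {
        Map<String, Integer> local = new HashMap<>();
        while (it.hasNext()) {
            local.merge(it.next(), 1, Integer::sum);
        }
        return local.entrySet().stream()
                    .map(e -> new Tuple2<>(e.getKey(), e.getValue()))
                    .iterator();
    }).reduceByKey(Integer::sum);
}
```

reduceByKey already combines map-side, so the saving here is specifically the per-record Tuple2 allocations that happen before the combiner runs.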

I don't think we have enough memory to persist the reads, but we can revisit that later.

Job 5 is doing a lot of computation: it turns each read into kmers and tests each kmer for membership in a large hash table. I don't think there's much opportunity for further optimization -- I knew this would be a bottleneck and tried my best to make the code efficient.
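Schematically, the hot loop looks something like this (a simplified sketch using Strings; the real code is far more careful about avoiding per-kmer object creation):

```java
import java.util.Set;

// Slide a K-length window across the read and test each kmer against the
// ignore set. K and all names here are illustrative.
static final int K = 31;

static int countIgnoredKmers(String bases, Set<String> kmersToIgnore) {
    int hits = 0;
    for (int i = 0; i + K <= bases.length(); i++) {
        // substring allocates a new String per kmer; a 2-bits-per-base
        // packed encoding avoids that allocation entirely.
        if (kmersToIgnore.contains(bases.substring(i, i + K))) {
            hits++;
        }
    }
    return hits;
}
```

Over a 30X whole genome that inner test runs billions of times, which is why this stage dominates the CPU profile.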

The skew in task size is definitely a problem, and I'll be looking for opportunities to address that issue.

Thanks again.

droazen commented 7 years ago

Closing -- this is done.