Profile SV tools - Githubissues

droazen commented 7 years ago

This is a place for @tomwhite to put his profiling results on the SV tools

tomwhite commented 7 years ago

I ran FindBreakpointEvidenceSpark and did some high-level checks to see if there are any opportunities for performance improvements. (cc @tedsharpe @cwhelan)

This is the command line I ran. (Earlier I had run more executors with smaller memory settings, but the job didn't complete then.)

./gatk-launch FindBreakpointEvidenceSpark \
  -I hdfs:///user/$USER/broad-svdev-test-data/data/NA12878_PCR-_30X.bam \
  -O hdfs:///user/$USER/broad-svdev-test-data/assembly \
  --exclusionIntervals hdfs:///user/$USER/broad-svdev-test-data/reference/GRCh37.kill.intervals \
  --kmersToIgnore hdfs:///user/$USER/broad-svdev-test-data/reference/Homo_sapiens_assembly38.dups \
  -- \
  --sparkRunner SPARK --sparkMaster yarn-client --sparkSubmitCommand spark2-submit\
  --driver-memory 16G \
  --num-executors 5 \
  --executor-cores 7 \
  --executor-memory 25G

What does FindBreakpointEvidenceSpark do, from the perspective of Spark?

[runTool] filter out secondary and supplementary alignments
[getMappedQNamesSet] filter out duplicate reads, reads that failed vendor checks, unmapped reads
Job 0 [ReadMetadata] mapPartitions to find partition stats
Job 1 [getIntervals] filter and multiple map partitions to find breakpoint intervals
Job 2 [removeHighCoverageIntervals] mapPartitionsToPair to find coverage for each interval, then reduceByKey
Job 3 [getQNames] mapPartitions
Job 4 [addAssemblyQNames -> getKmerIntervals] mapPartitionsToPair, then reduceByKey, then mapPartitions
Job 5 [getAssemblyQNames] mapPartitions twice and a collect
Job 6 [generateFastqs] mapPartitions and combineByKey, then write FASTQ to files

A few observations:

Jobs 0,1,3 are simple map jobs - very quick 1-2 mins each.
Job 2 is a simple MR, with a tiny shuffle to sum by key (<1MB of shuffle data)
Job 4 takes a bit longer longer (3min), and shuffles ~3GB. This is a lot faster than when I ran it before with less memory, when it took 9 min. Is it creating a lot of garbage? If you wanted to speed things up you might look at what this is doing on a local machine and see if there are any opportunities to improve CPU efficiency.
Job 5 takes ~8 mins, and has no shuffle. CPU intensive processing again?
Job 6 take a little over 3 mins, shuffling ~3GB.

Overall, it looks like it’s performing pretty well. There is very little data being shuffled relative to the size of the input (~6GB to 133GB input), so it’s not worth looking into optimizing the data structures there.

The input data is being read multiple times, so it might be worth seeing if it can be cached by Spark to avoid reading from disk over and over again. This is only worth it if you have sufficient memory available across the cluster to hold the input (which will be bigger than the on-disk size) plus enough memory for the processing, which as we saw is quite memory hungry anyway.

There might be some CPU efficiencies to pursue, especially if some code paths are creating a lot of objects that need garbage collecting (as Jobs 4 and 5 seem to be).

Jobs 4 and 5 seem to have some skew (judging from the task time distribution in the UI). You might investigate this by logging the amount of data that each task processes (or rather than logging, generating another output that is some description of the task data - or use a Spark accumulator), and then seeing if there's some way to make it more uniform.

tedsharpe commented 7 years ago

Thanks very much for your analysis.

Job 4 does create a lot of garbage, but that appears to be inevitable whenever you are dealing with a PairRDD: You have to use a Tuple2 to represent key and value rather than using a more memory-conservative custom data object. You end up with a gazillion tiny objects that survive only during the shuffle. Too bad they didn't base PairRDD on an interface like Map.Entry. Also too bad that you cannot force a shuffle on a (plain old, non-Pair) RDD. Why not just treat it as a key-only structure and allow repartitioning? I mention this not merely to whine, but also in the faint hope that you've developed some helpful workarounds.

I don't think we have enough memory to persist the reads, but we can revisit that later.

Job 5 is doing a lot of computation. It's turning each read into kmers and testing each of those kmers to see if they exist in a large hash table. I don't think there's much opportunity for further optimization -- I knew this would be a bottleneck and tried my best to make the code efficient.

The skew in task size is definitely a problem, and I'll be looking for opportunities to address that issue.

Thanks again.

droazen commented 7 years ago

Closing -- this is done.

broadinstitute / gatk

Profile SV tools #2458