droazen closed this issue 7 years ago
I ran FindBreakpointEvidenceSpark
and did some high-level checks to see if there are any opportunities for performance improvements. (cc @tedsharpe @cwhelan)
This is the command line I ran. (Earlier I had run more executors with smaller memory settings, but the job didn't complete then.)
./gatk-launch FindBreakpointEvidenceSpark \
-I hdfs:///user/$USER/broad-svdev-test-data/data/NA12878_PCR-_30X.bam \
-O hdfs:///user/$USER/broad-svdev-test-data/assembly \
--exclusionIntervals hdfs:///user/$USER/broad-svdev-test-data/reference/GRCh37.kill.intervals \
--kmersToIgnore hdfs:///user/$USER/broad-svdev-test-data/reference/Homo_sapiens_assembly38.dups \
-- \
--sparkRunner SPARK --sparkMaster yarn-client --sparkSubmitCommand spark2-submit \
--driver-memory 16G \
--num-executors 5 \
--executor-cores 7 \
--executor-memory 25G
What does FindBreakpointEvidenceSpark do, from the perspective of Spark?
A few observations:
Overall, it looks like it’s performing pretty well. Very little data is being shuffled relative to the size of the input (~6GB shuffled for 133GB of input), so it’s not worth optimizing the shuffle data structures.
The input data is being read multiple times, so it might be worth seeing whether Spark can cache it to avoid repeated reads from disk. This is only worthwhile if the cluster has enough memory to hold the input (which will be bigger in memory than on disk) plus enough headroom for the processing itself, which, as we saw, is quite memory-hungry anyway.
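A back-of-envelope check using the numbers from this run (133GB of input, 5 executors × 25G) suggests caching wouldn't fit. The 1.5× in-memory inflation factor below is an illustrative assumption, not a measured value:

```java
public class CacheBudget {
    // Hypothetical helper: does the input fit in aggregate executor memory
    // after accounting for in-memory expansion of deserialized objects?
    static boolean fitsInMemory(double inputGb, double inflation, double clusterGb) {
        return inputGb * inflation <= clusterGb;
    }

    public static void main(String[] args) {
        double inputGb = 133.0;      // input BAM size reported above
        double inflation = 1.5;      // assumed in-memory expansion (illustrative)
        double clusterGb = 5 * 25.0; // --num-executors 5 x --executor-memory 25G
        System.out.println("cacheable: " + fitsInMemory(inputGb, inflation, clusterGb));
    }
}
```

Under these assumptions the answer is no (~200GB needed vs. 125GB available), which matches the caveat above about needing sufficient cluster memory.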
There might be some CPU efficiencies to pursue, especially if some code paths are creating a lot of objects that need garbage collecting (as Jobs 4 and 5 seem to be).
Jobs 4 and 5 seem to have some skew (judging from the task time distribution in the UI). You might investigate this by logging the amount of data each task processes (or, instead of logging, emitting a per-task summary as an additional output, or using a Spark accumulator), and then seeing whether there's a way to make it more uniform.
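Once per-task record counts have been collected (via logging or an accumulator, as suggested above), a simple summary statistic makes the skew concrete. This is a hypothetical helper, not part of the tool; a max/median ratio near 1 means balanced partitions, a large ratio means skew:

```java
import java.util.Arrays;

public class SkewCheck {
    // Ratio of the largest task's record count to the median task's count.
    static double skewRatio(long[] taskSizes) {
        long[] sorted = taskSizes.clone();
        Arrays.sort(sorted);
        double median = sorted.length % 2 == 1
                ? sorted[sorted.length / 2]
                : (sorted[sorted.length / 2 - 1] + sorted[sorted.length / 2]) / 2.0;
        return sorted[sorted.length - 1] / median;
    }

    public static void main(String[] args) {
        long[] balanced = {100, 110, 95, 105};  // illustrative task sizes
        long[] skewed   = {100, 100, 100, 900};
        System.out.println(skewRatio(balanced)); // close to 1
        System.out.println(skewRatio(skewed));   // 9.0
    }
}
```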
Thanks very much for your analysis.
Job 4 does create a lot of garbage, but that appears to be inevitable whenever you are dealing with a PairRDD: you have to use a Tuple2 to represent key and value rather than a more memory-conservative custom data object. You end up with a gazillion tiny objects that survive only during the shuffle. Too bad they didn't base PairRDD on an interface like Map.Entry. Also too bad that you cannot force a shuffle on a (plain old, non-Pair) RDD. Why not just treat it as a key-only structure and allow repartitioning? I mention this not merely to whine, but also in the faint hope that you've developed some helpful workarounds.
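One general workaround sketch for the Tuple2 allocation pressure (not a Spark API, just a plain-Java technique): when both key and value fit in 32 bits, pack them into a single primitive long so a partition's pairs can live in a long[] instead of millions of short-lived Tuple2 objects. The helper names below are hypothetical:

```java
public class PackedPair {
    // Pack a 32-bit key and 32-bit value into one primitive long.
    static long pack(int key, int value) {
        return ((long) key << 32) | (value & 0xFFFFFFFFL);
    }

    static int key(long packed)   { return (int) (packed >>> 32); }
    static int value(long packed) { return (int) packed; }

    public static void main(String[] args) {
        long p = pack(42, -7);
        System.out.println(key(p) + " " + value(p)); // 42 -7
    }
}
```

This only helps where the RDD element type is under your control (e.g. an RDD of longs sorted by the high bits); it doesn't remove the Tuple2s that PairRDD operations themselves create.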
I don't think we have enough memory to persist the reads, but we can revisit that later.
Job 5 is doing a lot of computation. It's decomposing each read into kmers and testing each kmer to see whether it exists in a large hash table. I don't think there's much opportunity for further optimization -- I knew this would be a bottleneck and tried my best to make the code efficient.
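A minimal sketch of the hot loop described above, assuming a set-membership test per kmer (names and the value of k are illustrative, not the tool's actual code). A read of length L yields L - k + 1 kmers, so this inner loop runs billions of times over a 30X genome:

```java
import java.util.Set;

public class KmerScan {
    // Count how many of the read's kmers appear in the ignore set.
    static int countIgnoredKmers(String read, int k, Set<String> ignore) {
        int hits = 0;
        for (int i = 0; i + k <= read.length(); i++) {
            if (ignore.contains(read.substring(i, i + k))) hits++;
        }
        return hits;
    }

    public static void main(String[] args) {
        // "ACGTACG" with k=3 yields 5 kmers; "ACG" occurs twice.
        System.out.println(countIgnoredKmers("ACGTACG", 3, Set.of("ACG"))); // 2
    }
}
```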
The skew in task size is definitely a problem, and I'll be looking for opportunities to address that issue.
Thanks again.
Closing -- this is done.
This is a place for @tomwhite to put his profiling results on the SV tools