broadinstitute / gatk

Official code repository for GATK versions 4 and up
https://software.broadinstitute.org/gatk

Shrink NIO buffer sizes for GenomicsDBImport down to the smallest values that still give good performance #2640

Closed droazen closed 7 years ago

droazen commented 7 years ago

--cloudPrefetchBuffer and --cloudIndexPrefetchBuffer each default to 40 MB, which is far too large for the GenomicsDBImport tool, as we are allocating 80 MB of buffer space per reader. (Batching 100 samples, for example, would mean 100 × 80 MB = 8 GB in prefetch buffers alone.) This will limit the number of samples we can batch together, and degrade GenomicsDB performance as a result.

Let's find the smallest values for these arguments that still give acceptable performance. We can do this by running GenomicsDBImport with, e.g., 10 samples at the default settings, then repeatedly halving the cloud buffer sizes until performance degrades too much. A sketch of such a sweep is below.
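A minimal sketch of that halving sweep, assuming a Linux VM with GNU time available; the workspace names, interval, and gs:// sample URLs are placeholders rather than real test inputs:

for BUF in 40 20 10 5 2 1; do
  # Time one import per buffer size; /usr/bin/time -v also reports peak RSS.
  /usr/bin/time -v ./gatk-launch GenomicsDBImport \
    --genomicsDBWorkspace workspace_${BUF}mb \
    --cloudPrefetchBuffer ${BUF} --cloudIndexPrefetchBuffer ${BUF} \
    -L 20:1-1000 \
    -V gs://bucket/sample1.g.vcf -V gs://bucket/sample2.g.vcf
done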

droazen commented 7 years ago

@Horneth of the red team has volunteered to take this on this week.

droazen commented 7 years ago

Note that it's possible that we will need more buffer space for the index than for the main file -- we should experiment both with runs that pass the same value to --cloudPrefetchBuffer and --cloudIndexPrefetchBuffer, and with runs that pass different values for these arguments.
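For instance (the buffer values here are purely illustrative, in MB, and the "..." stands for the usual workspace/interval/-V arguments shown in the full command below):

# Same buffer size for the data and the index:
./gatk-launch GenomicsDBImport ... --cloudPrefetchBuffer 10 --cloudIndexPrefetchBuffer 10
# More buffer for the index than for the main file:
./gatk-launch GenomicsDBImport ... --cloudPrefetchBuffer 5 --cloudIndexPrefetchBuffer 20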

droazen commented 7 years ago

Example command line for running GenomicsDBImport:

./gatk-launch GenomicsDBImport --genomicsDBWorkspace ${WORKSPACE_DIR} -L 20:1-1000 -V gs://bucket/sample1.g.vcf -V gs://bucket/sample2.g.vcf etc...

Where ${WORKSPACE_DIR} is the directory to which to write the output, and the -L option is given a genomic interval of appropriate size. Note that you'll want to consult @eitanbanks or @kcibul to find out what a realistic/plausible interval size for this tool will be in the final pipeline.

Horneth commented 7 years ago

I shared a Gdoc with some info about the runs I've done so far. As you'll see, the two flags don't seem to make much difference in terms of runtime, or even memory usage on the VM, which is surprising. Am I missing something?

lbergelson commented 7 years ago

@Horneth Thank you.

It's very strange. I would expect to see a difference. I wonder if something is wrong with the wiring of those arguments. What version of GATK are you running? (You can find out by running with --version.)

I'm not sure what to make of the memory usage. I'm not sure that the free command will tell you anything useful about Java memory, since the JVM usually expands to fill all available heap and then garbage collects as needed. A better estimate of memory use might be to set the -XX:+PrintGCDetails JVM option and look at what's retained after garbage collection. Or, if you're logging into the VMs, something like jstat or a proper profiler can tell you a lot more about the memory usage.
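For example, a quick way to sample the running JVM's heap from the VM, assuming a full JDK is installed (the pgrep pattern and the 10-second interval are illustrative):

# Print heap stats every 10 s; the OU column (old-gen used, in KB) right after
# a full collection approximates the memory the tool actually retains.
jstat -gc $(pgrep -f GenomicsDBImport) 10000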

lbergelson commented 7 years ago

Hmm, on further inspection, I'm not sure there was ever a version that didn't have those arguments wired up. Maybe there's a bug somewhere in there...

Horneth commented 7 years ago

I may have found something interesting. For this run: http://jg-11k.dsde-cromwell-dev.broadinstitute.org:8000/api/workflows/v2/ff0fd97b-e5f2-4cab-84de-672d706013c7/timing

The first 2 shards (40 MB and 20 MB) failed with OutOfMemory, but the other ones (10, 5, 2, 1, 0) didn't. The timing page doesn't show that because, as soon as the first shard failed, Cromwell stopped tracking the others, but I looked at all the stderr files manually. So the tool does seem to be using the buffer-size value properly. I'm going to run the shards separately so we can see the actual runtime for each of them.

Horneth commented 7 years ago

I updated the doc with some more info. TL;DR: with 100 samples there is a visible difference between different buffer sizes in terms of memory usage. However, the tool does not always exit properly when it runs out of memory, leaving the VM hanging.

kcibul commented 7 years ago

How much memory are you giving Java (the -Xmx parameter), and how much total memory does the VM have? I've seen odd behavior when the VM doesn't have enough headroom (~1-1.5 GB): processes start to die once Java consumes everything it can, right before it OOMs. That might be why you're seeing hangs on OOM errors?
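Illustratively, something like the following, assuming the gatk-launch wrapper of that era forwarded JVM flags via --javaOptions (sizes and paths are placeholders):

# On a ~32 GB VM, cap the heap a few GB below the total so the OS keeps headroom:
./gatk-launch --javaOptions "-Xmx28g" GenomicsDBImport --genomicsDBWorkspace ws -L 20:1-1000 -V gs://bucket/sample1.g.vcf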



Horneth commented 7 years ago

I asked for 32 GB, got ~50 GB, and gave the JVM 28 GB.

droazen commented 7 years ago

@Horneth:

@jean-philippe-martin Could you chime in here to remind us of your profiling results with NIO buffer sizes and the use case of a single query interval? Didn't you find that there was a very large performance difference between running with and without buffering?

jean-philippe-martin commented 7 years ago

I can confirm that in my own extensive runs of the tools I've sometimes seen them hang when out of memory instead of exiting (resulting in bad data for that run). My own benchmarking was with PrintReads on a large file, using various cache buffer sizes. I remember seeing an increase in performance up to about a 50 MB buffer size, after which it flattened out. I would expect a more CPU-intensive tool (or a more heavily loaded machine) to reduce the impact of the buffer size, as I/O ceases to be the bottleneck.

I also ran experiments on a 1-CPU machine with a VCF, loading a single interval, which I think matches what you are asking about. In that experiment 10 MB was enough: enlarging the cache beyond that did not bring any improvement, and of course too large a cache leads to running out of memory.

jean-philippe-martin commented 7 years ago

@Horneth: what @lbergelson said. Java will use all of the memory you give it, regardless of how much the program asks for. The only difference is how long it takes until the memory is used up and the garbage collector needs to kick in. Thus lowering the memory requirements of a Java program tends to increase its performance, up to the point where we cut memory that it actually needed.

I would suggest either measuring the memory immediately after a garbage collection or, more directly, measuring the performance of the program as you vary the buffer sizes. Make sure to control for cache effects, since the program is doing I/O.
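One common way to control for OS-level cache effects between timed runs on a Linux VM (this is a general technique, not anything GATK-specific, and it needs root):

# Flush dirty pages, then drop the page cache, dentries, and inodes so one
# run's cached data doesn't speed up the next:
sync
echo 3 | sudo tee /proc/sys/vm/drop_caches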

kcibul commented 7 years ago

btw -- here is a representative shard/interval (chr1:820849-1000848)

droazen commented 7 years ago

@Horneth Are you using single-core machines? You should be running this tool on at least 2 cores, as the GCS NIO code uses 2 threads.

@jean-philippe-martin What kind of speedup did you get with buffer vs. no buffer for a VCF on a multi-core machine?

lbergelson commented 7 years ago

I assume this is on a many-core machine, since it has upwards of 30 GB of memory...

As an addendum to what @jean-philippe-martin was saying about garbage collection: we're interested in the maximum size of the heap AFTER a full garbage collection, which should roughly correspond to the minimum memory required to run the program. You can have Java output this information by providing the -XX:+PrintGC and -XX:+PrintGCDetails JVM flags.
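A sketch of what that looks like when launching the jar by hand (the jar name and tool arguments are placeholders; the flag names are the Java 8 ones, where + enables and - disables):

# Log every collection. Each "Full GC" line reports heap occupancy as
# before->after(total); the "after" figure following a full GC approximates
# the minimum heap the run really needs.
java -XX:+PrintGC -XX:+PrintGCDetails -Xmx28g \
  -jar gatk.jar GenomicsDBImport --genomicsDBWorkspace ws -L 20:1-1000 -V gs://bucket/sample1.g.vcf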

Horneth commented 7 years ago

Here is what I have so far: https://gist.github.com/Horneth/4e52b94ae1f5a5ea654c2fa0f3a4139e

droazen commented 7 years ago

We've shrunk the defaults for the tool down to 2 MB in https://github.com/broadinstitute/gatk/pull/2671 based on our initial results. This may be adjusted further in response to additional profiling.

droazen commented 7 years ago

This is done! Closing.