NVIDIA-Genomics-Research / GenomeWorks

SDK for GPU accelerated genome assembly and analysis
https://clara-parabricks.github.io/GenomeWorks/
Apache License 2.0

[cudamapper] Default parameters should be hardened against many datasets / GPUs, match those of minimap2 #471

Closed edawson closed 4 years ago

edawson commented 4 years ago

Cudamapper is supposed to be a drop-in replacement for minimap2. Our default parameters, however, differ from those of minimap2. In addition, they seem to be unstable on many GPUs, causing crashes.

This is a major barrier to entry for users - they want the program to run to completion, even at the expense of performance.

Minimap2 defaults:

Most of these can be addressed by either a single cudamapper parameter or a combination of multiple parameters.

In the case of minibatch size, it's not so much about matching the number as it is about providing a stable CLI. I often have to tweak the -I, -i, -q, -Q, -c, -C and -m parameters to balance memory usage (e.g. to prevent out-of-memory errors) against performance. I think we should establish safe defaults for long reads on GPUs with 8GB, 16GB and 32GB of memory. Even though we programmatically check for max preallocated memory, we often seem to OOM due to index size parameterizations. My vote would be to prioritize stability and provide a one-pager on tuning for max performance (acknowledging the limits of each GPU given the maximum read size).

edawson commented 4 years ago

Here's an example where cudamapper fails with default parameters on a small (~65MB) FASTA file, likely because its longest read is >300 kilobases in length.

(base) eric@odin:~/sandbox/clara/build$ time ./cudamapper/cudamapper -k 15 -w 10 -m 6 ~/droso_split_1.fa ~/droso_split_1.fa > droso1.paf
-C / --target-indices-in-host-memory not set, using -Q / --query-indices-in-host-memory value: 10
-c / --target-indices-in-device-memory not set, using -q / --query-indices-in-device-memory value: 5
NOTE - Since query and target files are same, activating all_to_all mode. Query index size used for both files.
Query file: /home/eric/droso_split_1.fa, number of reads: 9999
Target file: /home/eric/droso_split_1.fa, number of reads: 9999
Using device memory cache of 6442450944 bytes
Device 0 took batch 1 out of 1 batches in total
terminate called after throwing an instance of 'claraparabricks::genomeworks::device_memory_allocation_exception'
  what():  Could not allocate device memory!
Aborted (core dumped)

real    0m0.576s
user    0m0.285s
sys     0m0.141s
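
The suspicion that a single very long read triggers the allocation failure can be checked by measuring the longest read in the input FASTA. The following is a minimal sketch using only the Python standard library; the function name `longest_read` is illustrative and not part of GenomeWorks, and it assumes a plain (uncompressed) FASTA file.

```python
import sys
from typing import Iterable, Tuple


def longest_read(fasta_lines: Iterable[str]) -> Tuple[str, int]:
    """Return (name, length) of the longest record in FASTA-formatted lines.

    Returns ("", 0) if the input contains no records.
    """
    best_name, best_len = "", 0
    name, length = None, 0
    for line in fasta_lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, so compare it first.
            if name is not None and length > best_len:
                best_name, best_len = name, length
            fields = line[1:].split()
            name = fields[0] if fields else ""
            length = 0
        elif name is not None:
            # Sequence lines may be wrapped, so accumulate across lines.
            length += len(line)
    # Account for the final record in the file.
    if name is not None and length > best_len:
        best_name, best_len = name, length
    return best_name, best_len


if __name__ == "__main__" and len(sys.argv) > 1:
    with open(sys.argv[1]) as handle:
        read_name, read_len = longest_read(handle)
    print(f"longest read: {read_name} ({read_len} bases)")
```

Running the script with the FASTA path as its only argument prints the name and length of the longest read, which can then be compared against the index size settings in use.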

minimap2 overlaps the file and takes ~4.5 seconds on six cores.

(base) eric@odin:~/sandbox/clara/build$ time ~/sandbox/minimap2/minimap2 -x ava-ont -X -k 15 -w 10 -t 6 ~/droso_split_1.fa ~/droso_split_1.fa > droso1.mm2.paf
[M::mm_idx_gen::1.012*1.22] collected minimizers
[M::mm_idx_gen::1.130*1.72] sorted minimizers
[M::main::1.130*1.72] loaded/built the index for 10000 target sequence(s)
[M::mm_mapopt_update::1.235*1.66] mid_occ = 45
[M::mm_idx_stat] kmer size: 15; skip: 10; is_hpc: 0; #seq: 10000
[M::mm_idx_stat::1.306*1.62] distinct minimizers: 10626237 (89.07% are singletons); average occurrences: 1.218; average spacing: 5.341; total length: 69125460
[M::worker_pipeline::4.614*2.26] mapped 10000 sequences
[M::main] Version: 2.17-r974-dirty
[M::main] CMD: /home/eric/sandbox/minimap2/minimap2 -x ava-ont -X -k 15 -w 10 -t 6 /home/eric/droso_split_1.fa /home/eric/droso_split_1.fa
[M::main] Real time: 4.624 sec; CPU: 10.453 sec; Peak RSS: 0.603 GB

real    0m4.640s
user    0m10.270s
sys     0m0.198s

Adjusting the -i parameter of cudamapper (here, -i 10) allows the run to complete. Runtime is about 1.4 seconds on a GeForce RTX 2060 Super.

(base) eric@odin:~/sandbox/clara/build$ time ./cudamapper/cudamapper -k 15 -w 10 -i 10 -m 6 ~/droso_split_1.fa ~/droso_split_1.fa > droso1.paf
-C / --target-indices-in-host-memory not set, using -Q / --query-indices-in-host-memory value: 10
-c / --target-indices-in-device-memory not set, using -q / --query-indices-in-device-memory value: 5
NOTE - Since query and target files are same, activating all_to_all mode. Query index size used for both files.
Query file: /home/eric/droso_split_1.fa, number of reads: 9999
Target file: /home/eric/droso_split_1.fa, number of reads: 9999
Using device memory cache of 6442450944 bytes
Device 0 took batch 1 out of 1 batches in total

real    0m1.423s
user    0m1.124s
sys     0m0.237s

edawson commented 4 years ago

This is now addressed in #478. I modified several of the parameters that were causing issues, which seems to have improved accuracy on staph, E. coli and Drosophila. However, we're now outputting a ton of PAF records (sometimes 2x to 10x as many as minimap2), which increases downstream assembly runtime. Planning to address this in a future PR.
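
One quick way to quantify the difference in output volume is to count the records in each PAF file, since PAF stores one overlap per line. This is a rough sketch, not a GenomeWorks tool; the helper names `count_paf_records` and `record_ratio` are made up for illustration, and the file names in the usage note are taken from the commands earlier in this thread.

```python
from typing import Iterable


def count_paf_records(lines: Iterable[str]) -> int:
    """Count non-empty lines; each one is a single PAF overlap record."""
    return sum(1 for line in lines if line.strip())


def record_ratio(cudamapper_lines: Iterable[str],
                 minimap2_lines: Iterable[str]) -> float:
    """Return how many times more records cudamapper emitted than minimap2."""
    n_minimap2 = count_paf_records(minimap2_lines)
    if n_minimap2 == 0:
        raise ValueError("minimap2 PAF contains no records")
    return count_paf_records(cudamapper_lines) / n_minimap2
```

For example, opening `droso1.paf` and `droso1.mm2.paf` and passing the two file handles to `record_ratio` gives the factor by which the cudamapper output exceeds minimap2's. A raw count says nothing about which overlaps are redundant, so it is only a first indicator.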