Best effort auto-selection of MinHash.minBucketSize & MinHash.maxBucketSize if values are not provided.

Picking good values for minBucketSize & maxBucketSize is tricky for new data or a new genome. If minBucketSize is set too low, poor-quality alignment candidates are let through. If it is set too high, Shasta misses some good alignment candidates. One of the first pieces of feedback people get when they assemble a genome with Shasta is to adjust these settings for better results. It turns out that simple heuristics can find a not-terrible starting value.
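To make the idea concrete, here is a minimal sketch of one possible heuristic. It is not the code in this change, and the actual heuristic may differ. The function name `choose_bucket_size_range`, the `tail` parameter, and the rule itself (weight each bucket size by the number of entries it holds, then trim the lowest and highest 10% of that weight) are all assumptions made for illustration. The only part taken from this description is that a value of 0 means "not provided".

```python
from collections import Counter


def choose_bucket_size_range(bucket_sizes, min_bucket_size=0,
                             max_bucket_size=0, tail=0.1):
    """Return (min_bucket_size, max_bucket_size) for one iteration.

    A value of 0 means "not provided" and is replaced by an automatic
    choice. Hypothetical rule: keep the bucket sizes that cover the
    central (1 - 2 * tail) share of all bucket entries.
    """
    # Buckets with a single entry cannot produce a candidate pair.
    histogram = Counter(size for size in bucket_sizes if size >= 2)
    if not histogram:
        raise ValueError("no bucket with at least two entries")

    # Weight each size by the total number of entries stored at that size.
    total = sum(size * count for size, count in histogram.items())

    auto_min = auto_max = None
    cumulative = 0
    for size in sorted(histogram):
        cumulative += size * histogram[size]
        if auto_min is None and cumulative >= tail * total:
            auto_min = size
        if cumulative >= (1 - tail) * total:
            auto_max = size
            break

    # Explicit (non-zero) values always win over the automatic ones.
    return (min_bucket_size or auto_min, max_bucket_size or auto_max)


# Toy bucket sizes: many singletons, a main peak at 5-7, one huge bucket.
sizes = [1] * 100 + [2] * 10 + [5] * 20 + [6] * 30 + [7] * 25 + [50]
print(choose_bucket_size_range(sizes))        # (5, 7)
print(choose_bucket_size_range(sizes, 3, 0))  # (3, 7)
```

In this sketch the small buckets at size 2 and the single bucket of size 50 fall outside the selected range, while an explicitly provided value is passed through unchanged.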

Test Plan

Ran an E. coli assembly before and after this change with MinHash.minBucketSize = 5. Verified that the relevant CSV files were identical.

Ran an E. coli assembly with MinHash.minBucketSize = 0. Verified that a new value of minBucketSize was computed and used for each MinHash iteration.

Ran an HG002 assembly with MinHash.minBucketSize = 0 and MinHash.maxBucketSize = 0. Verified that reasonable values were selected for both in every MinHash iteration.

Tested only LowHash0.

chanzuckerberg / shasta

Best effort auto-selection of MinHash.minBucketSize & MinHash.maxBucketSize if values are not provided. #182