Picking good values for minBucketSize and maxBucketSize is tricky for a new dataset or genome. If minBucketSize is set too low, poor-quality alignment candidates are let through. If it is set too high, Shasta misses some good alignment candidates. One of the first pieces of feedback given when someone assembles a genome with Shasta is to adjust minBucketSize to get better results. It turns out that simple heuristics can find a not-terrible starting value.
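To make the idea concrete, here is a minimal Python sketch of one plausible histogram-based heuristic. It is not the code in this PR: the function names, the rule "first local minimum for the lower bound, first size past the peak that drops back to that level for the upper bound", and the toy histogram are all assumptions for illustration. The only behavior taken from this description is that a configured value of 0 means "compute it automatically".

```python
from collections import Counter


def bucket_size_histogram(bucket_sizes):
    """Return a list h where h[s] is the number of buckets holding exactly s reads."""
    counts = Counter(bucket_sizes)
    return [counts.get(s, 0) for s in range(max(counts) + 1)]


def pick_bucket_size_range(histogram):
    """Hypothetical heuristic, not Shasta's actual rule.

    Lower bound: first local minimum after the initial decay of small buckets.
    Upper bound: first size past the peak whose count is no higher than the
    count at that minimum (or the last size if that never happens).
    """
    n = len(histogram)
    if n < 2:
        raise ValueError("histogram must cover at least bucket sizes 0 and 1")

    # Walk down the initial decay until the counts stop decreasing.
    lo = 1
    while lo + 1 < n and histogram[lo + 1] < histogram[lo]:
        lo += 1

    # Locate the highest point at or after the minimum.
    peak = max(range(lo, n), key=lambda s: histogram[s])

    # Walk right from the peak until the count falls back to the minimum's level.
    hi = peak
    while hi + 1 < n and histogram[hi] > histogram[lo]:
        hi += 1

    return lo, hi


def resolve(configured, computed):
    """A configured value of 0 requests the automatically computed value."""
    return computed if configured == 0 else configured


# Toy histogram (made-up numbers): index = bucket size, value = bucket count.
toy = [0, 900, 400, 120, 60, 90, 150, 200, 160, 100, 50, 20, 5]
min_size, max_size = pick_bucket_size_range(toy)
print(resolve(0, min_size), resolve(0, max_size))
```

Real bucket-size histograms are noisy, so a practical version would likely smooth the counts before searching for the minimum and the peak.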
Test Plan
Ran an E. coli assembly before and after this change with MinHash.minBucketSize = 5. Verified that the relevant CSV files were identical.
Ran an E. coli assembly by passing in MinHash.minBucketSize = 0. Verified that a new value of minBucketSize was computed and used for each iteration.
Ran an HG002 assembly by passing in MinHash.minBucketSize = 0 and MinHash.maxBucketSize = 0. Verified that reasonable values were selected for both in every MinHash iteration.
In a "healthy" assembly, the minimum and maximum in the histogram are much more pronounced than in the illustration. This has no implications for the PR, though.
Tested only LowHash0.