chanzuckerberg / shasta

[MOVED] Moved to paoloshasta/shasta. De novo assembly from Oxford Nanopore reads

Best effort auto-selection of MinHash.minBucketSize & MinHash.maxBucketSize if values are not provided. #182

Closed bagashe closed 4 years ago

bagashe commented 4 years ago

Picking good values for minBucketSize and maxBucketSize for a new data set or genome is tricky. If minBucketSize is set too low, poor-quality alignment candidates are let through. If it is set too high, Shasta misses some good alignment candidates. One of the first pieces of feedback people get after assembling a genome with Shasta is to adjust this value to get better results. It turns out that simple heuristics can find a not-terrible starting value.
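To make the idea concrete, here is a minimal sketch of one such heuristic. It is not the code in this PR. It assumes the heuristic works on a histogram of bucket sizes that falls from an initial spike to a valley and then rises to a peak, and it takes the valley as minBucketSize. The function name `suggest_bucket_sizes` and the rule for the upper cutoff (the first size past the peak where the count drops back to the valley's level) are hypothetical, and no smoothing is applied, so a noisy histogram would need extra care.

```python
from typing import List, Optional, Tuple


def suggest_bucket_sizes(histogram: List[int]) -> Optional[Tuple[int, int]]:
    """Suggest (min_bucket_size, max_bucket_size) from a bucket-size histogram.

    histogram[s] is the number of buckets that contain exactly s entries.
    Returns None when the histogram has no valley followed by a peak.
    Hypothetical helper for illustration; not Shasta's actual implementation.
    """
    n = len(histogram)
    if n < 3:
        return None

    # Walk down the initial descending slope to the first local minimum.
    # Index 0 is skipped because an empty bucket carries no information.
    valley = 1
    while valley + 1 < n and histogram[valley + 1] < histogram[valley]:
        valley += 1

    # Walk up from the valley to the local maximum that follows it.
    peak = valley
    while peak + 1 < n and histogram[peak + 1] >= histogram[peak]:
        peak += 1

    if peak == valley:
        # Monotonically decreasing histogram: nothing to anchor on.
        return None

    # Assumed rule for the upper cutoff: move right from the peak until the
    # count is no longer above the valley's count (or the histogram ends).
    upper = peak
    while upper + 1 < n and histogram[upper] > histogram[valley]:
        upper += 1

    return valley, upper


if __name__ == "__main__":
    # Synthetic histogram: spike of tiny buckets, valley at 4, peak at 7.
    example = [0, 900, 400, 100, 20, 60, 150, 200, 140, 70, 25, 10, 3]
    print(suggest_bucket_sizes(example))
```

Returning `None` for a histogram without a valley keeps the selection "best effort": the caller can fall back to explicit values rather than trust a guess made from a shape the heuristic does not understand.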

Test Plan

Ran an E. coli assembly before and after this change with MinHash.minBucketSize = 5. Verified that the relevant CSV files were identical.

Ran an E. coli assembly with MinHash.minBucketSize = 0. Verified that a new value of minBucketSize was computed and used for each iteration.

Ran an HG002 assembly with MinHash.minBucketSize = 0 and MinHash.maxBucketSize = 0. Verified that reasonable values were selected for both in every MinHash iteration.

Tested only LowHash0.
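The test plan treats a value of 0 as "not provided". A minimal sketch of that convention follows, with hypothetical names (`AUTO_SELECT`, `resolve_bucket_size`) and a caller-supplied function standing in for the per-iteration heuristic. It only illustrates the behavior being verified and does not mirror the actual Shasta code.

```python
from typing import Callable

# Sentinel used in the test plan: 0 means "no value provided, auto-select".
AUTO_SELECT = 0


def resolve_bucket_size(configured: int, auto_select: Callable[[], int]) -> int:
    """Return the configured value, or the auto-selected one when it is 0.

    auto_select is only called when needed, so an explicit setting such as
    MinHash.minBucketSize = 5 behaves exactly as it did before the change.
    """
    if configured == AUTO_SELECT:
        return auto_select()
    return configured


if __name__ == "__main__":
    # An explicit value passes through untouched; 0 triggers the heuristic.
    print(resolve_bucket_size(5, lambda: 17))
    print(resolve_bucket_size(0, lambda: 17))
```

Calling `resolve_bucket_size` once per iteration, with a heuristic that reads that iteration's histogram, matches the observation above that a new value is computed and used each time.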

paoloczi commented 4 years ago

In a "healthy" assembly, the minimum and maximum in the histogram are much more pronounced than in the illustration. This has no implications for the PR, though.