Improve Hash2Fragment by using a map to validate allowed sequence characters

matiasinsaurralde commented 10 months ago

Changes in this PR

Improves Hash2Fragment by using a map for more efficient validation of allowed sequence characters, discussion here.

Why are you making these changes?

General improvement.

Are any changes breaking? (IMPORTANT)

No

Pre-merge checklist

All of these must be satisfied before this PR is considered ready for merging. Mergeable PRs will be prioritized for review.

[ ] New packages/exported functions have docstrings.
[x] New/changed functionality is thoroughly tested.
[ ] New/changed functionality has a function giving an example of its usage in the associated test file. See primers/primers_test.go for what this might look like.
[ ] Changes are documented in CHANGELOG.md in the [Unreleased] section.
[x] All code is properly formatted and linted.
[x] The PR template is filled out.

TwFlem commented 10 months ago

I saw the small alphabet size in the PR and was curious what that threshold might be when a map performs better list for another issue I'm working on.

I created some benchmarks just to see what would happen with sequences with different possible alphabets of varying length. These were the results:

Map:

➜  poly git:(scratch) ✗ go test ./bwt  -bench=Map -benchmem
goos: linux
goarch: amd64
pkg: github.com/TimothyStiles/poly/bwt
cpu: 12th Gen Intel(R) Core(TM) i7-1260P
BenchmarkMapSmallAlphaWithNoMistakes-16                       33          36787498 ns/op          226393 B/op       1917 allocs/op
BenchmarkMapSmallAlphaWithSomeMistakes-16                   1101            951373 ns/op          966786 B/op      20057 allocs/op
BenchmarkMapSmallAlphaWithManyMistakes-16                   1791            654668 ns/op          964172 B/op      20035 allocs/op
BenchmarkMapCompleteAlphaWithNoMistakes-16                    30          41200890 ns/op          249033 B/op       2109 allocs/op
BenchmarkMapCompleteAlphaWithSomeMistakes-16                1154           1061187 ns/op          966474 B/op      20054 allocs/op
BenchmarkMapCompleteAlphaWithManyMistakes-16                1746            715031 ns/op          964279 B/op      20036 allocs/op
BenchmarkMapCompleteAlphaOopsAllMistakes-16                 1928            599078 ns/op          963875 B/op      20032 allocs/op

Contains:

➜  poly git:(scratch) ✗ go test ./bwt  -bench=Contains -benchmem
goos: linux
goarch: amd64
pkg: github.com/TimothyStiles/poly/bwt
cpu: 12th Gen Intel(R) Core(TM) i7-1260P
BenchmarkContainsSmallAlphaWithNoMistakes-16                 109          10461112 ns/op           68541 B/op        580 allocs/op
BenchmarkContainsSmallAlphaWithSomeMistakes-16              1536            756558 ns/op          964864 B/op      20041 allocs/op
BenchmarkContainsSmallAlphaWithManyMistakes-16              1892            630320 ns/op          963949 B/op      20033 allocs/op
BenchmarkContainsCompleteAlphaWithNoMistakes-16               96          12359970 ns/op           77822 B/op        659 allocs/op
BenchmarkContainsCompleteAlphaWithSomeMistakes-16           1490            831489 ns/op          965014 B/op      20042 allocs/op
BenchmarkContainsCompleteAlphaWithManyMistakes-16           1742            642530 ns/op          964289 B/op      20036 allocs/op
BenchmarkContainsCompleteAlphaOopsAllMistakes-16            2058            554534 ns/op          963630 B/op      20030 allocs/op

Contains happened to perform better in these bench marks in a significant way. I wonder if caching starts to come into play with all the different data involved. Including the benchmark code and cpu profile in the comment.

Also, Contains seems to be more performant in the Small/Complete alphabet with no mistakes case, which seems like it would be the most common case.

*Note: updated because because I modified the code to call something that looks like HashToFragment and also saw the the benchmark was inlining the contains. Now that the no inline is gone, the worst case with BenchmarkContainsCompleteAlphaOopsAllMistakes performing worse than the map equivalent makes more sense. Before it was performing wayyyyyyyy better which was a little weird.

TwFlem commented 10 months ago

cpu profile (ran them together for the profile. They were ran separately for the results above)

cpu map-vs-contains inseqHashFn

code

Koeng101 commented 10 months ago

Also, Contains seems to be more performant in the Small/Complete alphabet with no mistakes case, which seems like it would be the most common case.

Interesting! @matiasinsaurralde Do you have any thoughts on this? The use case of no-mistakes is definitely the most common by an order of magnitude or more

matiasinsaurralde commented 10 months ago

@Koeng101 Agree with that, feel free to discard/ignore my suggestion

Koeng101 commented 10 months ago

Closing pull request, because it seems benchmarks show better efficiency with a Contains with our particular use case.

bebop / poly