eliben / go-sentencepiece

Go implementation of the SentencePiece tokenizer
Apache License 2.0

optimize encoder #6

Closed · josharian closed this 1 month ago

josharian commented 2 months ago

This is a grab-bag of optimizations. I recommend reviewing commit-by-commit and rebasing instead of squashing.

Their cumulative effect, on my laptop, for an out-of-tree benchmark (sorry) is:

```
goos: darwin
goarch: arm64
pkg: bold.dev/tknz/gemma2b
cpu: Apple M3 Max
          │      a      │                  f                  │
          │   sec/op    │   sec/op     vs base                │
PureGo-16   3.494m ± 6%   2.642m ± 1%  -24.39% (p=0.000 n=15)

          │       a       │                  f                   │
          │     B/op      │     B/op      vs base                │
PureGo-16   1858.5Ki ± 0%   824.0Ki ± 0%  -55.66% (p=0.000 n=15)

          │       a       │                 f                  │
          │   allocs/op   │ allocs/op   vs base                │
PureGo-16   7574.000 ± 0%   4.000 ± 0%  -99.95% (p=0.000 n=15)
```
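For readers wondering what kind of change takes allocs/op from ~7.5k down to 4: a common pattern in tokenizer hot paths is reusing scratch buffers across calls instead of allocating per token. The sketch below illustrates that pattern in general; it is written fresh for illustration and is not the PR's actual diff.

```go
package tokenize

import "sync"

// scratchPool holds reusable scratch slices. Pooling a *pointer* to the
// slice (rather than the slice itself) avoids an extra interface
// allocation on each Put.
var scratchPool = sync.Pool{
	New: func() any {
		s := make([]int, 0, 1024)
		return &s
	},
}

// EncodeIDs is a stand-in for an encoder hot path; the real merge loop
// is elided.
func EncodeIDs(text string) []int {
	sp := scratchPool.Get().(*[]int)
	scratch := (*sp)[:0]

	_ = text // the real encoding loop would append token IDs into scratch

	// Copy only the final result out, so the scratch buffer can be
	// returned to the pool and reused by the next call.
	out := append([]int(nil), scratch...)
	*sp = scratch
	scratchPool.Put(sp)
	return out
}
```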

I also will understand if these are viewed as too intrusive/complicated for this codebase. :) I am happy to maintain a fork as needed.

eliben commented 2 months ago

Thanks, this is interesting. I'll find time to review these.

I do wonder, however, about your use case. Can you elaborate where higher encoding performance is important to you, or is this just for fun? LLMs are generally pretty slow; in recent comparisons I've seen, providers brag that their models reach something like 200 tokens/second (and that's probably on the beefiest hardware). In comparison, unless I've messed up my benchmark, go-sentencepiece encodes at well over half a million tokens/sec, so it's hard for me to imagine a situation where this tokenization is a bottleneck.
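For reference, a tokens/sec figure like that can be measured with a standard Go benchmark. The sketch below assumes the constructor and Encode method shown in the package README (NewProcessorFromPath, Encode) and a MODELPATH environment variable pointing at a SentencePiece model file; verify against the current API before relying on it.

```go
package tokenize_test

import (
	"os"
	"strings"
	"testing"

	"github.com/eliben/go-sentencepiece"
)

func BenchmarkEncodeThroughput(b *testing.B) {
	// MODELPATH locating the model file is an assumption of this sketch.
	proc, err := sentencepiece.NewProcessorFromPath(os.Getenv("MODELPATH"))
	if err != nil {
		b.Fatal(err)
	}
	text := strings.Repeat("The quick brown fox jumps over the lazy dog. ", 100)

	var tokens int
	b.ResetTimer()
	for i := 0; i < b.N; i++ {
		tokens += len(proc.Encode(text))
	}
	// Report throughput directly, alongside the usual ns/op.
	b.ReportMetric(float64(tokens)/b.Elapsed().Seconds(), "tokens/sec")
}
```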

josharian commented 2 months ago

> I'll find time to review these

Great, thanks. I made some decisions along the way with less than 100% confidence. Pushback and questions are always welcome.

> Can you elaborate where higher encoding performance is important to you

As part of exploration and dataset preparation, I end up doing mass tokenization runs, so I really feel the performance: it directly impacts my iteration speed as I'm hacking around.
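A sketch of that mass-tokenization shape, under the same assumed API as the benchmark above. Whether Encode is safe for concurrent use is also an assumption here; if it is not, give each worker its own Processor.

```go
package tokenize

import (
	"runtime"
	"sync"

	"github.com/eliben/go-sentencepiece"
)

// tokenizeAll fans a corpus across workers; at mass-tokenization scale,
// encoder speed translates directly into wall-clock iteration speed.
func tokenizeAll(proc *sentencepiece.Processor, docs []string) [][]sentencepiece.Token {
	out := make([][]sentencepiece.Token, len(docs))
	jobs := make(chan int)
	var wg sync.WaitGroup
	for w := 0; w < runtime.GOMAXPROCS(0); w++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for i := range jobs {
				// Each index is written by exactly one worker.
				out[i] = proc.Encode(docs[i])
			}
		}()
	}
	for i := range docs {
		jobs <- i
	}
	close(jobs)
	wg.Wait()
	return out
}
```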

I'm also not working with frontier models (Gemma, not Gemini). Some of my work involves squeezing out extra performance, and our use case is quite latency-sensitive.

I also have a habit of occasionally taking a day to crush any boxes I can in pprof. Cumulatively, that adds up.
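(For anyone unfamiliar with that workflow: the "boxes" are the nodes in pprof's graph view. The standard Go toolchain invocation looks like this, using the hypothetical benchmark name from the sketch above:

```
go test -bench=BenchmarkEncodeThroughput -cpuprofile=cpu.out
go tool pprof -http=:8080 cpu.out
```

Swapping -cpuprofile for -memprofile=mem.out gives the equivalent view for allocations, which is where allocs/op wins like the ones above would show up.)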

Anyway, I put in this time to get to rough parity with the cgo implementation, at least for my use cases, so that I can switch to pure Go without guilt. Plus it was fun. :)

josharian commented 2 months ago

I'll push up a revised copy once the question above is answered, presuming you're happy with my other comments. Sorry for the delay; it's been a busy couple of days.