daulet / tokenizers

Go bindings for HuggingFace Tokenizer
MIT License

Performance regression #14

Open daulet opened 1 year ago

daulet commented 1 year ago

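We've regressed quite a bit in benchmarks since the initial release: decoding is roughly 2.7x to 3.4x slower, and EncodeNTimes now makes three times as many allocations (4 to 12 allocs/op).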

benchstat benchmarks/3188ded27885d1002698a0e25f0b32306c430e88.txt benchmarks/$(git rev-parse HEAD).txt
goos: darwin
goarch: arm64
pkg: github.com/daulet/tokenizers
                 │ benchmarks/3188ded27885d1002698a0e25f0b32306c430e88.txt │ benchmarks/38a9a14c1c56b113461b0c7350c72de949e23cc2.txt │
                 │                         sec/op                          │             sec/op               vs base                │
EncodeNTimes-10                                               11.99µ ±  3%                     13.11µ ±   1%    +9.39% (p=0.002 n=6)
EncodeNChars-10                                               2.584n ±  8%                     2.989n ± 272%         ~ (p=0.485 n=6)
DecodeNTimes-10                                               1.701µ ±  3%                     4.535µ ±   2%  +166.66% (p=0.002 n=6)
DecodeNTokens-10                                              193.6n ± 10%                     656.1n ±   3%  +238.78% (p=0.002 n=6)
geomean                                                       317.8n                           584.3n          +83.86%

                 │ benchmarks/3188ded27885d1002698a0e25f0b32306c430e88.txt │ benchmarks/38a9a14c1c56b113461b0c7350c72de949e23cc2.txt │
                 │                          B/op                           │             B/op               vs base                  │
EncodeNTimes-10                                               84.00 ± 0%                       232.00 ± 0%  +176.19% (p=0.002 n=6)
EncodeNChars-10                                               0.000 ± 0%                        0.000 ± 0%         ~ (p=1.000 n=6) ¹
DecodeNTimes-10                                               96.00 ± 0%                        96.00 ± 0%         ~ (p=1.000 n=6) ¹
DecodeNTokens-10                                              7.000 ± 0%                        7.000 ± 0%         ~ (p=1.000 n=6) ¹
geomean                                                                  ²                                   +28.91%               ²
¹ all samples are equal
² summaries must be >0 to compute geomean

                 │ benchmarks/3188ded27885d1002698a0e25f0b32306c430e88.txt │ benchmarks/38a9a14c1c56b113461b0c7350c72de949e23cc2.txt │
                 │                        allocs/op                        │           allocs/op            vs base                  │
EncodeNTimes-10                                               4.000 ± 0%                       12.000 ± 0%  +200.00% (p=0.002 n=6)
EncodeNChars-10                                               0.000 ± 0%                        0.000 ± 0%         ~ (p=1.000 n=6) ¹
DecodeNTimes-10                                               3.000 ± 0%                        3.000 ± 0%         ~ (p=1.000 n=6) ¹
DecodeNTokens-10                                              0.000 ± 0%                        0.000 ± 0%         ~ (p=1.000 n=6) ¹
geomean                                                                  ²                                   +31.61%               ²
¹ all samples are equal
² summaries must be >0 to compute geomean

daulet commented 1 year ago

CC @clems4ever @RJKeevil in case you'd be interested in looking into this.

daulet commented 4 months ago

I actually root-caused it to this commit in the upstream library.