daulet / tokenizers

Go bindings for HuggingFace Tokenizer
MIT License

support more attributes from the Encoding structure #5

Closed. clems4ever closed this 1 year ago

clems4ever commented 1 year ago

MiniLM requires the attention mask to perform the mean pooling operation, as can be seen in the model card at https://huggingface.co/sentence-transformers/all-MiniLM-L6-v2, so the bindings need to expose it from the Encoding structure.
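For context, here is a minimal sketch of why the mask matters for mean pooling: padded positions have to be left out of the average, otherwise they skew the sentence embedding. The helper `meanPool` is hypothetical and not part of this package. The `[]uint32` mask type and the `[][]float32` embedding layout are assumptions made for illustration. Unlike the reference Python code, which clamps the denominator, this sketch returns an error when the mask is all zeros.

```go
package main

import (
	"errors"
	"fmt"
)

// meanPool averages token embeddings over the positions where the attention
// mask is non-zero, so padding tokens do not contribute to the result.
// Hypothetical helper for illustration; not part of the tokenizers package.
func meanPool(tokenEmbeddings [][]float32, attentionMask []uint32) ([]float32, error) {
	if len(tokenEmbeddings) != len(attentionMask) {
		return nil, errors.New("embeddings and attention mask differ in length")
	}
	if len(tokenEmbeddings) == 0 {
		return nil, errors.New("no tokens to pool")
	}

	dim := len(tokenEmbeddings[0])
	sum := make([]float32, dim)
	var count float32

	for i, emb := range tokenEmbeddings {
		if attentionMask[i] == 0 {
			continue // padding position: skip it
		}
		if len(emb) != dim {
			return nil, errors.New("token embeddings have inconsistent dimensions")
		}
		for j, v := range emb {
			sum[j] += v
		}
		count++
	}

	if count == 0 {
		return nil, errors.New("attention mask is all zeros")
	}
	for j := range sum {
		sum[j] /= count
	}
	return sum, nil
}

func main() {
	// Three token positions with 2-dimensional embeddings; the last is padding.
	embeddings := [][]float32{{1, 2}, {3, 4}, {100, 100}}
	mask := []uint32{1, 1, 0}

	pooled, err := meanPool(embeddings, mask)
	if err != nil {
		panic(err)
	}
	fmt.Println(pooled) // prints "[2 3]"
}
```

Without the mask, the padded third row would be averaged in and dominate the result, which is why the tokenizer output needs to carry the mask alongside the token IDs.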

clems4ever commented 1 year ago

Benchmarks before this change:

$ go test . -bench=. -benchmem -benchtime=10s
goos: linux
goarch: amd64
pkg: github.com/daulet/tokenizers
cpu: 11th Gen Intel(R) Core(TM) i7-1185G7 @ 3.00GHz
BenchmarkEncodeNTimes-8           885213             13053 ns/op             224 B/op         11 allocs/op
BenchmarkEncodeNChars-8         1000000000               2.351 ns/op           0 B/op          0 allocs/op
BenchmarkDecodeNTimes-8          2108638              5758 ns/op              96 B/op          3 allocs/op
BenchmarkDecodeNTokens-8        15591064               761.3 ns/op             7 B/op          0 allocs/op
PASS
ok      github.com/daulet/tokenizers    59.096s

Benchmarks after this change:

$ go test . -bench=. -benchmem -benchtime=10s
goos: linux
goarch: amd64
pkg: github.com/daulet/tokenizers
cpu: 11th Gen Intel(R) Core(TM) i7-1185G7 @ 3.00GHz
BenchmarkEncodeNTimes-8           935011             12774 ns/op             232 B/op         12 allocs/op
BenchmarkEncodeNChars-8         1000000000               1.962 ns/op           0 B/op          0 allocs/op
BenchmarkDecodeNTimes-8          2098053              5676 ns/op              96 B/op          3 allocs/op
BenchmarkDecodeNTokens-8        15740354               742.0 ns/op             7 B/op          0 allocs/op
PASS
ok      github.com/daulet/tokenizers    57.765s