daulet / tokenizers

Go bindings for HuggingFace Tokenizer
MIT License

memory issues when using tokenizers #19

Closed homily707 closed 1 month ago

homily707 commented 3 months ago

While using the tokenizer, I've observed memory consumption rising continuously. Based on the discussion in https://github.com/golang/go/issues/53440 and the write-up at https://dgraph.io/blog/post/manual-memory-management-golang-jemalloc/, the cause appears to be glibc not returning freed memory to the operating system promptly.

Considering this, I'm curious: is there any possibility that tokenizers might be adapted to utilize alternative memory allocators like jemalloc or tcmalloc in the future?

daulet commented 3 months ago

Before we discuss memory allocators, do you have a repro to examine? The tokenizer config is the biggest memory consumer, and you can release that memory with Tokenizer.Close(). Are you reusing a single tokenizer struct, or are you creating multiple instances?
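
One possible starting point for such a repro, offered as a rough sketch and not as anything from this repository: since the tokenizer's memory is allocated on the native side through cgo, Go's own heap statistics would not account for it, so sampling the process's resident set size is likely a more useful signal. The helper names `parseVmRSS` and `currentRSS` are made up for illustration, the code assumes Linux (`/proc/self/status`), and the loop body is a placeholder for the actual tokenizer workload.

```go
package main

import (
	"bufio"
	"fmt"
	"os"
	"strconv"
	"strings"
)

// parseVmRSS (hypothetical helper) extracts the VmRSS value, in kB,
// from the text of a Linux /proc/<pid>/status file.
func parseVmRSS(status string) (int, error) {
	sc := bufio.NewScanner(strings.NewReader(status))
	for sc.Scan() {
		fields := strings.Fields(sc.Text())
		// Expected line shape: "VmRSS:     123456 kB"
		if len(fields) >= 2 && fields[0] == "VmRSS:" {
			return strconv.Atoi(fields[1])
		}
	}
	if err := sc.Err(); err != nil {
		return 0, err
	}
	return 0, fmt.Errorf("VmRSS not found")
}

// currentRSS (hypothetical helper) reads this process's resident set
// size in kB. Linux only: it depends on /proc/self/status.
func currentRSS() (int, error) {
	b, err := os.ReadFile("/proc/self/status")
	if err != nil {
		return 0, err
	}
	return parseVmRSS(string(b))
}

func main() {
	const iterations = 5
	for i := 0; i < iterations; i++ {
		// Placeholder: run the tokenizer workload under test here,
		// e.g. create or reuse a tokenizer and encode some input.

		rss, err := currentRSS()
		if err != nil {
			fmt.Println("cannot read RSS:", err)
			return
		}
		fmt.Printf("iteration %d: VmRSS = %d kB\n", i, rss)
	}
}
```

Running this once with a single tokenizer reused across iterations, and once with a new instance per iteration (with and without Tokenizer.Close()), should help show whether the growth comes from unreleased instances or from allocator behavior.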