Closed: max-fry-apps closed this issue 1 year ago.
Hey @max-fry-apps, try the latest version from GitHub (`gem "tokenizers", github: "ankane/tokenizers-ruby"`, shown as a Gemfile entry below) to see if that makes a difference. If not:
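The suggestion above as a Gemfile entry (standard Bundler syntax for a gem sourced from GitHub):

```ruby
# Gemfile
gem "tokenizers", github: "ankane/tokenizers-ruby"
```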
@ankane I've found a proper tokenizer for my purpose. It turned out to be `gpt-neox-20b`. I use it like this: `Tokenizers.from_pretrained("EleutherAI/gpt-neox-20b")`, and it works perfectly. Thanks!
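A minimal end-to-end sketch of that usage (the `encode`/`decode` calls and the `tokens`/`ids` accessors follow the gem's README; the token split in the comments is the expected output discussed in this thread, not a verified run):

```ruby
require "tokenizers"

# Load the GPT-NeoX-20B tokenizer from the Hugging Face Hub
tokenizer = Tokenizers.from_pretrained("EleutherAI/gpt-neox-20b")

# Round trip: encode a string, inspect the tokens, decode back
encoding = tokenizer.encode("Hello World")
encoding.tokens                 # expected: ["Hello", "ĠWorld"]
tokenizer.decode(encoding.ids)  # expected: "Hello World"
```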
First of all, thanks for such a great gem.
I'm trying to use it for my hobby project and haven't had much luck so far. I have a `pile_tokenizer.json` file (it bundles the `vocab` and `merges` in one tokenizer definition). I tried to load it directly (sketch below), but it seems like something is wrong with its format.
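A sketch of that attempt, assuming the `Tokenizers.from_file` loader the gem documents for a full `tokenizer.json`:

```ruby
require "tokenizers"

# Load a complete tokenizer definition (vocab, merges, and config) from one JSON file
tokenizer = Tokenizers.from_file("pile_tokenizer.json")
```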
I also tried to extract `vocab` and `merges` into separate files, `vocab.json` and `merges.txt`, and use them as described in the README (see the sketch after this paragraph). While it decodes individual tokens well, I struggle to make it encode a string properly.
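A sketch of the README approach, assuming the `Tokenizers::CharBPETokenizer` API and the outputs observed in this report:

```ruby
require "tokenizers"

# Character-level BPE tokenizer built from the extracted vocab/merges files
tokenizer = Tokenizers::CharBPETokenizer.new("vocab.json", "merges.txt")

encoding = tokenizer.encode("Hello World")
encoding.tokens                 # observed: ["hell", "og", "worl", "<unk>"]
tokenizer.decode(encoding.ids)  # decoding individual tokens works as expected
```

One hedged note: the `Ġ` prefix in the vocab suggests a byte-level, GPT-2-style BPE, while `CharBPETokenizer` is character-level (and evidently lowercasing here), which would explain both the lowercased pieces and the `<unk>` tokens.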
For example, when I try to encode `Hello World`, which is represented as `HelloĠWorld`, I expect to get the tokens `["Hello", "ĠWorld"]`, but instead I get `["hell", "og", "worl", "<unk>"]`. For longer strings, it returns a lot of `<unk>` tokens, which reminds me of what I see in this test. If you could help me understand what I'm doing wrong, I would appreciate it.