ankane / tokenizers-ruby

Fast state-of-the-art tokenizers for Ruby
Apache License 2.0

Issue with CharBPETokenizer and pile_tokenizer.json #25

Closed · max-fry-apps closed this issue 1 year ago

max-fry-apps commented 1 year ago

First of all, thanks for such a great gem.

I'm trying to use it for a hobby project and haven't had much luck so far. I have a pile_tokenizer.json file which looks like this:

{
    "addedTokens": {
        "<|endoftext|>": 0,
        "<|padding|>": 1,
        "        ": 50254,
        "    ": 50255,
        "  ": 50256
    },
    "vocab": {
        "<|endoftext|>": 0,
        "<|padding|>": 1,
        "!": 2,
        "\"": 3,
        "#": 4,
        "$": 5,
        "%": 6,
        "&": 7,
...
    "merges": [
        "Ġ Ġ",
        "Ġ t",
        "Ġ a",
        "h e",
        "i n",
        "r e",
        "o n",
        "ĠĠ ĠĠ",
        "Ġt he",
        "e r",
        "a t",
...
    ]
}

I tried to use it like this:

 tokenizer = Tokenizers.from_file('pile_tokenizer.json')

But it seems like something is wrong with its format.

I also tried to extract vocab and merges into separate files, vocab.json and merges.txt, and use them as described in the README:

tokenizer = Tokenizers::CharBPETokenizer.new("vocab.json", "merges.txt")

While it decodes individual tokens well, I struggle to make it encode a string properly.

For example, when I try to encode Hello World, which is represented as HelloĠWorld, I expect to get these tokens: ["Hello", "ĠWorld"], but instead I get ["hell", "og", "worl", "<unk>"].
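
For reference, this is roughly what I'm running (following the encode / tokens / decode calls from the README; vocab.json and merges.txt are the files from my extraction above):

    require "tokenizers"

    # vocab.json and merges.txt were extracted from pile_tokenizer.json
    tokenizer = Tokenizers::CharBPETokenizer.new("vocab.json", "merges.txt")

    encoded = tokenizer.encode("Hello World")
    encoded.tokens                # expected ["Hello", "ĠWorld"], actual ["hell", "og", "worl", "<unk>"]
    tokenizer.decode(encoded.ids) # decoding individual token ids works fine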

For longer strings, it returns a lot of <unk> tokens. It reminds me of what I see in this test:

expected_tokens = ["<unk>", "ca", "<unk>", "fee", "<unk>", "th", "<unk>", "m", "agi", "<unk>", "<unk>", "ca", "<unk>", "yo", "<unk>", "<unk>"]

If you could help me understand what I'm doing wrong, I would appreciate it.

max-fry-apps commented 1 year ago

@ankane, @petergoldstein

ankane commented 1 year ago

Hey @max-fry-apps, try the latest version from GitHub (gem "tokenizers", github: "ankane/tokenizers-ruby") to see if that makes a difference. If not:

  1. How was the tokenizer file generated?
  2. Are you seeing the same results with the Tokenizers Python library?
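
In case it helps, the Gemfile entry for the GitHub version is just:

    # Gemfile: use the latest code from GitHub instead of the released gem
    gem "tokenizers", github: "ankane/tokenizers-ruby"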
max-fry-apps commented 1 year ago

@ankane

I've found a proper tokenizer for my purpose. It turned out to be gpt-neox-20b. I use it like this: Tokenizers.from_pretrained("EleutherAI/gpt-neox-20b"), and it works perfectly.
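
For anyone who finds this later, my working setup looks roughly like this (same encode/decode calls as before):

    require "tokenizers"

    # load the GPT-NeoX-20B tokenizer from the Hugging Face Hub
    tokenizer = Tokenizers.from_pretrained("EleutherAI/gpt-neox-20b")

    encoded = tokenizer.encode("Hello World")
    encoded.tokens                # tokenizes as I expected
    tokenizer.decode(encoded.ids) # should round-trip back to the original string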

Thanks!