ceifa / tiktoken-node

OpenAI's tiktoken but with node bindings

How does this compare to @dqbd/tiktoken #12

Closed · transitive-bullshit closed this 6 months ago

transitive-bullshit commented 1 year ago

https://github.com/dqbd/tiktoken

For reference, I previously tested a bunch of Node.js tokenizers for accuracy and perf here: https://github.com/transitive-bullshit/compare-tokenizers
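
For quick context, both libraries expose a very similar encode API. A minimal sketch of what basic usage looks like (the tiktoken-node import shape is an assumption based on the snippets further down in this thread):

```ts
import { get_encoding } from '@dqbd/tiktoken'
import TiktokenNode from 'tiktoken-node' // import shape assumed

// @dqbd/tiktoken (WASM): the encoder holds WASM memory, so free it when done.
const wasmEncoder = get_encoding('gpt2')
console.log(wasmEncoder.encode('hello world'))
wasmEncoder.free()

// tiktoken-node (native/NAPI): no explicit free step in the snippets below.
const nativeEncoder = TiktokenNode.getEncoding('gpt2')
console.log(nativeEncoder.encode('hello world'))
```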

cc @dqbd

Thanks! 🙏

dqbd commented 1 year ago

Disclaimer: I'm working on https://github.com/dqbd/tiktoken

Hi! I have extended the compare-tokenizers benchmark to include tiktoken-node and directly compare @dqbd/tiktoken against tiktoken-node:

Tested on M1 Pro (arm64), 16 GB memory, Node v19.8.1.

| (index) | Task Name | Average Time (ms) | Variance (ms) |
| --- | --- | --- | --- |
| 0 | 'gpt3-tokenizer' | 27444 | 78653 |
| 1 | 'gpt-3-encoder' | 16506 | 62548 |
| 2 | '@dqbd/tiktoken gpt2' | 4194 | 626 |
| 3 | 'tiktoken-node gpt2' | 3923 | 53 |
| 4 | '@dqbd/tiktoken text-davinci-003' | 4117 | 58 |
| 5 | 'tiktoken-node text-davinci-003' | 3840 | 61 |

(I reordered the @dqbd/tiktoken and tiktoken-node rows for clarity; the PR can be found here: https://github.com/transitive-bullshit/compare-tokenizers/pull/1)

As we can see, tiktoken-node can be faster than @dqbd/tiktoken, but as far as I have measured, not 5-6x faster as claimed (https://github.com/openai/tiktoken/issues/22#issuecomment-1472901919).
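
For reference, the baseline entries above create each encoder once and reuse it across iterations, roughly like this sketch (the exact harness lives in compare-tokenizers; this is only an approximation of its shape):

```ts
// Baseline shape: encoders are created once, outside the measured function,
// so only encode() itself is timed.
const tiktokenGpt2 = get_encoding('gpt2')
const tiktokenNodeGpt2 = TiktokenNode.getEncoding('gpt2')

const baselineTasks = [
  {
    label: '@dqbd/tiktoken gpt2',
    encode: (i: string) => tiktokenGpt2.encode(i)
  },
  {
    label: 'tiktoken-node gpt2',
    encode: (i: string) => tiktokenNodeGpt2.encode(i)
  }
]
```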


I've considered some other cases as well, in case I've missed something:

What if we create a new instance for every iteration (out of 25)?

```ts
{
  label: '@dqbd/tiktoken gpt2',
  encode: (i: string) => {
    const tiktokenGpt2 = get_encoding('gpt2')
    const result = tiktokenGpt2.encode(i)
    tiktokenGpt2.free()
    return result
  },
},
{
  label: 'tiktoken-node gpt2',
  encode: (i: string) => {
    const tiktokenNode = TiktokenNode.getEncoding('gpt2')
    return tiktokenNode.encode(i)
  },
}
```

| (index) | Task Name | Average Time (ms) | Variance (ms) |
| --- | --- | --- | --- |
| 0 | '@dqbd/tiktoken gpt2' | 227556 | 49932 |
| 1 | '@dqbd/tiktoken text-davinci-003' | 219893 | 12080 |
| 2 | 'tiktoken-node gpt2' | 287464 | 241397 |
| 3 | 'tiktoken-node text-davinci-003' | 303615 | 690935 |
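
Part of the gap in this per-instance case is the one-time WASM setup plus the explicit free() on every call. A small sketch of how a caller could amortize that by caching encoders (the cache is not part of either library's API, just an illustration):

```ts
import { get_encoding } from '@dqbd/tiktoken'

// Cache encoders by name so repeated calls pay the WASM setup cost only once.
const encoderCache = new Map<string, ReturnType<typeof get_encoding>>()

function encodeCached(encoding: Parameters<typeof get_encoding>[0], text: string) {
  let encoder = encoderCache.get(encoding)
  if (!encoder) {
    encoder = get_encoding(encoding)
    encoderCache.set(encoding, encoder)
  }
  return encoder.encode(text)
}

// WASM memory is not garbage collected, so free the cached encoders when done:
// for (const encoder of encoderCache.values()) encoder.free()
```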

What if we add a much larger fixture in the test suite?

```ts
fixtures.push(fixtures[fixtures.length - 1].repeat(100))
```

| (index) | Task Name | Average Time (ms) | Variance (ms) |
| --- | --- | --- | --- |
| 0 | '@dqbd/tiktoken gpt2' | 240056 | 33911 |
| 1 | '@dqbd/tiktoken text-davinci-003' | 238467 | 58049 |
| 2 | 'tiktoken-node gpt2' | 231459 | 114094 |
| 3 | 'tiktoken-node text-davinci-003' | 228842 | 126852 |

Update 9/4/2023: Ran the tests again with iterations: 25

Maybe I'm missing something else here? Would it be possible for you to share your benchmarks as well so we can compare, @ceifa?


Functionality-wise, @dqbd/tiktoken supports more environments (Edge Functions: with ./lite we even fit within the 1 MB limit) and platforms (browsers, where only WASM is supported). There is some merit in the NAPI approach though, as parallelisation is actually supported in NAPI, so it might be useful to dig deeper into it.
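
For the parallelisation point, a rough sketch of what that could look like with Node's worker_threads and one native encoder per worker (assumes an ES module context; the tiktoken-node import shape and the encode() return type are assumptions based on the snippet above):

```ts
import { Worker, isMainThread, parentPort, workerData } from 'node:worker_threads'

if (isMainThread) {
  const texts = ['first document', 'second document', 'third document']

  // Spawn one worker per text; each worker encodes independently, so the
  // native encoder can run on several threads at once.
  const results = await Promise.all(
    texts.map(
      (text) =>
        new Promise<number[]>((resolve, reject) => {
          const worker = new Worker(new URL(import.meta.url), { workerData: text })
          worker.once('message', resolve)
          worker.once('error', reject)
        })
    )
  )
  console.log(results.map((tokens) => tokens.length))
} else {
  // Each worker builds its own encoder (import shape assumed, as above).
  const TiktokenNode = await import('tiktoken-node')
  const encoder = TiktokenNode.getEncoding('gpt2')
  parentPort!.postMessage(encoder.encode(workerData as string))
}
```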

ceifa commented 1 year ago

Hey @dqbd 👋 I ran my benchmark inside a simple project I was building, so yes, my results may well be inaccurate. Your project definitely has many more features and is much better maintained than mine!

Thanks for adding my project to @transitive-bullshit's benchmark; now I can try to improve it with things like parallelism, as you said. I think we can work together to deliver the best of both worlds, since my project can sometimes be faster and yours can run anywhere.