dqbd / tiktoken

JS port and JS/WASM bindings for openai/tiktoken
MIT License

irrecoverable crash after multiple calls to encoding_for_model() #35

Closed: nsbradford closed this issue 1 year ago

nsbradford commented 1 year ago

(JS) Tiktoken consistently and irrecoverably crashes if you call encoding_for_model() too many times. If you have a long-running process, you may need to instantiate encodings many times, and creating a global or passing around a single encoding is not a good solution.

import { encoding_for_model } from '@dqbd/tiktoken';

// Fails. Iteration ~188 is noticeably slower, and the next iteration crashes outright:
for (let i = 0; i < 1000; i++) {
  console.log(`Iteration ${i}...`);
  const encoding = encoding_for_model('gpt-4'); // this call fails
  const result = encoding.encode('Hello, world!');
}

Error:

TypeError: The encoded data was not valid for encoding utf-8

  at getStringFromWasm0 (../../node_modules/@dqbd/tiktoken/tiktoken_bg.cjs:111:30)
  at Object.<anonymous>.module.exports.__wbindgen_error_new (../../node_modules/@dqbd/tiktoken/tiktoken_bg.cjs:398:27)
  at null.<anonymous> (wasm:/wasm/00b63e2e:1:168944)
  at Object.<anonymous>.module.exports.encoding_for_model (../../node_modules/@dqbd/tiktoken/tiktoken_bg.cjs:175:14)
  at Object.<anonymous> (src/stdlib/utils/tokenize.test.ts:19:42)

Tiktoken works fine if I instantiate an encoding only once:

import { encoding_for_model } from '@dqbd/tiktoken';

// This works fine:
const encoding = encoding_for_model('gpt-4');
for (let i = 0; i < 1000; i++) {
  console.log(`Iteration ${i}...`);
  const result = encoding.encode('Hello, world!');
}

Simply catching the error and retrying doesn't help either. Every later call to encoding_for_model() either fails with the same error or, if it does return, the subsequent encode() call throws an "unreachable" error:

// Trying to catch the error doesn't help; all future calls will immediately crash
for (let i = 0; i < 200; i++) {
  try {
    console.log(`Iteration ${i}: call encoding_for_model()...`);
    const encoding = encoding_for_model('gpt-4');
    console.log(`Iteration ${i}: call encode()...`);
    const result = encoding.encode('Hello, world!');
  } catch (e) {
    console.log('caught error!');
    console.log(e);
  }
}

Error:

RuntimeError: unreachable
    at wasm://wasm/00b63e2e:wasm-function[573]:0x6b4e6
    at wasm://wasm/00b63e2e:wasm-function[680]:0x70eb1
    at wasm://wasm/00b63e2e:wasm-function[767]:0x71fe8
    at wasm://wasm/00b63e2e:wasm-function[236]:0x5cafe
    at wasm://wasm/00b63e2e:wasm-function[200]:0x4e365
    at wasm://wasm/00b63e2e:wasm-function[34]:0x1f88d
    at wasm://wasm/00b63e2e:wasm-function[154]:0x48bac
    at Tiktoken.encode (/Users/nickbradford/dev/rewriter/node_modules/@dqbd/tiktoken/tiktoken_bg.cjs:257:18)
    at Object.<anonymous> (/Users/nickbradford/dev/rewriter/packages/sdk/src/stdlib/utils/tokenize.test.ts:51:29)

Running on an M2 MacBook Pro. tiktoken-node does not appear to have this issue (though it has a separate crash that prevents creating many instances: https://github.com/ceifa/tiktoken-node/issues/15).

dqbd commented 1 year ago

Hello @nsbradford!

I think the issue here stems from the lack of free() when creating encoders in a loop. The following code should work.

import { encoding_for_model } from '@dqbd/tiktoken';

for (let i = 0; i < 1000; i++) {
  console.log(`Iteration ${i}...`);
  const encoding = encoding_for_model('gpt-4'); // creates a new encoder each iteration
  const result = encoding.encode('Hello, world!');
  encoding.free();
}

Future versions will attempt to address the issue by using WeakRefs, so that the encoder will unload itself, but that is not the case at the moment.
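
Until then, one way to make sure free() always runs, even when encode() throws, is a small try/finally wrapper. This is only a sketch and not part of @dqbd/tiktoken: the `Freeable` interface and the `withFreeable` helper are hypothetical names introduced here, and the only assumption about the library is that an encoder exposes free(), as in the loop above.

```typescript
// Hypothetical interface: anything that must be released explicitly,
// such as an encoder returned by encoding_for_model().
interface Freeable {
  free(): void;
}

// Hypothetical helper: create a resource, use it synchronously,
// and release it whether or not the callback throws.
function withFreeable<T extends Freeable, R>(
  create: () => T,
  use: (resource: T) => R,
): R {
  const resource = create();
  try {
    return use(resource);
  } finally {
    resource.free(); // runs on both the success and the error path
  }
}

// Intended usage with tiktoken (commented out so the sketch has no dependencies):
// import { encoding_for_model } from '@dqbd/tiktoken';
// const tokens = withFreeable(
//   () => encoding_for_model('gpt-4'),
//   (encoding) => encoding.encode('Hello, world!'),
// );
```

The wrapper only covers synchronous callbacks: if `use` returned a Promise, `finally` would run before that Promise settled and the encoder would be freed too early.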

nsbradford commented 1 year ago

Thanks @dqbd - verified free() works.