botisan-ai / gpt3-tokenizer

Isomorphic JavaScript/TypeScript Tokenizer for GPT-3 and Codex Models by OpenAI.
MIT License
171 stars, 19 forks

Fix encoding strings containing js object inherited properties #6

Closed adamnyberg closed 1 year ago

adamnyberg commented 1 year ago

Hi @lhr0909,

I found a minor bug that causes tokenizer.encode() to fail when the input contains a string equal to the name of a property that JavaScript objects inherit (for example, toString).

See the included test for one example where the string toString causes a problem:

it('works with javascript object property strings', () => {
  const tokenizer = new GPT3Tokenizer({ type: 'codex' });
  const str = 'some_code toString some_more_code';
  const encoded = tokenizer.encode(str);
  expect(encoded.bpe).toEqual([11246, 62, 8189, 284, 10100, 617, 62, 3549, 62, 8189]);
  expect(tokenizer.decode(encoded.bpe)).toEqual(str);
});

Example error:

TypeError: this.bpe(...).split is not a function
at GPT3NodeTokenizer.encode (gpt3-tokenizer/dist/gpt3-tokenizer.cjs.development.js:190:41)
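
For readers unfamiliar with this class of bug, here is a minimal sketch of how such an error can arise. It assumes the tokenizer caches BPE results in a plain object keyed by token; that is a guess based on the error message, not something confirmed by the library's source. The names fakeMerge, bpeWithObjectCache, and bpeWithMapCache are hypothetical and exist only for illustration.

```typescript
// Hypothetical stand-in for a BPE merge step; the real library's logic differs.
function fakeMerge(token: string): string {
  return token.split('').join(' ');
}

// Buggy variant (assumed): a plain object inherits from Object.prototype,
// so looking up 'toString' finds the inherited function even though
// nothing was ever stored under that key.
function bpeWithObjectCache(cache: Record<string, string>, token: string): string {
  const hit = cache[token];
  if (hit) {
    return hit; // for 'toString' this is a function, not a string
  }
  const merged = fakeMerge(token);
  cache[token] = merged;
  return merged;
}

// Safer variant: a Map only returns values that were explicitly set.
function bpeWithMapCache(cache: Map<string, string>, token: string): string {
  const hit = cache.get(token);
  if (hit !== undefined) {
    return hit;
  }
  const merged = fakeMerge(token);
  cache.set(token, merged);
  return merged;
}

const buggy: unknown = bpeWithObjectCache({}, 'toString');
const safe: string = bpeWithMapCache(new Map(), 'toString');

console.log(typeof buggy); // "function", so calling .split on it would throw
console.log(safe.split(' ').length); // 8
```

Under the same assumption, other common ways to avoid the inherited lookup are creating the cache with Object.create(null) or guarding reads with Object.prototype.hasOwnProperty.call(cache, token). Which approach the PR actually takes is not stated in this thread.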

Please let me know if I need to adapt anything with the PR.

Thank you ✌🏻 Adam

lhr0909 commented 1 year ago

@adamnyberg Hey Adam, thanks for fixing it! I am merging this into the main branch and will release a new version.