Closed by JulienVig 3 months ago
Transformers.js may be an efficient off-the-shelf solution for adding pre-trained tokenizer support in Disco.
The library ports the Hugging Face Transformers library to JavaScript, including tokenizer support:
```js
import { AutoTokenizer } from '@xenova/transformers';

const tokenizer = await AutoTokenizer.from_pretrained('Xenova/bert-base-uncased');
const { input_ids } = await tokenizer('I love transformers!');
```
* [`gpt-tokenizer`](https://github.com/niieani/gpt-tokenizer) is another alternative, which @peacefulotter used in their experiments. `gpt-tokenizer` is a JavaScript port of OpenAI's tiktoken library and therefore only covers GPT tokenizers.
* For Llama models, [`llama-tokenizer-js`](https://github.com/belladoreai/llama-tokenizer-js) implements the SentencePiece BPE algorithm. This implementation is the [basis for the `Transformers.js` tokenizer implementation](https://github.com/belladoreai/llama-tokenizer-js/issues/9).
However, none of these libraries offers tokenizer training capabilities.
Previous and current work on integrating LLMs into DISCO relies on pre-tokenized datasets and does not account for token decoding after inference.
Full tokenizer support would allow: