epfml / disco

Decentralized & federated privacy-preserving ML training, using p2p networking, in JS
Apache License 2.0

Add tokenization support to Disco LLMs #646

Closed JulienVig closed 3 months ago

JulienVig commented 4 months ago

Previous and current work on LLM integration in DISCO relies on pre-tokenized datasets and doesn't account for token decoding after inference.
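To make the gap concrete, here is a toy sketch of the round trip that is currently missing: a pre-tokenized dataset only contains the ids produced by `encode`, and turning model output back into text needs a matching `decode`. The word-level vocabulary and the names `vocab`, `encode` and `decode` are made up for illustration; they are not DISCO's API, and real LLM tokenizers work on subwords.

```javascript
// Toy word-level tokenizer; real LLM tokenizers use subword vocabularies.
// `vocab`, `encode` and `decode` are hypothetical names for illustration only.
const vocab = ['<unk>', 'i', 'love', 'transformers', '!'];
const tokenToId = new Map(vocab.map((token, id) => [token, id]));

// Text -> token ids: what a pre-tokenized dataset already contains.
function encode(text) {
  const tokens = text.toLowerCase().match(/\w+|[^\w\s]/g) ?? [];
  return tokens.map((token) => tokenToId.get(token) ?? 0); // 0 is <unk>
}

// Token ids -> text: the step needed after inference.
function decode(ids) {
  return ids.map((id) => vocab[id] ?? '<unk>').join(' ');
}

const ids = encode('I love transformers!');
console.log(ids, decode(ids));
```

Without the vocabulary on the client side, the ids a model generates cannot be mapped back to text, which is why decoding has to ship together with tokenization.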

Full tokenizer support would allow:

JulienVig commented 4 months ago

```js
import { AutoTokenizer } from '@xenova/transformers';

const tokenizer = await AutoTokenizer.from_pretrained('Xenova/bert-base-uncased');
const { input_ids } = await tokenizer('I love transformers!');
```



* [`gpt-tokenizer`](https://github.com/niieani/gpt-tokenizer) is another alternative, which @peacefulotter used in their experiments. `gpt-tokenizer` builds on OpenAI's tiktoken library and only covers GPT tokenizers.

* For Llama models, [`llama-tokenizer-js`](https://github.com/belladoreai/llama-tokenizer-js) implements the SentencePiece BPE algorithm. This implementation is the [basis for `Transformers.js`'s tokenizer implementation](https://github.com/belladoreai/llama-tokenizer-js/issues/9).

However, none of these libraries offers tokenizer training capabilities.
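For reference, this is roughly what "training" means for a BPE tokenizer: repeatedly finding the most frequent adjacent symbol pair in a corpus and recording it as a merge rule. The sketch below is a deliberately minimal, character-level version under simplifying assumptions (whitespace pre-splitting, no byte fallback, no special tokens). The function names `countPairs`, `mergePair` and `trainBpe` are hypothetical and do not come from any of the libraries above.

```javascript
// Toy byte-pair-encoding (BPE) trainer: a sketch, not a production tokenizer.
// All function names here are hypothetical, not part of any library above.

// Count how often each adjacent symbol pair occurs across the corpus.
function countPairs(words) {
  const counts = new Map();
  for (const symbols of words) {
    for (let i = 0; i < symbols.length - 1; i++) {
      const key = symbols[i] + '\u0000' + symbols[i + 1];
      counts.set(key, (counts.get(key) ?? 0) + 1);
    }
  }
  return counts;
}

// Replace every occurrence of the pair (left, right) with the merged symbol.
function mergePair(symbols, left, right) {
  const merged = [];
  let i = 0;
  while (i < symbols.length) {
    if (i < symbols.length - 1 && symbols[i] === left && symbols[i + 1] === right) {
      merged.push(left + right);
      i += 2;
    } else {
      merged.push(symbols[i]);
      i += 1;
    }
  }
  return merged;
}

// Learn up to `numMerges` merge rules from whitespace-separated text.
function trainBpe(text, numMerges) {
  let words = text.split(/\s+/).filter(Boolean).map((w) => Array.from(w));
  const merges = [];
  for (let step = 0; step < numMerges; step++) {
    let best = null;
    let bestCount = 0;
    // Ties go to the pair seen first (Map iteration follows insertion order).
    for (const [key, count] of countPairs(words)) {
      if (count > bestCount) { best = key; bestCount = count; }
    }
    if (best === null || bestCount < 2) break; // nothing left worth merging
    const [left, right] = best.split('\u0000');
    merges.push([left, right]);
    words = words.map((symbols) => mergePair(symbols, left, right));
  }
  return merges;
}

const merges = trainBpe('low low low lower lowest', 2);
console.log(merges); // [ [ 'l', 'o' ], [ 'lo', 'w' ] ]
```

The learned merge list, applied in order, is what an encoder would replay on new text. A real implementation would also need to persist the resulting vocabulary so that every peer tokenizes identically.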