dottxt-ai / outlines-core

Structured generation in Rust
Apache License 2.0

Move `TransformersTokenizer` back to the Python package #81

Open rlouf opened 5 hours ago

rlouf commented 5 hours ago

After https://github.com/dottxt-ai/outlines-core/pull/52, outlines-core no longer has tokenizer support, aside from the two copies of TransformerTokenizer in the test and benchmark code. What's the plan wrt. this?

If the plan is to use adapt_tokenizer to patch transformers tokenizers, it's not clear how that's an improvement over a custom tokenizer wrapper class and a conditional transformers dependency, for example. In general, we could move TransformerTokenizer back to outlines-core and make transformers optional; then outlines-core would be usable with llama-based tokenizers and we wouldn't need two copies for testing.
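A conditional-dependency wrapper along the lines described above might look like the following sketch. All names here (`TransformerTokenizer`, `from_pretrained`, the attribute names) are illustrative assumptions, not the actual outlines-core API:

```python
# Hypothetical sketch: a tokenizer wrapper with an *optional*
# `transformers` dependency. Names are illustrative, not the real API.


class TransformerTokenizer:
    """Adapts any Hugging Face-style tokenizer to the small interface
    structured generation needs (vocabulary, EOS id, decoding)."""

    def __init__(self, tokenizer):
        self.tokenizer = tokenizer
        self.vocabulary = tokenizer.get_vocab()
        self.eos_token_id = tokenizer.eos_token_id

    @classmethod
    def from_pretrained(cls, model_name: str):
        # `transformers` is imported lazily, so it stays optional:
        # importing this module never requires it.
        try:
            from transformers import AutoTokenizer
        except ImportError as e:
            raise ImportError(
                "`transformers` is required for "
                "TransformerTokenizer.from_pretrained; install it "
                "with `pip install transformers`."
            ) from e
        return cls(AutoTokenizer.from_pretrained(model_name))

    def decode(self, token_ids):
        return self.tokenizer.batch_decode(token_ids)
```

Because `__init__` accepts any object with the right methods, llama-based tokenizers (or test doubles) can be wrapped without pulling in transformers at all.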

Originally posted by @brandonwillard in https://github.com/dottxt-ai/outlines-core/issues/2#issuecomment-2403490462

rlouf commented 5 hours ago

The clean solution would be to use the tokenizers crate to remove the dependency on transformers in the Python package. In the meantime, it is unreasonable to ask downstream libraries to implement their own version of adapt_tokenizer, since it is always required to use the package.
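For illustration, the `tokenizers` Python package (bindings to the same Rust `tokenizers` crate) can already build and drive a tokenizer with no `transformers` dependency. A minimal offline sketch with a toy word-level vocabulary (real use would instead load a `tokenizer.json` via `Tokenizer.from_file` or `Tokenizer.from_pretrained`):

```python
# Sketch: driving a tokenizer through the `tokenizers` package (Python
# bindings to the Rust `tokenizers` crate), without `transformers`.
from tokenizers import Tokenizer
from tokenizers.models import WordLevel
from tokenizers.pre_tokenizers import Whitespace

# Toy word-level vocabulary, purely for demonstration.
vocab = {"<unk>": 0, "hello": 1, "world": 2}
tokenizer = Tokenizer(WordLevel(vocab, unk_token="<unk>"))
tokenizer.pre_tokenizer = Whitespace()

encoding = tokenizer.encode("hello world")
print(encoding.ids)     # token ids for "hello world"
print(encoding.tokens)  # the matched tokens
```

If the Rust side consumed tokenizers through this crate directly, the Python package would only need `transformers` for users who start from a `transformers` tokenizer object.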