chengchingwen / Transformers.jl

Julia Implementation of Transformer models
MIT License

Separate TextEncoder and Tokenizer from Transformers.jl #199

Open VarLad opened 6 days ago

VarLad commented 6 days ago

Currently I need to load a tokenizer from HuggingFace and use it simply for encoding and decoding sentences. Doing that through the Transformers.jl interface is already awkward (I had to go `tok = Transformers.HuggingFace.load_tokenizer("model") |> r -> BytePairEncoding.BPEEncoder(BytePairEncoding.BPETokenizer(r.tokenizer.tokenization), r.vocab)`, but that's its own issue), and for encoding and decoding purposes alone, loading something as big as Flux is not justified in my opinion.
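
For reference, here's roughly what that looks like end to end (the encode/decode calls assume BPEEncoder's tiktoken-style interface, and "model" is just a placeholder for the actual model name on the hub):

```julia
using Transformers, BytePairEncoding

# load the HuggingFace tokenizer config and rewrap it in the lightweight BPEEncoder
tok = Transformers.HuggingFace.load_tokenizer("model") |>
    r -> BytePairEncoding.BPEEncoder(BytePairEncoding.BPETokenizer(r.tokenizer.tokenization), r.vocab)

ids = tok.encode("I am a sentence")  # id vector from a string (assumes tiktoken-style encode)
str = tok.decode(ids)                # string back from the ids (assumes tiktoken-style decode)
```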

Is there any way that the text encoding and tokenization could be made part of something else, for example TextEncodeBase.jl (with the HuggingFace API part included as a HuggingFaceApi.jl extension)?

chengchingwen commented 6 days ago

It's kinda convoluted, but before that, could you elaborate on the awkwardness? It might be an important point for the overall design. As you might already know, BPE.jl has a lightweight interface (BPEEncoder) for the encoder. From your code snippet, I guess you probably like the lightweight interface more than the Transformers one. However, BPEEncoder is mostly for replicating openai/tiktoken, which is less flexible than huggingface's tokenizers. The behavior could be slightly different between the two interfaces.

The question is, when you say "I need to load a tokenizer from HuggingFace", which one do you want? In the HuggingFace ecosystem, the "tokenizer" could refer to two different objects. One is the wrapper over the rust huggingface/tokenizers package, and the other is a type in huggingface/transformers that wraps over the former and handles model- or training-specific functionality. The latter contains lots of assumptions based on the model, which means that if we want to split the tokenizer class out, we would also need to split out about half of the HuggingFace.jl loader code. Then where that code should live becomes the new problem.

VarLad commented 5 days ago

elaborate on the awkwardness

Apologies, this could be an issue on my side, but I was genuinely confused about what to use for something as simple as encoding (getting an id vector from a string) and decoding (getting a string back from an id vector).

The behavior could be slightly different between the two interfaces. The question is, when you say "I need to load a tokenizer from HuggingFace", which one do you want?

I wanted to use the tokenizer here: https://huggingface.co/RWKV/rwkv-4-430m-pile which I assume is to be used with HuggingFace's tokenizers library.

The latter contains lots of assumptions based on the model, which means if we want to split the tokenizer class out, we would also need to split out about half of the HuggingFace.jl loader code. Then where should those code live would be the new problem.

I have an idea for this: would it make sense to make all the HuggingFace tokenization stuff part of TextEncodeBase.jl (or BPE.jl, I'll let you decide that) as extensions that depend on HuggingFaceApi.jl? Users should be able to do `using HuggingFaceApi, TextEncodeBase` (or BPE.jl) and get access to all of Transformers.jl's tokenization-related capabilities. As for loading the model weights themselves, I'm not sure how you're doing this, but is that code dependent on Flux.jl? If not, it would make sense to decouple that part as well, since it would make it easier for people who don't want to use Flux to load the weights and write their own inference (as I'm doing). For this we could have a separate package like HuggingFaceTransformers.jl that solely exists to load transformer model weights from HuggingFace.


While the above is one suggestion, another alternative is to divide up the capabilities of Transformers.jl into different extensions: `using Transformers, HuggingFaceApi` loads the weight-loading capabilities, `using Transformers, Flux` loads everything related to Flux.jl, and `using Transformers, TextEncodeBase, HuggingFaceApi` loads tokenizer loading and everything for working with HuggingFace tokenizers. Dividing up the code like this will probably help with future refactoring as well as make it easier for others to contribute. A rough sketch of what this could look like from the user side is below.
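
To make the idea concrete, here's a hypothetical sketch of the user-facing side under the first suggestion (`load_hgf_tokenizer` is an illustrative name, nothing like it exists yet; `encode`/`decode` are the TextEncodeBase generics):

```julia
# Hypothetical usage once tokenizer loading lives in an extension of
# TextEncodeBase.jl (or BPE.jl) that is triggered by loading HuggingFaceApi.jl.
# No Flux, no Transformers.jl needed just to tokenize.
using HuggingFaceApi, TextEncodeBase

tok = load_hgf_tokenizer("RWKV/rwkv-4-430m-pile")  # illustrative loader name
ids = encode(tok, "I am a sentence")               # id vector from a string
str = decode(tok, ids)                             # string back from the ids
```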

VarLad commented 5 days ago

@chengchingwen As a side note, the tokenizer I loaded using the above method isn't decoding properly. The encoder works well, but the decoder gives me a string that still has the \u0120 character in it, when it's supposed to be replaced by a space during decoding: "\u0120am" should be decoded as " am", and I think that's how the GPT2 decoder is supposed to work too. This happens with some other characters as well.

chengchingwen commented 4 days ago

I was genuinely confused about what to use for something as simple as encoding (getting id vector from string) and decoding (getting back a string from id vector).

It's ok, this situation itself is messy. The thing is, all the implementations can get an id vector from a string and a string back from an id vector, but the result might not be the one you want. For example, do you need pre/post-processing (e.g. batching, padding, truncation), and do you want the tokenizer to behave the same as huggingface/transformers' tokenizer? Some functions and settings are hardcoded in the Python code (per model) and not serialized in the files on the hub.

As a side note, the tokenizer I loaded using the above method isn't decoding properly.

The behavior is actually expected; it involves a few implementation details. The BPE/vocab you load operate on the transformed codepoints (e.g. with the \u0120). The codepoint mapping/unmapping is part of the postprocessing, but BPEEncoder doesn't do any postprocessing other than a simple join.
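
For reference, this is roughly the unmapping step that the simple join skips. It's a minimal sketch of the standard GPT2-style byte-level decode table (not the actual Transformers.jl code), where e.g. the space byte 0x20 is shifted to the codepoint U+0120 during encoding and has to be shifted back during decoding:

```julia
# Build the reverse of the GPT2 byte-to-unicode table: transformed codepoint => original byte.
function gpt2_byte_decoder()
    # printable/safe bytes that map to themselves
    bs = vcat(0x21:0x7E, 0xA1:0xAC, 0xAE:0xFF)
    cs = collect(Int.(bs))
    n = 0
    for b in 0x00:0xFF
        if b ∉ bs
            push!(bs, b)
            push!(cs, 256 + n)  # remaining bytes get shifted past U+00FF
            n += 1
        end
    end
    Dict(Char(c) => UInt8(b) for (b, c) in zip(bs, cs))
end

# Map each transformed codepoint back to its original byte and rebuild the string.
unmap_gpt2(s::AbstractString, dec = gpt2_byte_decoder()) =
    String(UInt8[dec[c] for c in s])

unmap_gpt2("\u0120am")  # == " am"
```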

would it make sense to make all HuggingFace tokenization stuff a part of TextEncodeBase.jl (or BPE.jl, I'll let your decide that) as extensions that depend on HuggingfaceApi.jl? ...

Extension is definitely needed, but we still need a package to hold the loader code. I kinda wonder if we could move the loader code to HuggingFaceApi.jl and make it the parent module of the extensions. It would no longer be a lightweight package, but it seems Transformers.jl is the only dependent of HuggingFaceApi.jl, so probably nobody would really be against it. Then we can have a HuggingFaceApiTextEncodeBaseExt for the tokenizer.
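
Something like this hypothetical skeleton (names as in the comment above; none of this exists yet):

```julia
# Hypothetical extension module, assuming the loader code moves into
# HuggingFaceApi.jl and TextEncodeBase.jl becomes a weak dependency.
module HuggingFaceApiTextEncodeBaseExt

using HuggingFaceApi, TextEncodeBase

# The tokenizer-loading half of the current HuggingFace.jl loader code
# (reading the tokenizer config files from the hub and building a
# TextEncodeBase-based encoder) would live here.

end
```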