ankane / tokenizers-ruby

Fast state-of-the-art tokenizers for Ruby
Apache License 2.0
132 stars 6 forks

Add optional punctuation cleanup during decoding - clean_up_tokenization equivalent #33

Closed w-zygmuntowicz closed 7 months ago

w-zygmuntowicz commented 7 months ago

Hi there!

I'm really glad you have made this gem available. It's just the best. There's just one small issue: when you decode a previously encoded string containing punctuation marks, extra spaces are added before them.

E.g.

tokenizer = Tokenizers.from_pretrained(TOKENIZER_ID)
encoding = tokenizer.encode("Who are you?")
tokenizer.decode(encoding.ids) # => "Who are you ?"

I did some research into how this is handled in the Python implementation: there is a method called clean_up_tokenization, which is called in the last if statement of the _decode method.

So I have a few questions:

  1. Would you consider adding a similar feature to this gem for optional punctuation cleanup?
  2. If so, would it be better to implement this cleanup logic in Ruby or Rust?

I'm happy to contribute to a solution if this change aligns with your project goals. I'm comfortable working in either Ruby or Rust. Please let me know how you'd like to proceed!
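For reference, the cleanup described above is a chain of simple string replacements in the Python transformers library. A minimal Ruby sketch of an equivalent could look like this; the method name and the exact replacement list are illustrative, not part of this gem's API:

```ruby
# Hypothetical Ruby port of transformers' clean_up_tokenization:
# removes the extra spaces a decoder leaves before punctuation
# and around common English contractions.
def clean_up_tokenization(out_string)
  out_string
    .gsub(" .", ".")
    .gsub(" ?", "?")
    .gsub(" !", "!")
    .gsub(" ,", ",")
    .gsub(" ' ", "'")
    .gsub(" n't", "n't")
    .gsub(" 'm", "'m")
    .gsub(" 's", "'s")
    .gsub(" 've", "'ve")
    .gsub(" 're", "'re")
end

puts clean_up_tokenization("Who are you ?")  # => "Who are you?"
```

Since it is plain string manipulation, it could live on either the Ruby or the Rust side; doing it in Ruby would avoid touching the native extension.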

ankane commented 7 months ago

Hi @w-zygmuntowicz, which tokenizer are you using (if it's a public one)? It looks like Tokenizers has this functionality built-in.

w-zygmuntowicz commented 7 months ago

Ahh, I see. It seems I'm using a BPE decoder. It's a public model based on HerBERT for the Polish language.

So, correct me if I'm wrong, this issue should probably be in the other repository?

Edit:

To be more specific, I'm using this model from Hugging Face: ipipan/silver-retriever-base-v1

ankane commented 7 months ago

I'd first check if the Python output matches the Ruby output.

from tokenizers import Tokenizer

tokenizer = Tokenizer.from_pretrained('ipipan/silver-retriever-base-v1')
encoding = tokenizer.encode('...')
print(tokenizer.decode(encoding.ids))

If so, it might be better addressed upstream.

w-zygmuntowicz commented 7 months ago

I have checked that and the outputs don't match:

| input | ruby | python (tokenizers) | python (transformers) |
| --- | --- | --- | --- |
| How are you? | How are you ? | How are you ? | `<s>`How are you? `</s>` |

Edit:

Oh wait, I had been using a different Python package: transformers, not tokenizers. Let me check the results again.

Okay, I have updated the table, and it looks like the output is the same. I was confused because I was looking at a different Python package. I'm closing the issue now. Thank you for your work and help!

w-zygmuntowicz commented 7 months ago

Looks like it's resolved! :sweat_smile: