Closed w-zygmuntowicz closed 7 months ago
Hi @w-zygmuntowicz, which tokenizer are you using (if it's a public one)? It looks like Tokenizers has this functionality built-in.
Ahh, I see: it seems I'm using a BPE decoder. It's a public model based on HerBERT for the Polish language.
So, correct me if I'm wrong, but this issue should probably be filed in the other repository?
Edit:
To be more specific, I'm using this model from Hugging Face: ipipan/silver-retriever-base-v1
I'd first check if the Python output matches the Ruby output.
```python
from tokenizers import Tokenizer

tokenizer = Tokenizer.from_pretrained('ipipan/silver-retriever-base-v1')
encoding = tokenizer.encode('...')
print(tokenizer.decode(encoding.ids))
```
If so, it might be better addressed upstream.
I have checked that and the outputs don't match:
| input | ruby | python (tokenizers) | python (transformers) |
|---|---|---|---|
| How are you? | How are you ? | How are you ? | `<s>How are you? </s>` |
Edit:
Oh wait, I have been using another Python package: I used `transformers`, not `tokenizers`. Let me check the results again.
Okay, I have updated the table and it looks like the output is the same. I was confused because I was looking at another Python package. I'm closing the issue now. Thank you for your work and help!
Looks like it's resolved! :sweat_smile:
Hi there!
I'm really glad you have made this gem available; it's just the best. However, I have a small issue: when you decode a previously encoded string containing punctuation marks, extra spaces are added before them.
Eg.
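As a minimal illustration of the behavior (not the gem's actual code), joining word-level tokens with plain spaces produces exactly this artifact:

```python
# Naive detokenization: join tokens with spaces.
# Punctuation was split into its own token during encoding,
# so it gets a space in front of it on the way back.
tokens = ["How", "are", "you", "?"]
decoded = " ".join(tokens)
print(decoded)  # How are you ?
```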
I have done some research on how it's handled in the Python implementation: there is a method called `clean_up_tokenization`, and it's called in the last `if` statement of the `_decode` method.
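To my understanding, that cleanup step amounts to a handful of string replacements that remove the space before common punctuation and English contractions. Here is a sketch of that logic (reconstructed from the `transformers` source, so it may not match the exact upstream code):

```python
def clean_up_tokenization(out_string: str) -> str:
    # Remove spaces inserted before punctuation and contractions
    # during naive token joining.
    return (
        out_string.replace(" .", ".")
        .replace(" ?", "?")
        .replace(" !", "!")
        .replace(" ,", ",")
        .replace(" ' ", "'")
        .replace(" n't", "n't")
        .replace(" 'm", "'m")
        .replace(" 's", "'s")
        .replace(" 've", "'ve")
        .replace(" 're", "'re")
    )

print(clean_up_tokenization("How are you ?"))  # How are you?
```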
So I have a few questions:
I'm happy to contribute to a solution if this change aligns with your project goals. I'm comfortable working in either Ruby or Rust. Please let me know how you'd like to proceed!