Closed PaulCalot closed 2 years ago
Thank you so much! This will fix a lot of other problems that were reported.
Hey, thanks again for your input! This fix is now part of 1.2.0. I will close this ticket, and we can open a new one if there are any problems.
Cheers
It seems this change introduced an infinite loop when tokenizing some sentences, for example "El Patrón Repositorio y sus falacias".
Hello there,
First, thanks for the very useful package.
I noticed a difference in behavior that is reproducible with the word "eiffel". Using HuggingFace bert-base-uncased, the tokenization yields: 'e', '##iff', '##el'. However, when using the BERT base uncased tokenizer of this package, I simply get '[UNK]'.
Since I am not sure whether you wanted it to work this way or not, I decided to open an issue for it.
To solve it, I changed a few lines in TokenizerBase.cs: the change allows subwords of length 1 and replaces only the first occurrence of the matched subword in the word with '##' (which should be done in any case).
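To illustrate the idea, here is a minimal Python sketch of greedy longest-match-first WordPiece splitting. It is not the code from TokenizerBase.cs; the function name `wordpiece`, the `min_len` parameter, and the toy vocabulary are hypothetical and exist only for this example. It assumes the '[UNK]' result comes from never trying one-character candidates, which is consistent with what I observed but not something I verified in every code path.

```python
def wordpiece(word, vocab, unk="[UNK]", min_len=1):
    """Greedy longest-match-first split of one word (illustrative sketch).

    min_len is a hypothetical knob: min_len=2 mimics a tokenizer that
    never tries one-character subwords.
    """
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        # Shrink the candidate from the right until it is in the vocabulary.
        while end - start >= min_len:
            candidate = word[start:end]
            if start > 0:
                # Continuation pieces carry the '##' prefix.
                candidate = "##" + candidate
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            # No piece fits at this position: the whole word is unknown.
            return [unk]
        pieces.append(match)
        # Advance by index instead of string replacement, so a subword
        # that occurs twice in the word cannot be consumed at the wrong spot.
        start = end
    return pieces


# Toy vocabulary, assumed for illustration only (not the real BERT vocab).
toy_vocab = {"e", "##iff", "##el"}

print(wordpiece("eiffel", toy_vocab))             # one-character pieces allowed
print(wordpiece("eiffel", toy_vocab, min_len=2))  # one-character pieces skipped
```

With one-character pieces allowed, the first call splits the word into 'e', '##iff', '##el'; with `min_len=2` the leading 'e' can never match, so the whole word falls back to '[UNK]'. Because `start` moves forward on every match, the outer loop always terminates in this sketch.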
For the ~200 sentences I tested and compared against the HF version, it worked as expected; however, I did not run any more thorough tests.
Cheers,
Paul