Words surrounded by backwards quotation marks causing inaccurate tokenization results

NMZivkovic / BertTokenizers

Open source project for BERT Tokenizers in C#.

MIT License

83 stars, 22 forks

Issue #17

Open: rghavimi opened this issue 1 year ago

rghavimi commented 1 year ago

It seems that backwards quotation marks in the text (“end“) cause tokenization results that differ from the Python implementations. This is the only inconsistency I've run into so far. Curious whether anyone else has seen similar issues.

Example: “ends tokenizes to ##end and ##s instead of ##ends.
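For anyone comparing against Python, here is a rough sketch of the kind of punctuation check that the reference BERT basic tokenizer applies before WordPiece runs. It treats any character whose Unicode category starts with "P" as punctuation, and “ (U+201C) falls into category Pi, so it is split off as its own piece. This assumes the mismatch comes from the pre-tokenization step, which has not been verified against this library. The function names `is_punctuation` and `split_on_punctuation` are illustrative, not this library's API.

```python
import unicodedata


def is_punctuation(ch):
    """Return True if ch counts as punctuation under BERT-style rules."""
    cp = ord(ch)
    # ASCII symbol ranges are treated as punctuation, including ones
    # like "$" or "^" that Unicode does not put in a "P" category.
    if (33 <= cp <= 47) or (58 <= cp <= 64) or (91 <= cp <= 96) or (123 <= cp <= 126):
        return True
    # Everything else: rely on the Unicode general category.
    return unicodedata.category(ch).startswith("P")


def split_on_punctuation(text):
    """Split text so that every punctuation character becomes its own piece."""
    pieces = []
    current = []
    for ch in text:
        if is_punctuation(ch):
            if current:
                pieces.append("".join(current))
                current = []
            pieces.append(ch)
        else:
            current.append(ch)
    if current:
        pieces.append("".join(current))
    return pieces


if __name__ == "__main__":
    print(unicodedata.category("\u201c"))      # Pi
    print(split_on_punctuation("\u201cends"))  # ['“', 'ends']
```

If the C# side does not treat “ as punctuation at this stage, the quote stays attached to the word. That would change what WordPiece sees and could plausibly explain the different pieces, though this is a guess rather than a confirmed diagnosis.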