dhowe / rita

Website, documentation and examples for RiTa
https://rednoise.org/rita

Dashes not handled correctly in tokenizer #176

Closed KarlieZhao closed 2 years ago

KarlieZhao commented 2 years ago

Tokenizer is not handling dashes correctly:

"To form a dash, type two hyphens—without a space before, after, or between them." should be tokenized as

['To', 'form', 'a', 'dash', ',', 'type', 'two', 'hyphens', '—', 'without', 'a', 'space', 'before', ',', 'after', ',', 'or', 'between', 'them', '.']

currently it's tokenized as

['To', 'form', 'a', 'dash', ',', 'type', 'two', 'hyphens-', '-', 'without', 'a', 'space', 'before', ',', 'after', ',', 'or', 'between', 'them', '.']

Dash code points to handle: U+2012 (figure dash), U+2013 (en dash), U+2014 (em dash)
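
For illustration, here is a minimal standalone sketch of the expected behavior. It is not RiTa's tokenizer: the names `DASHES`, `TOKEN_RE`, and `tokenize` are hypothetical, and the regex only covers the cases discussed in this ticket. It treats each of the three dash code points as its own token, while keeping ordinary hyphenated words together.

```python
import re

# Figure dash, en dash, em dash: the three code points listed in this ticket.
DASHES = "\u2012\u2013\u2014"

# A token is one of:
#   - a single dash character,
#   - a word, optionally with internal apostrophes or ASCII hyphens,
#   - any other single non-space, non-word character (punctuation).
TOKEN_RE = re.compile(rf"[{DASHES}]|\w+(?:['\-]\w+)*|[^\w\s]")


def tokenize(text):
    """Split text into word, dash, and punctuation tokens (illustrative only)."""
    return TOKEN_RE.findall(text)


sentence = (
    "To form a dash, type two hyphens\u2014without a space "
    "before, after, or between them."
)
print(tokenize(sentence))
```

With this pattern the em dash in the sample sentence comes out as a separate token between 'hyphens' and 'without', which matches the expected output above.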

dhowe commented 2 years ago

thanks!

be careful of smart quotes (in code, but also tickets):

image

dhowe commented 2 years ago

@KarlieZhao status ?

KarlieZhao commented 2 years ago

> @KarlieZhao status ?

done