Closed KarlieZhao closed 2 years ago
Tokenizer is not handling dashes correctly:
"To form a dash, type two hyphens—without a space before, after, or between them." should be tokenized as
['To', 'form', 'a', 'dash', ',', 'type', 'two', 'hyphens', '—', 'without', 'a', 'space', 'before', ',', 'after', ',', 'or', 'between', 'them', '.']
currently it's tokenized as
['To', 'form', 'a', 'dash', ',', 'type', 'two', 'hyphens-', '-', 'without', 'a', 'space', 'before', ',', 'after', ',', 'or', 'between', 'them', '.']
Dash code points to handle: U+2012 (figure dash), U+2013 (en dash), U+2014 (em dash)
thanks!
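For reference, the expected behavior can be sketched in a few lines of Python. This is only an illustration of the splitting rule, not the project's tokenizer: the names `DASHES`, `TOKEN_RE`, and `tokenize` are hypothetical, and the regex is an assumption about one way to get each dash emitted as its own token.

```python
import re

# Hypothetical sketch, not the project's actual tokenizer.
# The three dash code points listed in this ticket.
DASHES = "\u2012\u2013\u2014"

# Order matters: try a single dash first, then a word (optionally with
# internal apostrophes), then any other single punctuation character.
TOKEN_RE = re.compile(rf"[{DASHES}]|\w+(?:'\w+)*|[^\w\s]")


def tokenize(text):
    """Split text into words and punctuation, keeping each dash separate."""
    return TOKEN_RE.findall(text)


sentence = ("To form a dash, type two hyphens\u2014without a space "
            "before, after, or between them.")
print(tokenize(sentence))
```

With this rule the dash never stays attached to 'hyphens' and is never rewritten as hyphens; it comes out as a single token between 'hyphens' and 'without'.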
Be careful of smart quotes (in code, but also in tickets).
@KarlieZhao status ?
done