Entomy / LibLangly

The combined Langly runtime
https://entomy.github.io/LibLangly/

Word Boundary detection #156

Open Entomy opened 4 years ago

Entomy commented 4 years ago

Methods like Words() are supposed to split text into words, but they don't. They split on spaces, which aren't the only kind of word boundary. Words() should also be removing non-word components, but it isn't.

To fix this, a proper implementation of word boundary detection is required. UAX #29, section 4 (Word Boundaries) describes this.
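
To illustrate the intended behavior (not the library's actual code), here is a rough Python sketch. The names `is_word_char` and `words` are hypothetical, and the rule used is a big simplification of the UAX #29 word boundary rules: a word is taken to be a run of letters, combining marks, and decimal digits, and everything else is treated as a boundary and dropped.

```python
import unicodedata


def is_word_char(ch: str) -> bool:
    """Assumed definition of a word character: letter, mark, or decimal digit."""
    cat = unicodedata.category(ch)
    return cat[0] in ("L", "M") or cat == "Nd"


def words(text: str) -> list[str]:
    """Split text into runs of word characters, dropping everything else."""
    result = []
    current = []
    for ch in text:
        if is_word_char(ch):
            current.append(ch)
        elif current:
            # A non-word character ends the current word and is discarded.
            result.append("".join(current))
            current = []
    if current:
        result.append("".join(current))
    return result


print(words("Hello, world! foo-bar\tbaz"))
```

This splits on punctuation and tabs as well as spaces, and the punctuation is removed from the output. It is too aggressive compared to the real rules, though: UAX #29 keeps something like "can't" together, while this sketch would break it at the apostrophe.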

Entomy commented 4 years ago

this and this describe an issue with ZWSP, along with the debate around it. I've settled on a solution that keeps its Cf classification instead of Zs, but also ensures that it is detected as a word boundary. So ZWSP (U+200B) absolutely must be recognized as a word boundary.
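
A small Python sketch of that decision, under the assumption that boundaries are checked per character. The names `EXTRA_BOUNDARIES`, `is_boundary`, and `split_on_boundaries` are made up for illustration and are not part of the library. The general category of U+200B is left as Cf, and the character is listed explicitly as a boundary instead of being reclassified as a space.

```python
import unicodedata

ZWSP = "\u200b"

# Characters that count as word boundaries even though they are not
# classified as whitespace. This set is an assumption for the sketch.
EXTRA_BOUNDARIES = {ZWSP}


def is_boundary(ch: str) -> bool:
    """Whitespace, or an explicitly listed boundary character such as ZWSP."""
    return ch.isspace() or ch in EXTRA_BOUNDARIES


def split_on_boundaries(text: str) -> list[str]:
    """Split text at boundary characters, discarding the boundaries."""
    result = []
    current = []
    for ch in text:
        if is_boundary(ch):
            if current:
                result.append("".join(current))
                current = []
        else:
            current.append(ch)
    if current:
        result.append("".join(current))
    return result


# The classification itself is untouched: ZWSP is still a format character.
print(unicodedata.category(ZWSP))
print(split_on_boundaries("foo" + ZWSP + "bar baz"))
```

A plain whitespace split misses this case, because a Cf character is not treated as whitespace, which is why the explicit boundary check is needed.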

Entomy commented 4 years ago

Apologies for the transfer spam. This definitely belongs here now.