moorereason closed this issue 8 years ago.
Hi,
Sorry for the late answer. Are you sure that byte slices work with Unicode characters? That was the reason why I chose string instead.
I'm not well-versed in Unicode. All of the tests pass.
Yes, I saw that now. I was a bit confused on this point.
Since you already utilize golang.org/x/text/unicode/norm, shouldn't that protect us as long as the stopwords themselves are normalized?
As far as I understood, normalization is a way to make the composition of glyphs uniform. For example, you can write the glyph 'é' either as a single code point or as 'e' followed by a combining acute accent. When you normalize a text, you choose the form you want to get (composed or decomposed, NFC or NFD in Go).
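To make that concrete, here is a small sketch using golang.org/x/text/unicode/norm (the same package mentioned above); the strings and the comparison are just an illustration, not code from the library:

```go
package main

import (
	"fmt"

	"golang.org/x/text/unicode/norm"
)

func main() {
	composed := "\u00e9"    // 'é' as a single precomposed code point
	decomposed := "e\u0301" // 'e' followed by a combining acute accent

	// Both render as the same glyph, but the byte sequences differ.
	fmt.Println(composed == decomposed) // false

	// After normalizing both to the same form (NFC here), they compare equal.
	fmt.Println(norm.NFC.String(composed) == norm.NFC.String(decomposed)) // true
}
```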
The problem that might occur is a wrong approach to iteration (e.g. iterating byte by byte would be wrong for the decomposed form in my example). But words are tokenized in my prototype and in the simhash library, so we iterate over each token and not over each byte.
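As a standard-library illustration of why byte-by-byte iteration breaks on the decomposed form (this is not code from the prototype or from simhash):

```go
package main

import "fmt"

func main() {
	s := "e\u0301" // decomposed 'é': one glyph, two runes, three bytes

	// Iterating byte by byte splits the combining accent into raw UTF-8 bytes.
	for i := 0; i < len(s); i++ {
		fmt.Printf("byte %d: 0x%x\n", i, s[i])
	}

	// Ranging over the string decodes whole runes instead.
	for i, r := range s {
		fmt.Printf("rune at byte offset %d: %q\n", i, r)
	}
}
```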
Using regexp.ReplaceAllString is very inefficient. I've refactored the main clean function to work on a byte slice and called the function Clean. CleanContent has been renamed to CleanString.
benchmark               old ns/op     new ns/op     delta
BenchmarkCleanContent   21159375      6637523       -68.63%

benchmark               old allocs    new allocs    delta
BenchmarkCleanContent   7129          4195          -41.16%

benchmark               old bytes     new bytes     delta
BenchmarkCleanContent   17182896      598000        -96.52%
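For context, here is a minimal sketch of the kind of refactor described, not the package's actual code: the function names and the cleaning rule are hypothetical, and the real Clean/CleanString signatures may differ. It contrasts a regexp.ReplaceAllString-based cleaner with a single pass over a byte slice.

```go
package main

import (
	"fmt"
	"regexp"
	"unicode"
	"unicode/utf8"
)

// nonWord matches runs of characters that are neither letters nor digits.
var nonWord = regexp.MustCompile(`[^\pL\pN]+`)

// cleanWithRegexp is the "before" shape: each call goes through the regexp
// engine and returns a newly allocated string.
func cleanWithRegexp(content string) string {
	return nonWord.ReplaceAllString(content, " ")
}

// cleanBytes is the "after" shape: one pass over a byte slice, decoding
// runes and appending into a single preallocated buffer.
func cleanBytes(content []byte) []byte {
	out := make([]byte, 0, len(content))
	lastWasSpace := false
	for _, r := range string(content) { // ranging over a string yields runes
		if unicode.IsLetter(r) || unicode.IsNumber(r) {
			out = utf8.AppendRune(out, r)
			lastWasSpace = false
		} else if !lastWasSpace {
			out = append(out, ' ')
			lastWasSpace = true
		}
	}
	return out
}

func main() {
	s := "Hello, wörld! 123"
	fmt.Println(cleanWithRegexp(s))            // "Hello wörld 123"
	fmt.Println(string(cleanBytes([]byte(s)))) // "Hello wörld 123"
}
```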