bbalet / stopwords

Removes the most frequent words (stop words) from text content. Based on a curated list of language statistics.

Work with a byte slice instead of string #1

Closed. moorereason closed this 8 years ago.

moorereason commented 8 years ago

Using regexp.ReplaceAllString is very inefficient. I've refactored the main clean function to work on a byte slice and called the function Clean. CleanContent has been renamed to CleanString.
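For illustration, a minimal sketch of the difference (the names `cleanString` and `clean` and the pattern are placeholders, not the library's actual API; the real functions also handle language selection and tokenization):

```go
package main

import (
	"fmt"
	"regexp"
)

// Illustrative pattern standing in for a real stop-word regexp.
var stopWordsRe = regexp.MustCompile(`(?i)\b(the|a|of)\b`)

// cleanString works on strings, as the original CleanContent did;
// each call builds and returns a new string.
func cleanString(content string) string {
	return stopWordsRe.ReplaceAllString(content, " ")
}

// clean works on a byte slice, as the refactored Clean does,
// avoiding string/[]byte conversions on large inputs.
func clean(content []byte) []byte {
	return stopWordsRe.ReplaceAll(content, []byte(" "))
}

func main() {
	fmt.Println(cleanString("the quick brown fox"))
	fmt.Println(string(clean([]byte("the quick brown fox"))))
}
```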

benchmark              old ns/op     new ns/op     delta
BenchmarkCleanContent  21159375      6637523       -68.63%

benchmark              old allocs    new allocs    delta
BenchmarkCleanContent  7129          4195          -41.16%

benchmark              old bytes     new bytes     delta
BenchmarkCleanContent  17182896      598000        -96.52%
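These figures are in the format printed by benchcmp (golang.org/x/tools/cmd/benchcmp) when comparing two `go test -bench=. -benchmem` runs. A self-contained benchmark in that spirit (not the repository's actual test file) might look like:

```go
package stopwords_test

import (
	"regexp"
	"strings"
	"testing"
)

// Toy inputs; the repository benchmarks a much larger corpus.
var (
	re       = regexp.MustCompile(`(?i)\b(the|a|of|and)\b`)
	text     = strings.Repeat("the quick brown fox and a lazy dog ", 1000)
	textByte = []byte(text)
)

// Replacement over a string, as in the original CleanContent.
func BenchmarkReplaceString(b *testing.B) {
	for i := 0; i < b.N; i++ {
		_ = re.ReplaceAllString(text, " ")
	}
}

// Replacement over a byte slice, as in the refactored Clean.
func BenchmarkReplaceBytes(b *testing.B) {
	for i := 0; i < b.N; i++ {
		_ = re.ReplaceAll(textByte, []byte(" "))
	}
}
```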

bbalet commented 8 years ago

Hi,

Sorry for the late answer. Are you sure that byte slices work with Unicode characters? That was the reason I chose string instead.
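For context: Go strings and byte slices both hold UTF-8 encoded bytes, and the regexp package matches UTF-8 in either form. A quick check along these lines (illustrative only, not repository code) shows non-ASCII text surviving the []byte round trip:

```go
package main

import (
	"fmt"
	"regexp"
	"unicode/utf8"
)

func main() {
	s := "héllo wörld 日本語"
	b := []byte(s) // same UTF-8 bytes as the string

	fmt.Println(utf8.Valid(b))             // true: still valid UTF-8
	fmt.Println(string(b) == s)            // true: lossless round trip
	fmt.Println(utf8.RuneCount(b), len(b)) // rune count and byte count differ

	// regexp understands UTF-8 on byte slices too.
	re := regexp.MustCompile(`\p{Han}+`)
	fmt.Printf("%s\n", re.Find(b)) // 日本語
}
```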

moorereason commented 8 years ago

I'm not well-versed in Unicode. All of the tests pass.

bbalet commented 8 years ago

Yes, I see that now. I was a bit confused on this point.

moorereason commented 8 years ago

Since you already utilize golang.org/x/text/unicode/norm, shouldn't that protect us as long as the stopwords themselves are normalized?
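A minimal sketch of that idea, assuming both sides are forced into the same normal form before matching (`containsWord` is a toy stand-in, not the library's matcher):

```go
package main

import (
	"bytes"
	"fmt"

	"golang.org/x/text/unicode/norm"
)

// containsWord is a toy stand-in for the library's matching: it checks whether
// a stop word appears in content after both are normalized to NFC.
func containsWord(content, stopword []byte) bool {
	return bytes.Contains(norm.NFC.Bytes(content), norm.NFC.Bytes(stopword))
}

func main() {
	// "déjà" typed with combining accents vs. precomposed characters.
	content := []byte("de\u0301ja\u0300 vu")
	stopword := []byte("déjà")

	fmt.Println(bytes.Contains(content, stopword)) // false: different byte sequences
	fmt.Println(containsWord(content, stopword))   // true: NFC on both sides
}
```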

bbalet commented 8 years ago

As far as I understood, normalization is a way to make uniform how glyphs are composed. For example, you can write the glyph 'é' either as a single precomposed character or as 'e' followed by a combining acute accent. When you normalize a text you choose which form you want (composed or decomposed, NFC or NFD in Go).
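To make the 'é' example concrete, here is a small demonstration of the two representations and of converting between them with golang.org/x/text/unicode/norm (illustrative, not repository code):

```go
package main

import (
	"fmt"

	"golang.org/x/text/unicode/norm"
)

func main() {
	composed := "\u00e9"    // 'é' as a single precomposed rune (2 UTF-8 bytes)
	decomposed := "e\u0301" // 'e' followed by a combining acute accent (3 bytes)

	fmt.Println(composed == decomposed)                  // false: different bytes
	fmt.Println(len(composed), len(decomposed))          // 2 3
	fmt.Println(norm.NFC.String(decomposed) == composed) // true after composing
	fmt.Println(norm.NFD.String(composed) == decomposed) // true after decomposing
}
```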

The problem that might occur is iterating the wrong way (e.g. byte by byte would be wrong for the decomposed case in my example). But words are tokenized in my prototype and by the simhash library, so we iterate over tokens and not over bytes.
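A short illustration of that point, assuming simple whitespace tokenization as a stand-in for the segmenter used by the prototype and simhash:

```go
package main

import (
	"bytes"
	"fmt"
)

func main() {
	text := []byte("le café est chaud")

	// Byte count and rune count differ because 'é' takes two bytes,
	// which is why naive byte-by-byte iteration could split a character.
	fmt.Println(len(text), "bytes,", len(bytes.Runes(text)), "runes")

	// Iterating over whitespace-separated tokens never looks inside a rune,
	// so multi-byte characters stay intact.
	for _, token := range bytes.Fields(text) {
		fmt.Printf("%q ", token)
	}
	fmt.Println()
}
```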