JuliaText / TextAnalysis.jl

Julia package for text analysis

prepare! strip_whitespace should trim the document text at the ends #100

Closed mikesafar closed 5 years ago

mikesafar commented 6 years ago

prepare!(doc, strip_whitespace) does not trim the whitespace from the end of the text. In my view, it's not just runs of multiple whitespace characters that should be collapsed; whitespace at the ends of the text should be trimmed as well.

It's a small issue, but that assumption on my part led to a big cascade of bugs downstream! Easily fixed with a replace(text, r"(^\s+)|(\s+$)", "") I know, but still....
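For readers on Julia 1.0 or later, here is a minimal sketch of that workaround in plain Julia. It assumes the text is already available as an ordinary string and makes no TextAnalysis.jl calls. The sample string is taken from later in this thread, and the variable names are only illustrative. Note that `replace` now takes a `pattern => replacement` pair rather than three positional arguments.

```julia
# Plain-Julia sketch of the workaround; no TextAnalysis.jl calls are used.
text = "  this is sample text.  Also a simple text.  "

# Regex form from the comment above, written with the Pair syntax
# that `replace` expects on Julia 1.0 and later.
trimmed_regex = replace(text, r"(^\s+)|(\s+$)" => "")

# Base.strip removes leading and trailing whitespace; it returns a
# SubString, so convert it if a String is needed downstream.
trimmed_strip = String(strip(text))

println(repr(trimmed_regex))
println(repr(trimmed_strip))
```

Both forms only touch the ends of the string; runs of whitespace inside the text are left alone.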

cchderrick commented 5 years ago

What is the definition of strip_whitespace? The documentation doesn't say. Here is what it does now, for example:

julia> using TextAnalysis
julia> doc = Document("  this is sample text.  Also a simple text.  ")
julia> prepare!(doc, strip_whitespace)
julia> doc.text
"this is sample text. Also a simple text. "

  1. two spaces at the start of the string => 0 space (good)
  2. two spaces after the first sentence => 1 space (good)
  3. two spaces at the end of the string => 1 space (expected?)
  4. all single spaces between words => 1 space (good)

zgornel commented 5 years ago

A simple definition would be that it replaces any occurrences of multiple bytes of value 0x20 with a single one. The whitespace at the end of the string would not qualify as 'strippable'.
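As a rough illustration of that definition, here is a plain-Julia sketch. The helper names `collapse_spaces` and `collapse_and_trim` are hypothetical and are not part of TextAnalysis.jl; this is not the package's implementation. Under this reading, a leading run of spaces also collapses to a single space, which differs from the output reported earlier in the thread, where the leading spaces disappear entirely.

```julia
# Hypothetical helper, not part of TextAnalysis.jl: collapse every run of
# two or more 0x20 bytes into a single space, leaving the ends untouched.
collapse_spaces(s::AbstractString) = replace(s, r" {2,}" => " ")

# Hypothetical stricter variant matching the request in this issue:
# collapse runs first, then trim whitespace from both ends.
collapse_and_trim(s::AbstractString) = String(strip(collapse_spaces(s)))

s = "  this is sample text.  Also a simple text.  "
println(repr(collapse_spaces(s)))
println(repr(collapse_and_trim(s)))
```

The two helpers differ only at the ends of the string, which is exactly the point under discussion here.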

oxinabox commented 5 years ago

I think the definition of strip whitespace should be: after it is run, there will be