diasks2 / pragmatic_segmenter

Pragmatic Segmenter is a rule-based sentence boundary detection gem that works out-of-the-box across many languages.
MIT License

Infinite Loop #56

Open censored-- opened 5 years ago

censored-- commented 5 years ago

Hi,

While using this great tool to preprocess Wikipedia dumps, I ran into an infinite loop that eventually failed with NoMemoryError.

Example:

When we input

'' (a '\0 !\0')

with the "en" option to Pragmatic Segmenter, the call sub_4 = sub_characters(sub_3, '!', '&ᓴ&') at https://github.com/diasks2/pragmatic_segmenter/blob/master/lib/pragmatic_segmenter/punctuation_replacer.rb#L55 goes into an infinite loop.

I'm wondering whether we can solve this problem by escaping '\0' in the sub_characters function:

def sub_characters(string, char_a, char_b)
  # Double the backslash in front of 0 so the second gsub! does not treat it as a backreference
  sub = string.gsub(char_a, char_b).gsub('\\0', '\\\\\0')
  @text.gsub!(/#{Regexp.escape(string)}/, sub)
  sub
end
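
For readers following along, here is a minimal standalone sketch of why the escaping helps. It assumes the problematic input is the eight-character string a \0 !\0 (a literal backslash followed by the digit 0), which is a guess based on the example above, and it uses local variables in place of the gem's @text. In a gsub replacement string, \0 means "the whole match", so an unescaped \0 pastes the original text, including its '!', back into the result. That is presumably what keeps the segmenter from ever finishing.

```ruby
# Assumed sample input: single quotes keep the backslash, so "\0" here is
# two characters (a backslash and the digit 0), not a NUL byte.
text = 'a \0 !\0'

# First step of sub_characters: swap "!" for the placeholder.
replacement = text.gsub('!', '&ᓴ&')

# Unescaped: every "\0" in the replacement expands to the whole match,
# so the original text (with its "!") is copied back in and the string grows.
naive = text.gsub(/#{Regexp.escape(text)}/, replacement)

# Escaped, as proposed above: "\0" becomes "\\0", which the second gsub
# reads as a literal backslash followed by 0.
escaped = replacement.gsub('\\0', '\\\\\0')
fixed = text.gsub(/#{Regexp.escape(text)}/, escaped)

puts naive.include?('!')   # the "!" is still present
puts fixed == replacement  # the placeholder version comes through intact
```

With the escape in place, the result no longer contains '!', so any logic that keeps replacing until no '!' remains can terminate.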

Thanks!

wflanagan commented 5 years ago

We have this same problem, though I haven't managed to figure out the character sequence that is causing it. I'll try doing your gsub and see if it fixes it.

dbourget commented 6 months ago

We've encountered this problem as well. This can be fixed by replacing:

@text.gsub!(/#{Regexp.escape(string)}/, sub)

with:

@text.gsub!(string, sub)

There is no need to use a regexp, since we want an exact match. I would submit a PR, but this package seems unmaintained, judging by the age and seriousness of the open issues.
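
One caveat that may be worth verifying against the failing input: Ruby's String#gsub still interprets backreferences such as \0 in a replacement string even when the pattern is a plain string. The block form, by contrast, inserts the block's return value verbatim. The sketch below is only an illustration under assumptions: the sample strings are guessed from the example earlier in this thread, and the local variables stand in for the gem's @text and sub.

```ruby
# Assumed sample values; single quotes keep "\0" as backslash + digit 0.
text = 'a \0 !\0'
sub  = 'a \0 &ᓴ&\0'

# String pattern: matched literally, but "\0" in the replacement still
# expands to the matched text, so the "!" comes back.
with_string = text.gsub(text, sub)

# Block form: the block's return value is inserted as-is, with no
# backreference processing.
with_block = text.gsub(text) { sub }

puts with_string.include?('!')  # the "!" is still present
puts with_block == sub          # replacement inserted verbatim
```

If that holds for the real input, a hypothetical variant of the line above would be @text.gsub!(string) { sub }, which needs neither the regexp nor the extra escaping.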