Open censored-- opened 5 years ago
We have this same problem, though I haven't managed to figure out the character sequence that is causing it. I'll try doing your gsub and see if it fixes it.
We've encountered this problem as well. This can be fixed by replacing:
@text.gsub!(/#{Regexp.escape(string)}/, sub)
with:
@text.gsub!(string, sub)
There is no need to use a regexp since we want an exact match. I would submit a PR, but this package seems unmaintained, judging by the age and seriousness of the open issues.
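A minimal sketch of the point about exact matching, assuming plain Ruby outside the library: a String pattern passed to gsub is matched literally, so for this input it gives the same result as the escaped regexp. The helper names and sample text are hypothetical; only the two gsub forms come from the comment above.

```ruby
# Hypothetical helpers wrapping the two forms discussed above.
# The names and the sample text are illustrative, not taken from the library.

# Original form: build a regexp from the escaped literal.
def replace_with_escaped_regexp(text, string, sub)
  text.gsub(/#{Regexp.escape(string)}/, sub)
end

# Proposed form: a String pattern is already matched literally.
def replace_with_plain_string(text, string, sub)
  text.gsub(string, sub)
end

text   = "Wait (what?!) 1+1=2"
string = "(what?!)"                  # contains regexp metacharacters
sub    = string.gsub("!", "&ᓴ&")     # => "(what?&ᓴ&)"

puts replace_with_escaped_regexp(text, string, sub)  # Wait (what?&ᓴ&) 1+1=2
puts replace_with_plain_string(text, string, sub)    # Wait (what?&ᓴ&) 1+1=2
```

Note that both forms still pass sub as a replacement string, so any backslash sequences inside it are interpreted in either case.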
Hi,
When I used this great tool to preprocess Wikipedia dumps, I hit an infinite loop that ended with a NoMemoryError.
Example:
When we input
with the "en" option to pragmatic segmenter,
sub_4 = sub_characters(sub_3, '!', '&ᓴ&')
at https://github.com/diasks2/pragmatic_segmenter/blob/master/lib/pragmatic_segmenter/punctuation_replacer.rb#L55 causes the infinite loop. I'm wondering if we can solve this problem by escaping '\0' in the sub_characters function.
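A rough sketch of why '\0' could matter here, assuming the input contains a literal backslash followed by 0 and that the text ends up in the replacement argument of gsub. It does not reproduce the loop itself. It only shows that Ruby expands \0 in a replacement string to the whole match, which puts the original '!' back, while a replacement returned from a block is inserted verbatim. The helper names and sample text are hypothetical.

```ruby
# Hypothetical helpers; the names are illustrative, not from the library.

# Replacement passed as a string: backslash sequences such as \0 are expanded.
def sub_with_string_replacement(text, string, char_a, char_b)
  sub = string.gsub(char_a, char_b)
  text.gsub(string, sub)
end

# Replacement returned from a block: the result is inserted verbatim.
def sub_with_block_replacement(text, string, char_a, char_b)
  sub = string.gsub(char_a, char_b)
  text.gsub(string) { sub }
end

# Single quotes: \0 here is a literal backslash followed by the digit 0.
text   = 'He said "stop\0!" loudly'
string = 'stop\0!'

puts sub_with_string_replacement(text, string, '!', '&ᓴ&')
# He said "stopstop\0!&ᓴ&" loudly   (the '!' is back, the match is duplicated)

puts sub_with_block_replacement(text, string, '!', '&ᓴ&')
# He said "stop\0&ᓴ&" loudly
```

If this is indeed the trigger, using the block form (or doubling the backslashes in the replacement) might be an alternative to escaping '\0' by hand, though that is untested against the library.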
Thanks!