Issues with paragraph identification

fnl / syntok

Text tokenization and sentence segmentation (segtok v2)

MIT License


Issues with paragraph identification #4

Closed jspalink closed 5 years ago

jspalink commented 5 years ago

It looks like segmenter.__PARAGRAPH_SEP was recently changed from regex.compile("\r?\n(?:\s*\r?\n)+") to regex.compile("\r?\n(?:\\s*\r?\n)+"). I suspect that's a mistake that snuck in with commit af5d01162e6b8493d270bb78d2bf5ccd31e96f56.

fnl commented 5 years ago

That change looks correct to me. The change in the regex pattern fixes a deprecation warning when running pytest using the old version: DeprecationWarning: invalid escape sequence \s

Both the old and the new (changed) version work for me using Python 3.6 and 3.7, but the new version no longer produces the deprecation warning.
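For anyone who wants to see the warning without running the test suite, here is a minimal sketch; it is not taken from syntok itself. It compiles two one-line source strings, one with the unescaped \s and one with \\s, and records any warnings. The helper name escape_warnings and the variable names are made up for illustration. The warning category depends on the interpreter: DeprecationWarning on Python 3.6 through 3.11, SyntaxWarning on newer versions.

```python
import warnings


def escape_warnings(source):
    """Compile `source` and return the names of the warning categories raised."""
    with warnings.catch_warnings(record=True) as caught:
        warnings.simplefilter("always")  # make sure nothing is filtered out
        compile(source, "<demo>", "exec")
    return [w.category.__name__ for w in caught]


# Raw strings here, so the compiled source contains exactly these characters.
old_source = r'pattern = "\r?\n(?:\s*\r?\n)+"'   # "\s" is not a valid string escape
new_source = r'pattern = "\r?\n(?:\\s*\r?\n)+"'  # "\\s" is a valid escape for a backslash

old_categories = escape_warnings(old_source)
new_categories = escape_warnings(new_source)

print(old_categories)  # one or more warnings about the invalid escape
print(new_categories)  # no warnings
```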

That said, if you can give me a reproducible error report, I will be happy to reopen this and investigate further.

jspalink commented 5 years ago

Ah, my bad. I'm usually using the r prefix on regex strings and was looking at this thinking it's looking for a literal \.
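To make the distinction concrete, here is a small sketch of the three spellings. It uses the standard library's re module rather than the regex package that syntok imports, on the assumption that both treat this pattern the same way; the variable names and the sample text are invented for the example.

```python
import re

# In a normal string literal, "\\s" is two characters: a backslash and an "s".
# That is exactly what the raw string r"\s" contains.
assert "\\s" == r"\s"
assert len("\\s") == 2

# The pattern as it stands after the commit (normal string, escaped backslash).
escaped = "\r?\n(?:\\s*\r?\n)+"
# The spelling with the r prefix that expresses the same regular expression.
raw = r"\r?\n(?:\s*\r?\n)+"
# What the new pattern would mean if it DID have an r prefix: a literal backslash.
literal_backslash = r"\r?\n(?:\\s*\r?\n)+"

text = "First paragraph.\n\nSecond paragraph.\r\n \r\nThird."

split_escaped = re.split(escaped, text)
split_raw = re.split(raw, text)
split_literal = re.split(literal_backslash, text)

print(split_escaped)  # ['First paragraph.', 'Second paragraph.', 'Third.']
print(split_raw)      # same three paragraphs
print(split_literal)  # the text comes back unsplit, in a one-element list
```

Only the third spelling looks for a literal backslash, and that one would indeed have broken paragraph splitting.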