diasks2 / pragmatic_segmenter

Pragmatic Segmenter is a rule-based sentence boundary detection gem that works out-of-the-box across many languages.
MIT License

Ellipses and design decision #29

Open christian-storm opened 7 years ago

christian-storm commented 7 years ago

I've been testing the ellipsis rules with . . . replaced by U+2026 (…) and find that pragmatic segmenter fails when given the actual ellipsis character. I'm probably missing something, but shouldn't ellipsis.rb contain rules for the actual ellipsis character?
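For concreteness, a comparison along these lines shows what I mean; the sample sentences are invented and the exact output will depend on the gem's current rules:

```ruby
require 'pragmatic_segmenter'

# The only difference between the two inputs is the form of the ellipsis.
spaced  = "I never meant that . . . She left without another word."
unicode = "I never meant that \u2026 She left without another word."

puts PragmaticSegmenter::Segmenter.new(text: spaced).segment.inspect
puts PragmaticSegmenter::Segmenter.new(text: unicode).segment.inspect
# If ellipsis.rb only matches the spelled-out forms, the two calls can
# segment differently.
```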

This brings up a bigger question of how all the variants of symbols are covered. I notice that certain end punctuation characters are explicitly defined, e.g., U+FF1F (？) in punctuation_replacer.rb. However, there are many Unicode characters that could stand in for their ASCII equivalents, e.g., U+FE56 (﹖), U+FE16 (︖), etc. for question marks or U+2047 (⁇), U+2048 (⁈), etc. for double end punctuation, and so on for all symbols that are used in segmenting decisions, e.g., (), [], -, ., ... Chasing all these down seems like a nightmare!

Wouldn't it make sense to convert everything to ASCII, i.e., unidecode, segment, and then replace the decoded characters with their original characters? This assumes that all 'equivalent' characters carry the same meaning, but I believe they do: ፧, for example, is the Ethiopic question mark and carries the same linguistic meaning as ? does in English. If not, those cases could be the exceptions rather than the rule.

I would love to hear your thoughts.

Thanks for the great library... in my testing it performs better than spaCy, segtok, CoreNLP, and Punkt on English Wikipedia data.

christian-storm commented 7 years ago

Digging a bit deeper into this, it seems that some symbols do not map to linguistically equivalent characters, e.g., dec=172 unicode=¬ ascii=! (the mathematical NOT sign). A better approach would be to convert only the Unicode characters that influence segmentation to their ASCII equivalents, e.g., the applicable FULL STOP variants, like you already do in punctuation_replacer.rb.
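A rough sketch of what I have in mind, assuming a hand-curated (and here deliberately tiny) mapping of segmentation-relevant characters to single-character ASCII stand-ins, so that offsets stay aligned and the original characters can be restored afterwards:

```ruby
require 'pragmatic_segmenter'

# Illustrative, far-from-exhaustive mapping. Every replacement is a single
# character so offsets in the normalized text line up with the original.
SEGMENTATION_EQUIVALENTS = {
  "\u2026" => ".",  # HORIZONTAL ELLIPSIS
  "\uFF1F" => "?",  # FULLWIDTH QUESTION MARK
  "\uFE56" => "?",  # SMALL QUESTION MARK
  "\uFE16" => "?",  # PRESENTATION FORM FOR VERTICAL QUESTION MARK
  "\u1367" => "?"   # ETHIOPIC QUESTION MARK
}.freeze

def segment_preserving_originals(text)
  normalized = text.gsub(Regexp.union(SEGMENTATION_EQUIVALENTS.keys),
                         SEGMENTATION_EQUIVALENTS)
  segments = PragmaticSegmenter::Segmenter.new(text: normalized).segment
  # Map each segment back to the original characters, assuming the segments
  # come back as verbatim substrings of the normalized text.
  cursor = 0
  segments.map do |seg|
    start = normalized.index(seg, cursor)
    next seg if start.nil?
    cursor = start + seg.length
    text[start, seg.length]
  end
end
```

Whether mapping … to a single full stop is even the right call is exactly the kind of per-character decision that would have to be made for each entry.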

Thanks ahead of time for any feedback.

diasks2 commented 7 years ago

Hi Christian,

Thanks for checking out the library and for the feedback and ideas. If you could provide some sample sentences where the ellipsis handling is failing, that would be helpful. I'll add those to the test suite and update the gem.

More generally, to answer your question:

"However, there are many Unicode characters that could stand in for their ASCII equivalents, e.g., U+FE56 (﹖), U+FE16 (︖), etc. for question marks or U+2047 (⁇), U+2048 (⁈), etc. for double end punctuation and so on for all symbols that are used in segmenting decisions, e.g., (), [], -, ., ... Chasing all these down seems like a nightmare!"

My goal with this gem is/was to use it on common texts (i.e., things you would find on Wikipedia, but not necessarily things you would find on Twitter), so I only went so far down the rabbit hole. Characters that are not commonly used are not yet accounted for. I tried to be pragmatic ;-)

That said, I'm happy to accept any PR that would help the gem handle a wider range of Unicode characters that might influence segmenting decisions. Even just a list of failing test cases that I can add to the test suite and get to when I can would be helpful and appreciated.

arademaker commented 4 years ago

Do you have any paper describing the techniques used in the tool?

diasks2 commented 4 years ago

"Do you have any paper describing the techniques used in the tool?"

No. It is mainly regular expressions. The README is the best resource for information.
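To give a flavor of what "mainly regular expressions" means, here is a toy sketch of the general rule-based technique; these are not the gem's actual rules, which are far more extensive and language-specific:

```ruby
# Toy regex-based segmentation: protect periods that do not end sentences,
# split on end punctuation followed by whitespace and a capital letter,
# then restore the protected periods.
ABBREVIATIONS = /\b(Mr|Mrs|Dr|Prof|etc)\./i

def toy_segment(text)
  protected_text = text.gsub(ABBREVIATIONS) { |m| m.sub('.', "\u2024") } # ONE DOT LEADER placeholder
  protected_text
    .split(/(?<=[.!?])\s+(?=[A-Z])/)
    .map { |s| s.tr("\u2024", '.') }
end

toy_segment("Dr. Smith went to Washington. He arrived at 5 p.m. yesterday.")
# => ["Dr. Smith went to Washington.", "He arrived at 5 p.m. yesterday."]
```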