fnl / syntok

Text tokenization and sentence segmentation (segtok v2)
MIT License

Adding all bible book names as abbreviations #12

Closed jakepoz closed 2 years ago

jakepoz commented 4 years ago

We've seen a lot of bible citations in our data, and wanted to add a comprehensive list of bible book names as abbreviations. This makes syntok do a better job splitting bible citations of the form:

This is not a real quote? (Phil. 4:8) No, it's not.

fnl commented 4 years ago

Thank you, Jake!

Can I kindly ask you to restructure the added abbreviations by weaving them into the existing list of abbreviations, instead of changing their order completely? The current integration is making it impossible for me to spot which new abbreviations this change is in fact trying to add.

Also, I can already spot several tokens in this change that are not strictly abbreviations, such as Matt, Song, or Psalm. If a sentence were to end with one of those words, that sentence would not be split. Abbreviations really should only be tokens that would hardly ever be used as proper words. Otherwise, setting those abbreviations would cause missed sentence splits for many out-of-domain texts (i.e., non-bible texts, in the case of this suggested change).
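To illustrate the concern above, here is a minimal sketch of how an abbreviation list suppresses sentence splits. It is a toy splitter, not syntok's implementation; the function name `split_with_abbreviations` and the example sentences are assumptions made up for illustration.

```python
import re


def split_with_abbreviations(text, abbreviations):
    """Toy splitter (hypothetical, not syntok's logic): split after "word."
    followed by whitespace, unless "word" is a listed abbreviation."""
    sentences, start = [], 0
    for m in re.finditer(r"(\w+)\.\s+", text):
        if m.group(1) in abbreviations:
            continue  # treated as an abbreviation: no sentence boundary here
        end = m.start() + len(m.group(1)) + 1  # index just past the period
        sentences.append(text[start:end])
        start = m.end()
    if start < len(text):
        sentences.append(text[start:])
    return sentences


text = "We sang a Song. Then we left."
print(split_with_abbreviations(text, {"Phil"}))
print(split_with_abbreviations(text, {"Phil", "Song"}))
```

With only `Phil` listed, the text splits into two sentences; once `Song` is listed as well, the boundary after "Song." is lost and the text stays a single sentence.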

jakepoz commented 4 years ago

@fnl Thank you for the suggestion, just applied it.

Yeah, it's an interesting case on which things to include in this list, and which to not include. We see a pretty broad range of content come through our system, and I did notice that bible citations of this sort were often split incorrectly. They seem to always have the following form:

... is a great joy, a prized possession. (Isa. 33:6) Text continues...

But it gets split into 3 sentences:


is a great joy, a prized possession.
(Isa.
33:6) Text continues...

jakepoz commented 3 years ago

Hey @fnl Just wanted to check in and see if you'd merge this in?

fnl commented 3 years ago

Hi Jake, sorry, this had dropped off my radar. From the unit test and your comments above, I see that the cases that matter to you all follow a very regular structure, namely: /\([A-Z][a-z]*\. [0-9]+:[0-9]+\)/ In the former version of syntok (that is, in segtok), I supported avoiding sentence segmentation inside parentheses. That is also important to avoid over-splitting scientific citations, for example (F. Leitner, 2021). Preventing segmentation inside parentheses that contain short token sequences would be a more generic solution than adding domain-specific abbreviations. So my question for you is: assuming syntok would not split within parentheses containing fewer than n (user-configurable) tokens, would that resolve your segmentation issues, too?
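The proposal above can be sketched roughly as follows. This is only an illustration under stated assumptions, not syntok's or segtok's actual code: the splitter is a naive regex-based one, tokens are counted by whitespace, and the names `naive_split` and `max_paren_tokens` (standing in for the user-configurable n) are hypothetical.

```python
import re

# The citation pattern quoted in the comment above.
CITATION = re.compile(r"\([A-Z][a-z]*\. [0-9]+:[0-9]+\)")
# Any non-nested parenthesized span.
PAREN = re.compile(r"\([^()]*\)")
# Naive sentence boundary: whitespace that follows a terminal mark.
BOUNDARY = re.compile(r"(?<=[.!?])\s+")


def naive_split(text, max_paren_tokens=5):
    """Split on naive boundaries, but never inside parentheses that hold
    fewer than max_paren_tokens whitespace-separated tokens."""
    protected = [
        (m.start(), m.end())
        for m in PAREN.finditer(text)
        if len(m.group()[1:-1].split()) < max_paren_tokens
    ]
    sentences, start = [], 0
    for m in BOUNDARY.finditer(text):
        if any(lo < m.start() < hi for lo, hi in protected):
            continue  # boundary lies inside a short parenthesized span
        sentences.append(text[start:m.start()])
        start = m.end()
    if start < len(text):
        sentences.append(text[start:])
    return sentences


text = "This is not a real quote? (Phil. 4:8) No, it's not."
print(CITATION.search(text).group())
print(naive_split(text, max_paren_tokens=0))  # protection disabled
print(naive_split(text))                      # short parentheses protected
```

With protection disabled, the citation is broken after "(Phil." as in the reported problem; with it enabled, the parenthesized span stays intact. The same rule covers "(F. Leitner, 2021)" without any domain-specific abbreviation list.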

jakepoz commented 3 years ago

Interesting suggestion, what would the default value of n be?

I guess it depends on your definition of "domain-specific" :) In our case, we see a pretty wide corpus of English text, and the Bible names appeared to be a very common thing in the language.

fnl commented 2 years ago

Although this issue is old, I still think the right way to fix it is to implement #16 instead of adding more special-case tokens. Sadly, life and work haven't allowed me to work on that enhancement yet.

fnl commented 2 years ago

Hi Jake, sorry for the long silence here. It took some time for me to address this, but with the latest version bump to 1.4.1, syntok now correctly handles citations in quotes at the beginning of sentences, too ("Bible style"). See https://github.com/fnl/syntok/commit/b74e65e7396b0dc0794d9f343ad760d2f4a3f2d1 for details on how I enabled this; I even stole your test case.

Note that there is a difference from your solution, hopefully for the better. The bible citations are now segmented as wholly separate sentences, as they semantically do not belong to the following sentence. They would belong to the preceding sentence if anything, but that is a bit too annoying to solve, so simply treating them as stand-alone sentences made the most sense.
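As a rough illustration of the stand-alone treatment described above, the sketch below post-processes a list of already segmented sentences and detaches a leading citation into its own sentence. It does not reproduce the linked commit; the helper name `isolate_leading_citations` and the input list are assumptions for illustration only.

```python
import re

# A citation of the form "(Phil. 4:8)" at the start of a sentence,
# followed by whitespace and at least one more character.
LEADING_CITATION = re.compile(r"(\([A-Z][a-z]*\. [0-9]+:[0-9]+\))\s+(?=\S)")


def isolate_leading_citations(sentences):
    """Hypothetical helper: emit a leading citation as its own sentence."""
    result = []
    for sentence in sentences:
        match = LEADING_CITATION.match(sentence)
        if match:
            result.append(match.group(1))           # the citation alone
            result.append(sentence[match.end():])   # the rest of the sentence
        else:
            result.append(sentence)
    return result


segmented = ["This is not a real quote?", "(Phil. 4:8) No, it's not."]
print(isolate_leading_citations(segmented))
```

The citation ends up as a separate item between the sentence it refers to and the sentence that follows it.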

jakepoz commented 2 years ago

Thank you, that makes a lot of sense!