fnl / syntok

Text tokenization and sentence segmentation (segtok v2)
MIT License

Segmenting sentences at colons #9

Open fhamborg opened 4 years ago

fhamborg commented 4 years ago

For example, the following snippet is extracted as one single sentence (ending at the last full stop), but it should perhaps be split at the colons.

Here they “warn” anyone who opposes his radical ideology:
Four police officers were sent to hospital:
Violence against police officers is not only acceptable with Bernie Sanders and Black Lives Matter terrorists, its necessary to create chaos and panic:
What kind of violent protest would be complete without Barack Obama’s good friend, domestic terrorist Bill Ayers:
It’s probably just a coincidence that on a day that Obama was too busy to attend Nancy Reagan’s funeral, he was able to address a crowd about his hate for Trump only hours before this organized chaos in Chicago:
And finally, we’re wondering how much our Organizer In Chief had to do with this Alinsky style chaos in Chicago:
Illegal aliens, paid Soros protesters, angry Black Lives Matter terrorists inspired by Obama’s race war and Bernie Sanders supporters who have absolutely no idea why they showed up, sent four innocent police officers to the hospital; prevented thousands of innocent Americans from exercising their First Amendment right.

Is this intentional? Is there a way to force splitting at colons? Besides this extreme example, I think I have come across many cases where syntok did not split at colons.

fnl commented 4 years ago

Thank you, Felix, for bringing this up; a valid feature request. Colon (and semi-colon) handling is indeed a bit of a borderline affair, and technically they are sentence separators. It might make sense to support that, but I need to think about it a bit more. I'd also love to hear feedback/opinions from other users about this.

[Correcting the title and adding labels.]

fhamborg commented 4 years ago

Yes, I agree: whether segmentation at colons and semicolons is sensible likely also depends on the text domain. Looking at the Wikipedia definition of each, one finds that both have cases where segmentation would be required and others where it would not.

E.g., for semicolon (cf. Wikipedia): "The semicolon or semi-colon (;) is a punctuation mark that separates major sentence elements. A semicolon can be used between two closely related independent clauses, provided they are not already joined by a coordinating conjunction. Semicolons can also be used in place of commas to separate the items in a list, particularly when the elements of that list contain commas."

Yet, at least for the colon, I found that nltk and CoreNLP actually do perform segmentation more often than not (if not always?).

nmstoker commented 4 years ago

My two cents: those examples aren't really separate sentences because of the colons; they're separate sentences due to their content, and they just happen to have the (very odd) colons at the end. It's not normal English usage to end a sentence with a colon; in fact, it actively implies some following content. Therefore I would tend not to expect a split on a colon and would prefer that this be left to people to deal with if there are special cases in their particular text source.

However, with a semi-colon I would be more open to the idea that they can be treated as separate sentences. It's not uncommon for editors looking to simplify text to turn such cases into two (or more) distinct sentences and it would be less surprising here than it would be with the colon case.
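As a rough illustration of handling the colon case on the user side: the sketch below post-processes sentence strings that a segmenter has already produced and splits them again wherever a colon is directly followed by a line break, as in the example at the top of this thread. The helper name `split_at_line_final_colons` is hypothetical and not part of syntok, and the approach assumes the line breaks are still present in each sentence string.

```python
import re
from typing import List

# Hypothetical helper, NOT part of syntok: a user-side post-processing step.
# Matches the whitespace after a colon only when it contains a line break,
# so colons in running text ("Note: ...") are left alone.
_COLON_AT_LINE_END = re.compile(r"(?<=:)[ \t]*\n\s*")


def split_at_line_final_colons(sentence: str) -> List[str]:
    """Split one already-segmented sentence at colons that end a line."""
    parts = _COLON_AT_LINE_END.split(sentence)
    # Drop empty fragments and trim surrounding whitespace.
    return [part.strip() for part in parts if part.strip()]


if __name__ == "__main__":
    text = "They warn anyone:\nFour officers were hurt:\nNote: this stays. Done."
    for piece in split_at_line_final_colons(text):
        print(piece)
```

Restricting the split to colons at the end of a line keeps ordinary mid-sentence colons intact. Whether that heuristic is appropriate depends on the text source.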

fnl commented 4 years ago

In general, libraries such as nltk and CoreNLP tend to severely over-split, which was the major reason for me to come up with my own. Hence, I agree, adding semicolons as potential markers could be interesting, while it seems unwise to elevate colons to official markers, too.

fhamborg commented 4 years ago

"Hence, I agree, adding semicolons as potential markers could be interesting, while it seems unwise to elevate colons to official markers, too."

This seems feasible to me.

fnl commented 4 years ago

Release 1.3.1 now supports semi-colon segmentation.

I will leave this ticket open, however, as it was specifically about segmenting at colons.