WPBuddy / largo

A WordPress framework for news websites. Finely-crafted by INN and expertly-honed and maintained by the technology team at WP Buddy.
http://largo.wpbuddy.co
GNU General Public License v2.0

largo_trim_sentences doesn't detect "period, non-breaking space, space" as end of sentence #1138

Open benlk opened 8 years ago

benlk commented 8 years ago

https://github.com/INN/Largo/blob/2db3552e3a523e44d92f64bb88ceaa173d48a26c/inc/post-tags.php#L520-L564

The sequence period, non-breaking space, space can occur when users type two spaces after the period at the end of a sentence.

If Largo is trying to build an excerpt n sentences long, that period, non-breaking space, space sequence will not be detected as the end of a sentence. Here's a five-sentence-long 2-'sentence' excerpt:

Twenty-five years ago I was a former high school teacher who was firmly in charge of the family-run Hyde Park supermarket that I had tried to flee as a young man.  I was involved in community projects but there was nothing tugging at me more than our community’s public schools.  Local School Councils had just started, and then Catalyst, a small but interesting publication of the Community Renewal Society. In my role as Mr. G., I felt I was in the center of all things public schools.  I employed local high school students, and had a large number of CPS employees as customers, including some future CPS and CTU leaders.  I got the bug and got myself elected to the Kenwood LSC — second largest vote tally in the city I might add — but I was out of touch with what was really going on in public education. 
benlk commented 8 years ago

Other things not detected:

Note that a period followed by ordinary spaces is detected, because it matches /\.\s+/
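
To make the difference concrete, here is a minimal sketch in Python rather than Largo's PHP. It assumes the check behaves like a byte-oriented match of /\.\s+/, where \s covers only ASCII whitespace; that assumption, and the helper name count_sentence_ends, are illustrative and not taken from Largo's code.

```python
import re

# Assumption: the check is byte-oriented, so \s matches only ASCII whitespace.
# Python's bytes patterns behave this way, which is why bytes are used here.
SENTENCE_END = re.compile(rb"\.\s+")

NBSP = "\u00a0"  # non-breaking space, encoded as the bytes C2 A0 in UTF-8


def count_sentence_ends(text):
    """Count period-plus-whitespace matches in the UTF-8 bytes of text."""
    return len(SENTENCE_END.findall(text.encode("utf-8")))


plain = "One.  Two.  Three."             # period, space, space
mixed = f"One.{NBSP} Two.{NBSP} Three."  # period, NBSP, space

print(count_sentence_ends(plain))  # 2
print(count_sentence_ends(mixed))  # 0
```

In the second string the byte right after each period is the first byte of the non-breaking space, so the pattern never gets a chance to match the ordinary space that follows.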

benlk commented 8 years ago

The regex should check that the letter after the whitespace is uppercase. That's a good indication of the start of a sentence, in English at least.
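
A rough sketch of that idea, in Python for illustration only (Largo is PHP, and the names SENTENCE_BREAK and split_sentences are hypothetical). It treats the non-breaking space as whitespace and only breaks when the next character is an uppercase ASCII letter, which is an English-only simplification.

```python
import re

# Break after a period when the run of whitespace (including U+00A0, the
# non-breaking space) is followed by an uppercase ASCII letter.
SENTENCE_BREAK = re.compile(r"(?<=\.)[ \t\r\n\u00a0]+(?=[A-Z])")


def split_sentences(text):
    """Split text into candidate sentences using the uppercase heuristic."""
    return SENTENCE_BREAK.split(text)


print(split_sentences("It rained.\u00a0 We left."))
# ['It rained.', 'We left.']

print(split_sentences("See p. 4 for details."))
# ['See p. 4 for details.']

print(split_sentences("Mr. G. spoke."))
# ['Mr.', 'G. spoke.']
```

The last case shows the limit of the heuristic: an abbreviation followed by a capitalized name, such as "Mr. G." in the excerpt above, still looks like a sentence boundary.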

benlk commented 8 years ago

This should probably include tests for largo_trim_sentences.
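
As a starting point, the cases from this thread could be written down roughly like this. The sketch is in Python with a hypothetical stand-in named trim_sentences; it is not Largo's PHP function, and real tests would target largo_trim_sentences in the project's own test suite.

```python
import re
import unittest

NBSP = "\u00a0"  # non-breaking space


def trim_sentences(text, n):
    """Hypothetical stand-in: return the first n sentences of text."""
    # Split after a period on any run of spaces, tabs, newlines, or NBSPs.
    parts = re.split(r"(?<=\.)[ \t\r\n\u00a0]+", text.strip())
    return " ".join(parts[:n])


class TrimSentencesTest(unittest.TestCase):
    def test_single_space(self):
        self.assertEqual(trim_sentences("One. Two. Three.", 2), "One. Two.")

    def test_double_space(self):
        self.assertEqual(trim_sentences("One.  Two.  Three.", 2), "One. Two.")

    def test_nbsp_then_space(self):
        text = f"One.{NBSP} Two.{NBSP} Three."
        self.assertEqual(trim_sentences(text, 2), "One. Two.")

    def test_fewer_sentences_than_requested(self):
        self.assertEqual(trim_sentences("Only one.", 3), "Only one.")
```

Saved as a file, this runs with python -m unittest. The non-breaking-space case is the one that reproduces this issue.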

benlk commented 8 years ago

And at the end of this project, we might write a "Things programmers assume about sentences" post, which should include: