Open MichaelFenwick opened 3 years ago
Some thoughts on how to handle this:
It's too hard to identify apostrophes directly and then treat whatever remains as a single quote. Beyond plural possessives (the boys' game) looking like the end of a single-quoted sequence, apostrophes are also used for indicating accents (stormin', ol') in a way that isn't readily identifiable and that can't reliably be expected to match a dictionary entry. As such, the unambiguous punctuation needs to be removed from the sentence first, and then some logic can be used to predict whether each character that remains is an apostrophe or a closing single quote (CSQ).
- Check whether a `[’']` character is between two letters, and if so exclude it as a potential closing single quote (PCSQ).
- Look for `[’']` characters which precede the first `'` (opening single quote (OSQ) character). If OSQs are used, then anything before the first one is an apostrophe and not a CSQ.
- Count the `[’']` characters in the sentence. If only one is found, that must also be a contraction, and thus an apostrophe.
- Check whether each one follows `s` (likely plural possessive) or `in` (likely contraction of `ing`). If all but one in the set fits this, consider that one a CSQ, and the others apostrophes. (`'n'` is a notable exception to this.)
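
As a rough illustration, the checks above could be sketched in Python along these lines. This is only a sketch under assumptions, not a worked-out implementation: the function names and the `'apostrophe'` / `'csq'` / `'unknown'` labels are made up for the example, the OSQ is assumed to be U+2018 (`‘`), and the "only one is found" rule is applied only when the sentence has no OSQ at all, which is one possible reading of that rule. The `'n'` exception is not handled.

```python
from __future__ import annotations

CANDIDATES = "\u2019'"  # marks that may be an apostrophe or a closing single quote
OSQ = "\u2018"          # assumption: opening single quotes are U+2018


def follows_s_or_in(sentence: str, i: int) -> bool:
    """True if the mark at index i comes directly after 's' or 'in'."""
    before = sentence[:i].lower()
    return before.endswith("s") or before.endswith("in")


def classify_marks(sentence: str) -> dict[int, str]:
    """Map each candidate index to 'apostrophe', 'csq', or 'unknown'."""
    labels: dict[int, str] = {}
    pending: list[int] = []
    first_osq = sentence.find(OSQ)  # -1 when the sentence has no OSQ

    for i, ch in enumerate(sentence):
        if ch not in CANDIDATES:
            continue
        prev_is_letter = i > 0 and sentence[i - 1].isalpha()
        next_is_letter = i + 1 < len(sentence) and sentence[i + 1].isalpha()
        if prev_is_letter and next_is_letter:
            labels[i] = "apostrophe"  # between two letters, e.g. don't
        elif first_osq != -1 and i < first_osq:
            labels[i] = "apostrophe"  # before the first OSQ, nothing to close
        else:
            pending.append(i)

    # Assumed reading: with no OSQ, a lone leftover mark cannot close a quote.
    if first_osq == -1 and len(pending) == 1:
        labels[pending[0]] = "apostrophe"
        return labels

    odd_ones = [i for i in pending if not follows_s_or_in(sentence, i)]
    if len(odd_ones) == 1:
        # All but one follow 's' or 'in': the odd one out is taken as the CSQ.
        for i in pending:
            labels[i] = "csq" if i == odd_ones[0] else "apostrophe"
    else:
        for i in pending:
            labels[i] = "unknown"  # the heuristics do not settle these
    return labels
```

For a sentence such as `‘The boys’ game was stormin’ along,’ she said.`, this labels the marks after `boys` and `stormin` as apostrophes and the one after `along,` as the CSQ. Anything the checks cannot settle is left as `'unknown'` rather than guessed.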
Single quotes need to be identified and isolated from apostrophes. They can then be replaced with the appropriate Unicode single quote characters, while apostrophes remain the apostrophe character (or are changed from a closing single quote to an apostrophe if they started as one).
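
Once each mark has a label, the replacement step itself is simple. A minimal sketch, assuming a hypothetical `labels` mapping from character index to `'apostrophe'` or `'csq'` (the mapping format and the function name are invented for this example), and assuming U+0027 is the apostrophe to keep and U+2019 is the closing single quote:

```python
from __future__ import annotations

APOSTROPHE = "'"          # U+0027, kept (or restored) for apostrophes
CLOSING_QUOTE = "\u2019"  # U+2019 RIGHT SINGLE QUOTATION MARK


def apply_labels(sentence: str, labels: dict[int, str]) -> str:
    """Rewrite labelled marks; any other label leaves the character as it was."""
    chars = list(sentence)
    for i, label in labels.items():
        if label == "apostrophe":
            chars[i] = APOSTROPHE      # also converts a ’ that was really an apostrophe
        elif label == "csq":
            chars[i] = CLOSING_QUOTE   # also converts a straight ' that closes a quote
    return "".join(chars)
```

Replacing by index keeps the sentence length unchanged, so labels computed on the original sentence stay valid while the marks are rewritten.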