avian2 / unidecode

ASCII transliterations of Unicode text - GitHub mirror
https://pypi.python.org/pypi/Unidecode
GNU General Public License v2.0

Handling of conversions near punctuation #81

Closed RichardForshaw closed 1 year ago

RichardForshaw commented 1 year ago

I recently upgraded unidecode and saw a failing test.

The test in question:

"Pickup 65” TV from Platform 9¾, Kingʹs Cross Station."

The result:

- Pickup 65" TV from Platform 93/4, King's Cross Station.
+ Pickup 65" TV from Platform 9 3/4 , King's Cross Station.

I think that separating the 9 from the 3/4 is a good idea, since it distinguishes the result from 93 / 4 (which is not what the original means). However, a space is also placed between the 3/4 and the comma, which does not read well.

Not a major issue, but probably something that will bug people.

avian2 commented 1 year ago

The extra space was introduced in b8af43612f7150a0af181ee14682bdb5b9a8359d to prevent fractions from merging with adjacent numbers.
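
For readers following along, the merging problem can be sketched roughly as below. This is only an illustration: the `BARE` and `PADDED` tables and the `transliterate` helper are hypothetical stand-ins, not Unidecode's actual data or code.

```python
# Hypothetical per-character replacement tables; NOT Unidecode's real data.
BARE = {"¾": "3/4", "½": "1/2", "¼": "1/4"}
# Same replacements, padded with a space on both sides.
PADDED = {ch: f" {text} " for ch, text in BARE.items()}


def transliterate(text, table):
    """Replace each character found in `table`; leave everything else as-is."""
    return "".join(table.get(ch, ch) for ch in text)


sample = "Platform 9¾, King's Cross"

# Without padding, the fraction fuses with the preceding digit.
merged = transliterate(sample, BARE)
# With padding, the digit is separated, but a space also lands before the comma.
padded = transliterate(sample, PADDED)

print(merged)  # Platform 93/4, King's Cross
print(padded)  # Platform 9 3/4 , King's Cross
```

The second line of output shows the trailing space this issue is about.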

RichardForshaw commented 1 year ago

Yes, I think that adding the extra space at the start to prevent the merging is good, but I wonder if the extra trailing space is needed? I can't currently think of any examples where something placed directly after a fraction such as ¾ would require a separating space. I expect other trailing numbers would already have a space in the original string. (Preceding numbers may not, in which case I agree that introducing a space there is a good thing.)

avian2 commented 1 year ago

@IamJeffG Since you contributed the commit that added the spaces, do you have any objection to removing the trailing space?

IamJeffG commented 1 year ago

I have no objection to that change. I do often deal with ranges like "¼–½", but I'm as happy to receive "1/4-1/2" as "1/4 - 1/2".

If anyone out there is parsing strings like "½¾" or "¾9", they would view the change as a regression, but that seems very unlikely. In those cases I'm not even sure we can intuit what the expected behavior ought to be.

avian2 commented 1 year ago

Thank you both for your comments. I'm removing the trailing space in the replacements for fractions. I'll be releasing a new version of Unidecode with this change shortly.
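
A minimal sketch of what a leading-space-only replacement would do, assuming a hypothetical `LEADING_ONLY` table and `transliterate` helper (neither is Unidecode's actual implementation, and the released behavior may differ in detail):

```python
# Hypothetical table: each fraction gets a leading space but no trailing one.
# This is an illustration only, NOT Unidecode's real replacement data.
LEADING_ONLY = {"¾": " 3/4", "½": " 1/2", "¼": " 1/4"}


def transliterate(text, table):
    """Replace each character found in `table`; leave everything else as-is."""
    return "".join(table.get(ch, ch) for ch in text)


# The case from this issue: the digit stays separated, the comma stays attached.
fixed = transliterate("Platform 9¾, King's Cross", LEADING_ONLY)
# The edge case mentioned above: a digit directly after a fraction now fuses.
adjacent = transliterate("¾9", LEADING_ONLY)

print(repr(fixed))     # "Platform 9 3/4, King's Cross"
print(repr(adjacent))  # ' 3/49'
```

The second result is the kind of input discussed earlier in the thread, where a number directly follows a fraction with no space in the original string.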