avian2 / unidecode

ASCII transliterations of Unicode text - GitHub mirror
https://pypi.python.org/pypi/Unidecode
GNU General Public License v2.0
516 stars, 62 forks

Added 'errors' flag with options such as 'replace' #53

Closed bob1sparks closed 3 years ago

bob1sparks commented 4 years ago

I added an optional errors flag parameter so that the caller has some control.

E.g. buf = unidecode.unidecode(textvalue, errors='preserve')

where:

- 'ignore': characters are dropped if no replacements are found in the tables (default)
- 'strict': an exception is thrown if no value is found in the tables
- 'replace': a ? is substituted if no replacement is found
- 'preserve': the existing Unicode character is kept

'ignore' is the default, so behavior is unchanged for existing projects.

This is inspired by Python's built-in bytes.decode.
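For comparison, here is a small sketch of how the standard library's `bytes.decode` treats its own `errors` argument. It uses only built-in calls; the sample string is an illustration and does not come from the pull request. Note that `bytes.decode` has no 'preserve' mode, so that option is specific to this proposal.

```python
# UTF-8 bytes for "café"; the last character is two bytes that ASCII cannot decode.
raw = "caf\u00e9".encode("utf-8")

# errors='ignore' silently drops the undecodable bytes.
ignored = raw.decode("ascii", errors="ignore")

# errors='replace' substitutes U+FFFD for each undecodable byte.
replaced = raw.decode("ascii", errors="replace")

# The default, errors='strict', raises an exception instead.
try:
    raw.decode("ascii")
    strict_failed = False
except UnicodeDecodeError:
    strict_failed = True

print(repr(ignored), repr(replaced), strict_failed)
```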

Why this change was required:

My immediate use case was to leave any characters not found in the translation tables in place rather than drop them. I am normalizing spam text such as phařm pꞧoduct to pharm product. In this case, since the homoglyph 'r' in product is not found, I want to leave it as pharm pꞧoduct rather than pharm poduct. The human reader makes the final adjustment in this case.
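The four modes can be sketched with a toy lookup table. This is only an illustration of the behavior described above, not the code from the pull request or from Unidecode itself: `TOY_TABLE`, `toy_unidecode` and `NoReplacementError` are hypothetical names, and the table holds a single made-up entry.

```python
# Toy stand-in for the transliteration tables (hypothetical, NOT Unidecode's data).
# U+0159 is "r with caron"; U+A7A7 ("r with oblique stroke") is deliberately absent.
TOY_TABLE = {"\u0159": "r"}


class NoReplacementError(ValueError):
    """Hypothetical exception raised by errors='strict'."""


def toy_unidecode(text, errors="ignore"):
    """Transliterate text with TOY_TABLE, handling misses according to errors."""
    out = []
    for ch in text:
        if ord(ch) < 128:
            out.append(ch)              # already ASCII, keep as is
        elif ch in TOY_TABLE:
            out.append(TOY_TABLE[ch])   # known replacement
        elif errors == "ignore":
            continue                    # drop the character (default)
        elif errors == "strict":
            raise NoReplacementError(f"no replacement for {ch!r}")
        elif errors == "replace":
            out.append("?")
        elif errors == "preserve":
            out.append(ch)              # keep the original character
        else:
            raise ValueError(f"unknown errors mode: {errors!r}")
    return "".join(out)


spam = "pha\u0159m p\ua7a7oduct"
for mode in ("ignore", "replace", "preserve"):
    print(mode, toy_unidecode(spam, errors=mode))
```

With 'ignore' the unmapped homoglyph disappears ("pharm poduct"), while 'preserve' leaves it in place for a human reader to correct.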

In the future I hope to add a training facility to customize the code tables.

avian2 commented 4 years ago

Hi

thanks for submitting a pull request. I have a few problems with it however:

bob1sparks commented 4 years ago

Tomaž,

Thank you for considering my pull request. It was certainly a pleasure to read and use your project. Please forgive me for not understanding the process, especially the tests. Although, in hindsight, it's not forgivable.

If you ever want to support named arguments such as those in the standard library, I would be pleased to help out. I will try linking the standard library and yours as you suggested.

Thanks again for sharing your insight.

Bob

On 29-11-2019 06:44 (GMT-05:00), Tomaž Šolc wrote:

Hi thanks for submitting a pull request. I have a few problems with it however:

- It breaks existing tests (see Travis log).
- It also adds new code without adding tests to cover it and doesn't update documentation.
- I think this functionality doesn't belong in the unidecode() function. First, it makes the function return conceptually different things depending on the errors parameter (which is badly named in this context: why are non-ASCII characters called "errors"?). The function now returns either an ASCII string or a Unicode string. On Python 2 these would actually be different Python types.
- I think it would be better not to try to replicate standard library behavior. A cleaner way would be perhaps to integrate with it, for example through the codecs.register_error mechanism (ref). Through this interface it would be possible to do for example string.encode('ASCII', errors='unidecode').
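A minimal sketch of the `codecs.register_error` mechanism mentioned above, using only the standard library. The handler name `toy_translit` and the lookup table are hypothetical; Unidecode does not register such a handler as far as this thread shows.

```python
import codecs

# Toy replacement table (hypothetical, NOT Unidecode's data).
TOY_TABLE = {"\u0159": "r", "\u00e9": "e"}


def toy_translit_errors(exc):
    """Error handler: transliterate the characters the codec could not encode."""
    if not isinstance(exc, UnicodeEncodeError):
        raise exc
    bad = exc.object[exc.start:exc.end]
    # Characters missing from the table fall back to "?".
    replacement = "".join(TOY_TABLE.get(ch, "?") for ch in bad)
    # Return the replacement text and the position where encoding resumes.
    return replacement, exc.end


codecs.register_error("toy_translit", toy_translit_errors)

encoded = "pha\u0159m caf\u00e9".encode("ascii", errors="toy_translit")
print(encoded)
```

The result of `str.encode` is always `bytes`, which is one way this route differs from a function that returns text.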


avian2 commented 3 years ago

I'm tempted to close this pull request since my recent commit implemented the functionality you proposed here.

I'm less hesitant to include the errors='preserve' option now since Python 2 is very much EOL. In Python 3 Unidecode always returns a string object so my objection that the type of the return value would change with this option is no longer relevant.
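The point about return types can be checked directly in Python 3. A short illustration with made-up sample values:

```python
# In Python 3, ASCII-only text and text with preserved non-ASCII characters
# are both plain str objects; only their contents differ.
ascii_only = "pharm product"
preserved = "pharm p\ua7a7oduct"

same_type = type(ascii_only) is str and type(preserved) is str
print(same_type, ascii_only.isascii(), preserved.isascii())
```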

I also made some effort to make a distinction in the tables between the case of "replacement is unknown" and "replace with empty string" - it's not perfect and I'm sure in the future there will be more fixes necessary to the tables in this regard. But I think it's a good start for now.
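One way to picture that distinction is a table that stores an empty string for "replace with nothing" and `None` for "replacement unknown". This is a sketch under that assumption, with hypothetical names and toy entries; it is not a copy of Unidecode's tables.

```python
# Toy table (hypothetical): "" means "intentionally drop", None means "unknown".
TOY_TABLE = {
    "\u0159": "r",    # known replacement
    "\u0301": "",     # combining acute accent: deliberately maps to nothing
    "\ua7a7": None,   # no known replacement
}


def toy_lookup(ch, errors="ignore"):
    """Return the replacement for ch; only unknown entries consult errors."""
    repl = TOY_TABLE.get(ch)  # characters missing from the table count as unknown
    if repl is not None:
        return repl           # includes the intentional empty string
    return ch if errors == "preserve" else ""


print(repr(toy_lookup("\u0301", "preserve")), repr(toy_lookup("\ua7a7", "preserve")))
```

With this split, 'preserve' keeps only the characters whose replacement is genuinely unknown, while entries that are meant to vanish still vanish.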

Regarding my register_error comment, I think I misunderstood what you were trying to accomplish. It seems wrong to me now that I suggested that. Sorry.

bob1sparks commented 3 years ago

Thanks for this update. We have been using a forked version. I wrote a utility program to read a text file and check whether each character had a unidecode replacement. If not, it would let the user press a key for the new replacement and then write it to the unidecode configuration file. I needed this as there were lots of scam emails with specific respellings that our users required to be translated. If that's of interest, I can clean it up and send it along.

Please trash my pull request and I will "pull" the latest code.

Once again, thank you for this elegant code that solves a problem we have fighting crime.

Bob
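The utility itself is not shown in the thread. As a rough, non-interactive sketch of the idea, the following collects characters that lack a replacement and merges user-chosen replacements into a table; every name and the table format here are assumptions.

```python
def find_unmapped(text, table):
    """Return non-ASCII characters in text with no entry in table, in first-seen order."""
    seen = []
    for ch in text:
        if ord(ch) >= 128 and ch not in table and ch not in seen:
            seen.append(ch)
    return seen


def add_replacements(table, choices):
    """Return a new table extended with the replacements a user picked."""
    updated = dict(table)
    updated.update(choices)
    return updated


table = {"\u0159": "r"}                 # toy table, not Unidecode's data
text = "pha\u0159m p\ua7a7oduct p\ua7a7ice"
missing = find_unmapped(text, table)    # characters a user would be asked about
table = add_replacements(table, {ch: "r" for ch in missing})
print(missing, sorted(table))
```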


avian2 commented 3 years ago

I'm closing this then. If you want to contribute new character replacements, please open a new pull request. Thanks.