berniey / hanziconv

Hanzi Converter for Traditional and Simplified Chinese

了 usually shouldn't turn into 瞭 when converting to traditional #8

Closed suzil closed 4 years ago

suzil commented 7 years ago

他感冒了 converts to 他感冒瞭 instead of 他感冒了 when converting to traditional.
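
A minimal sketch of why this can happen, assuming the converter looks up one character at a time with no regard for context. The two-entry table is a toy stand-in, not the library's real character map, and `naive_to_traditional` is a hypothetical name.

```python
# Toy simplified -> traditional table (illustrative only).
# 了 is listed as the simplified form of 瞭, so a per-character lookup
# rewrites every 了, including the grammatical particle.
TOY_SIMP_TO_TRAD = {"了": "瞭", "荣": "榮"}


def naive_to_traditional(text):
    """Convert character by character, ignoring the surrounding words."""
    return "".join(TOY_SIMP_TO_TRAD.get(ch, ch) for ch in text)


print(naive_to_traditional("他感冒了"))  # prints 他感冒瞭
```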

ericmjl commented 7 years ago

@suzil @berniey I have found the same thing happening on my projects.

I think a few Chinese-speaking communities use a restricted subset of traditional characters; for example, our church does not convert the character 祢.

Is there a path towards using a custom subset of characters, say, by specifying the ones we don't want converted? I'm happy to hack on a PR to make this happen. As the creator of the package, @berniey, I'd like to hear your thoughts on the right path, so that any PR is easy for you to maintain.

berniey commented 7 years ago

Sorry for the late reply, folks.

Using a custom charset is actually a great idea, Eric :). I will first try to validate and fix the issue reported by Susannah in the next few days.

My original roadmap for the package was to do content-based conversion, but I never had the chance to finish it. Your suggestion to use a custom charset would serve as a good middle ground, as users can install a customized fix for their own situation. Some points below for consideration:

Regards, Bernard

On Sat, Aug 19, 2017 at 1:40 PM, Eric Ma notifications@github.com wrote:


ericmjl commented 7 years ago

@berniey Thanks for the reply!

In terms of supplying a custom charset, I had some "hacky" ideas and would love to get your input on them.

Looking at the class definition for HanziConv in hanziconv.py, I was thinking of the following (pending your approval, of course):

Proposal Part 1: Change the charmap from two strings to two dictionaries. This lets end users supply a custom map through the dictionary .update(d) method.

What this will look like might be the following:

In charmap.py Line 169, instead of doing:

simplified_charmap = cuhk_simplified + extra_simplified
traditional_charmap = cuhk_traditional + extra_traditional

We change this to:

simplified = cuhk_simplified + extra_simplified
traditional = cuhk_traditional + extra_traditional

# zip(simplified, traditional) yields (simplified_char, traditional_char) pairs
simp_to_trad = {s: t for s, t in zip(simplified, traditional)}
trad_to_simp = {t: s for s, t in zip(simplified, traditional)}
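
To illustrate Part 1, here is a small self-contained sketch. The two short strings are made-up stand-ins for the real `cuhk_*` and `extra_*` data in charmap.py, and `custom` and `user_map` are hypothetical names for a user-supplied override.

```python
# Stand-ins for the parallel strings in charmap.py (not the real data).
simplified = "荣了"
traditional = "榮瞭"

# Characters at the same index correspond to each other.
simp_to_trad = dict(zip(simplified, traditional))
trad_to_simp = dict(zip(traditional, simplified))

# A user override: keep 了 as 了 when converting to traditional.
custom = {"了": "了"}
user_map = dict(simp_to_trad)  # copy, so the default map stays untouched
user_map.update(custom)

print(user_map["了"])      # 了
print(simp_to_trad["了"])  # 瞭
```

One thing to check before switching: if a character appears more than once in a source string, building a dict this way keeps the last pairing, whereas a str.find lookup returns the first. Whether that matters depends on the real charmap data.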

Proposal Part 2: Change the underlying implementation of conversion in hanziconv.py Line 51.

It currently is:

from .charmap import simplified_charmap, traditional_charmap

class HanziConv(object):
    """This class supports hanzi (漢字) convention between simplified and
    traditional format"""
    __traditional_charmap = traditional_charmap
    __simplified_charmap = simplified_charmap

    @classmethod
    def __convert(cls, text, toTraditional=True):
        """Convert `text` to Traditional characters if `toTraditional` is
        True, else convert to simplified characters
        :param text:           data to convert
        :param toTraditional:  True -- convert to traditional text
                               False -- covert to simplified text
        :returns:              converted 'text`
        """
        if isinstance(text, bytes):
            text = text.decode('utf-8')

        fromMap = cls.__simplified_charmap
        toMap = cls.__traditional_charmap
        if not toTraditional:
            fromMap = cls.__traditional_charmap
            toMap = cls.__simplified_charmap

        final = []
        for c in text:
            index = fromMap.find(c)
            if index != -1:
                final.append(toMap[index])
            else:
                final.append(c)
        return ''.join(final)
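
For reference, the index-based lookup above can be tried in isolation. This is a sketch with two toy strings standing in for the real charmaps, and `convert_by_index` is a hypothetical standalone name, not part of the library.

```python
# Toy stand-ins for the parallel charmap strings (not the real data).
simplified_charmap = "荣了"
traditional_charmap = "榮瞭"


def convert_by_index(text, to_traditional=True):
    """Look each character up by its position in the source string."""
    from_map, to_map = simplified_charmap, traditional_charmap
    if not to_traditional:
        from_map, to_map = to_map, from_map
    out = []
    for ch in text:
        index = from_map.find(ch)  # linear scan; -1 if absent
        out.append(to_map[index] if index != -1 else ch)
    return "".join(out)


print(convert_by_index("我的荣耀"))         # 我的榮耀
print(convert_by_index("我的榮耀", False))  # 我的荣耀
```

Each lookup scans the source string from the start, so the cost per character grows with the size of the charmap.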

I propose changing it to:

from .charmap import simp_to_trad, trad_to_simp  # we have two dictionaries.

class HanziConv(object):
    """This class supports hanzi (漢字) convention between simplified and
    traditional format"""
    # The two string charmap attributes are dropped; the dictionaries are used directly.

    @classmethod
    def __convert(cls, text, toTraditional=True):
        """Convert `text` to Traditional characters if `toTraditional` is
        True, else convert to simplified characters
        :param text:           data to convert
        :param toTraditional:  True -- convert to traditional text
                               False -- covert to simplified text
        :returns:              converted 'text`
        """
        if isinstance(text, bytes):
            text = text.decode('utf-8')

        mapper = simp_to_trad  # change made here
        if not toTraditional:
            mapper = trad_to_simp  # change made here

        final = []
        for c in text:
            # Fall back to the original character when there is no mapping.
            final.append(mapper.get(c, c))  # change made here
        return ''.join(final)
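
As a possible alternative on Python 3 only, the built-in `str.maketrans` and `str.translate` can do the same per-character substitution straight from the two parallel strings. This is a sketch under that assumption, with toy strings in place of the real charmaps; `SIMP_TO_TRAD_TABLE` and `to_traditional` are hypothetical names.

```python
# Toy stand-ins for the parallel charmap strings (not the real data).
simplified = "荣了"
traditional = "榮瞭"

# str.maketrans builds a code point -> code point table from two
# equal-length strings; str.translate applies it in a single pass.
SIMP_TO_TRAD_TABLE = str.maketrans(simplified, traditional)


def to_traditional(text):
    """Convert simplified characters using a precomputed translation table."""
    if isinstance(text, bytes):
        text = text.decode("utf-8")
    return text.translate(SIMP_TO_TRAD_TABLE)


print(to_traditional("我的荣耀"))  # 我的榮耀
```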

Finally, end users can supply a custom list of characters that they would like to preserve (i.e. not converted). Modifying the above code block, here's one way I was thinking of doing it (but as usual, open to ideas for changing):

from .charmap import simp_to_trad, trad_to_simp  # change made here

class HanziConv(object):
    """This class supports hanzi (漢字) convention between simplified and
    traditional format"""
    # The two string charmap attributes are dropped; the dictionaries are used directly.

    @classmethod
    def __convert(cls, text, toTraditional=True, preserve=None):
        """Convert `text` to Traditional characters if `toTraditional` is
        True, else convert to simplified characters
        :param text:           data to convert
        :param toTraditional:  True -- convert to traditional text
                               False -- convert to simplified text
        :param preserve:       optional string of characters to leave unconverted
        :returns:              converted 'text`
        """
        if isinstance(text, bytes):
            text = text.decode('utf-8')

        mapper = simp_to_trad  # change made here
        if not toTraditional:
            mapper = trad_to_simp  # change made here

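        if preserve:
            assert isinstance(preserve, str), \
                'Preserve should be a string of characters'
            # Copy first, so the shared module-level map is not modified
            # and later calls without `preserve` still convert everything.
            mapper = dict(mapper)
            mapper.update({c: c for c in preserve})  # change made here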

        final = []
        for c in text:
            # Fall back to the original character when there is no mapping.
            final.append(mapper.get(c, c))  # change made here
        return ''.join(final)

I believe this lets an end user opt in to a custom preservation mapping on a per-call basis. An example usage, assuming toTraditional passes preserve through to __convert, would be:

>>> hcv = HanziConv()
>>> text = '祢是我的一切,我的荣耀,我的盾牌'
>>> hcv.toTraditional(text, preserve='祢')
祢是我的一切,我的榮耀,我的盾牌

With this API, only one of the two convertible characters was transformed (荣 became 榮), while 祢 was preserved.
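
A self-contained sketch of the same behaviour, for anyone who wants to try it without the library. The two-entry table and the standalone `to_traditional` function are illustrative assumptions, not the package's code. The table is copied before the preserved characters are added, so a later call without `preserve` still converts everything.

```python
# Toy table; the 祢 -> 禰 and 荣 -> 榮 pairings are assumed for illustration.
SIMP_TO_TRAD = {"祢": "禰", "荣": "榮"}


def to_traditional(text, preserve=None):
    """Convert to traditional, leaving characters in `preserve` untouched."""
    mapper = SIMP_TO_TRAD
    if preserve:
        mapper = dict(mapper)  # copy so the shared table is not modified
        mapper.update({ch: ch for ch in preserve})
    return "".join(mapper.get(ch, ch) for ch in text)


text = "祢是我的一切,我的荣耀,我的盾牌"
print(to_traditional(text, preserve="祢"))  # 祢是我的一切,我的榮耀,我的盾牌
print(to_traditional(text))                 # 禰是我的一切,我的榮耀,我的盾牌
```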

If you're agreeable to this minor API change, I'm happy to modify a fork and put in a PR for this - do let me know!