CAMeL-Lab / camel_tools

A suite of Arabic natural language processing tools developed by the CAMeL Lab at New York University Abu Dhabi.
MIT License

[QUESTION] About repeating characters #134

Closed ghost closed 10 months ago

ghost commented 10 months ago

Hi, I'm working on cleaning an Arabic dataset that contains repeated characters inside strings, for example "مرحباااا" instead of "مرحبا". Is there a function in CAMeL Tools for this? I read the documentation and didn't find anything related. I also tried the camel_arclean command, but the repeated characters are still there. Waiting for your help.

balhafni commented 10 months ago

Hello, we currently do not support this. But this can be accomplished by using something like the function below:

import re

# Matches any character followed by two or more copies of itself,
# i.e. a run of three or more identical characters.
_REP_AR_RE = re.compile(r'(.)\1{2,}')


def remove_repetitions_ar(s, policy=1):
    """Reduces runs of three or more repeated characters in an Arabic
    string to one or two characters, based on the optional policy.

    Args:
        s (:obj:`str`): The string to be normalized.
        policy (:obj:`int`, optional):
            The reduction policy. If policy=`1` the repeated characters will
            be reduced to `1` character. If policy=`2` the repeated characters
            will be reduced to `2` characters. Defaults to `1`.

    Returns:
        :obj:`str`: The normalized string.

    Raises:
        ValueError: If policy is neither `1` nor `2`.
    """
    if policy == 1:
        return _REP_AR_RE.sub(r'\1', s)
    elif policy == 2:
        return _REP_AR_RE.sub(r'\1\1', s)
    else:
        raise ValueError("Policy value should be either 1 or 2!")

>>> remove_repetitions_ar('مرحباااا')
'مرحبا'
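
To make the difference between the two policies concrete, here is a minimal, self-contained sketch using the same regular expression. The helper name `reduce_repeats` and its `keep` parameter are hypothetical and only used for illustration; they are not part of CAMeL Tools. Because the pattern only matches runs of three or more identical characters, legitimately doubled letters such as the two lams in "الله" are left alone.

```python
import re

# Same pattern as in the function above: a run of 3+ identical characters.
_REP_RE = re.compile(r'(.)\1{2,}')


def reduce_repeats(s, keep=1):
    """Hypothetical helper: collapse each run of 3+ identical characters
    to `keep` copies (1 or 2)."""
    if keep not in (1, 2):
        raise ValueError("keep should be either 1 or 2")
    # m.group(1) is the repeated character; emit it `keep` times.
    return _REP_RE.sub(lambda m: m.group(1) * keep, s)


word = 'مرحب' + 'ا' * 4          # "marhaba" with the final alef typed 4 times

print(reduce_repeats(word, keep=1))    # run of alefs collapsed to one
print(reduce_repeats(word, keep=2))    # run of alefs collapsed to two
print(reduce_repeats('الله', keep=1))  # unchanged: the lam is only doubled
```

Note that `.` matches any character, not just Arabic letters, so runs of digits or punctuation (for example "1000" or "!!!") would be collapsed as well. If that is not wanted, the pattern could be restricted to a character class covering only the letters you care about.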

Hope this is helpful.

ghost commented 10 months ago

Yes, it helps a lot. I asked because I saw the module camel_tools.morphology.errors.MorphologyError in the docs, so I thought this module might be for errors like the repeating characters. Unfortunately, the docs don't have enough examples. Is there any module in CAMeL Tools that checks for grammatical or orthographic errors and corrects them?