PyThaiNLP / pythainlp

Thai natural language processing in Python
https://pythainlp.org/
Apache License 2.0
987 stars, 274 forks

[Suggestion] Add consonant-remover method #860

Closed konbraphat51 closed 1 year ago

konbraphat51 commented 1 year ago

Detailed description

I suggest adding a dictionary-based consonant-remover method that strips repeated trailing consonants, for example เริศศศศศศศศศศศศศศ -> เริศ.

Context

I am doing text mining on Pantip. Quite a few people write things like "เริศศศศศศศศศศศศศศ" to express emotion. The current pythainlp.util.normalize() only removes duplicated vowels, so no existing method handles this case. Current tokenizers may split this into "เริศ / ศศศศศศศศศศศศศ", which adds noise to the analysis. In addition, my implementation turned out to be rather long, so I would like this method to be available in the pythainlp library.
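
To illustrate why the suggestion is dictionary-based, here is a naive baseline that is not part of the proposal: a regular expression that collapses any trailing run of one Thai consonant to a single character. The function name `collapse_trailing_consonants` is hypothetical, and the character range ก-ฮ is an assumption about what counts as a consonant. It handles เริศศศศศ, but it also damages a word such as สรร, which legitimately ends in a doubled ร.

```python
import re

# Hypothetical baseline: one Thai consonant (ก-ฮ) repeated two or more times
# at the very end of the string.
_TRAILING_RUN = re.compile(r"([ก-ฮ])\1+$")


def collapse_trailing_consonants(text: str) -> str:
    """Collapse a trailing run of one repeated Thai consonant to a single character."""
    return _TRAILING_RUN.sub(r"\1", text)


print(collapse_trailing_consonants("เริศศศศศ"))  # เริศ
print(collapse_trailing_consonants("สรร"))       # สร (the doubled ร is lost)
```

Looking up the legitimate endings in a word list avoids the second result.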

Possible implementation

My implementation is shown below.

        # Handles inputs such as เริศศศศศศศศศศศศศศ.
        # Assumed context: this runs inside a loop over `sentences`, where `cnt` is
        # the current index, `sentence` is sentences[cnt], `dictio` is the word
        # dictionary, and `import pythainlp.util` appears at the top of the module.

        if (len(sentence) > 2) and pythainlp.util.isthaichar(sentence[-1]) and (sentence[-1] == sentence[-2]):
            # The sentence ends with a duplicated character (duplication typically occurs at the end)

            dup = sentence[-1]

            # Find the dictionary words that legitimately end with the duplicated character.
            # This is recomputed here because dictio is updated dynamically.
            repeaters = []
            for word in dictio:
                if (len(word) > 2) and (word[-1] == dup) and (word[-2] == dup):
                    # Skip words made up entirely of the duplicated character
                    if any(ch != dup for ch in word):
                        repeaters.append(word)

            # Strip the trailing run of dup from the sentence (keeping at least one character)
            sentence_head = sentence
            while sentence_head[-1] == dup:
                if len(sentence_head) == 1:
                    break

                sentence_head = sentence_head[:-1]

            # Check whether the remaining head ends with the stem of any repeater
            found = False
            for repeater in repeaters:
                rep_head = repeater

                repetition = 0
                while rep_head[-1] == dup:
                    rep_head = rep_head[:-1]
                    repetition += 1

                if sentence_head[-len(rep_head):] == rep_head:
                    found = True
                    break

            if found:
                # Keep as many repeats as the matching dictionary word has
                sentences[cnt] = sentence_head + (dup * repetition)
            else:
                # Otherwise keep a single occurrence
                sentences[cnt] = sentence_head + (dup * 1)
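
The fragment above depends on surrounding variables, so a self-contained sketch of the same idea may be easier to try out. This is not pythainlp API: the function name `strip_trailing_repeats`, the `words` parameter, and the sample word list are all hypothetical, and the Unicode range check is only a stand-in for `pythainlp.util.isthaichar`. Unlike the fragment, it keeps the longest legitimate run among all matching dictionary words rather than the first match, and it reduces a string made of one repeated character to a single character.

```python
from typing import Iterable


def strip_trailing_repeats(text: str, words: Iterable[str]) -> str:
    """Trim a run of one repeated Thai character at the end of `text`.

    The run is cut down to the number of repeats found in a dictionary word
    that matches the end of the text, or to one repeat otherwise.
    """
    if len(text) < 2 or text[-1] != text[-2]:
        return text  # no trailing duplication

    dup = text[-1]
    # Assumption: the Thai Unicode block stands in for pythainlp.util.isthaichar.
    if not ("\u0e00" <= dup <= "\u0e7f"):
        return text

    head = text.rstrip(dup)
    if not head:
        return dup  # the whole string is one repeated character

    keep = 1
    for word in words:
        stem = word.rstrip(dup)
        repeats = len(word) - len(stem)
        # Only words that really end in a doubled character are relevant.
        if stem and repeats >= 2 and head.endswith(stem):
            keep = max(keep, repeats)
    return head + dup * keep


sample_words = ["เริศ", "สรร"]  # illustrative word list, not a real dictionary
print(strip_trailing_repeats("เริศศศศศ", sample_words))  # เริศ
print(strip_trailing_repeats("สรรรรร", sample_words))    # สรร
```

Passing the word list as an argument keeps the sketch independent of how the dictionary is loaded or updated.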

If this plan seems good, I could make a PR.

wannaphong commented 1 year ago

It looks good. 👍

konbraphat51 commented 1 year ago

Okay, I will handle this soon.