Detailed description
I suggest adding a dictionary-based method that removes repeated trailing consonants.
For example: เริศศศศศศศศศศศศศศ -> เริศ
Context
I am doing text mining on Pantip. Quite a few people write words like "เริศศศศศศศศศศศศศศ" to express emotion. Currently, pythainlp.util.normalize() removes only duplicated vowels, so no existing method handles this case. Current tokenizers may split the word as "เริศ / ศศศศศศศศศศศศศ", but the leftover run of consonants becomes noise in the analysis.
Also, my own implementation turned out fairly long, so I would like this method to be included in the pythainlp library.
Possible implementation
My implementation is shown below.
import pythainlp.util

# Handles input such as เริศศศศศศศศศศศศศศ.
# This fragment runs inside a loop over `sentences`: `sentence` is
# sentences[cnt], and `dictio` is the word dictionary.
if (len(sentence) > 2) and pythainlp.util.isthaichar(sentence[-1]) and (sentence[-1] == sentence[-2]):
    # The sentence ends with a duplicated character
    # (duplication typically occurs at the end).
    dup = sentence[-1]
    # Collect dictionary words that legitimately end with `dup` doubled,
    # skipping words made up of `dup` alone.
    # Rebuilt on every call because dictio is extended dynamically.
    repeaters = []
    for word in dictio:
        if (len(word) > 2) and word.endswith(dup * 2) and any(ch != dup for ch in word):
            repeaters.append(word)
    # Strip the trailing run of `dup`, keeping at least one character.
    sentence_head = sentence
    while sentence_head[-1] == dup:
        if len(sentence_head) == 1:
            break
        sentence_head = sentence_head[:-1]
    # Check whether the remaining head ends with the stem of a repeater.
    found = False
    for repeater in repeaters:
        rep_head = repeater
        repetition = 0
        while rep_head[-1] == dup:
            rep_head = rep_head[:-1]
            repetition += 1
        if sentence_head[-len(rep_head):] == rep_head:
            found = True
            break
    if found:
        # Keep as many repetitions as the dictionary word has.
        sentences[cnt] = sentence_head + (dup * repetition)
    else:
        # Otherwise keep a single occurrence.
        sentences[cnt] = sentence_head + dup
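The same idea could be packaged as a standalone function, roughly as sketched below. This is only a sketch under some assumptions: the name `collapse_trailing_repeats` and its signature are hypothetical and not part of pythainlp, and `is_thai_char` is a simple Unicode-range stand-in for `pythainlp.util.isthaichar` so that the example runs without the library. Unlike the fragment above, it returns a single character when the whole input is one repeated character.

```python
def is_thai_char(ch):
    # Stand-in for pythainlp.util.isthaichar (assumption: Thai block U+0E00-U+0E7F).
    return "\u0e00" <= ch <= "\u0e7f"


def collapse_trailing_repeats(text, dictionary):
    """Collapse a trailing run of one repeated Thai character.

    Hypothetical helper, not an existing pythainlp function.
    `dictionary` is any iterable of known words.
    """
    if len(text) <= 2 or text[-1] != text[-2] or not is_thai_char(text[-1]):
        return text  # nothing to collapse

    dup = text[-1]
    head = text.rstrip(dup)  # text without the trailing run
    if not head:
        return dup  # the whole input was one repeated character

    # If a dictionary word legitimately ends with `dup` repeated
    # (run >= 2) and its stem matches the end of `head`, keep that run.
    for word in dictionary:
        stem = word.rstrip(dup)
        run = len(word) - len(stem)
        if run >= 2 and stem and head.endswith(stem):
            return head + dup * run

    # Otherwise keep a single occurrence.
    return head + dup


print(collapse_trailing_repeats("เริศศศศศ", {"เริศ"}))
print(collapse_trailing_repeats("สรรรรร", {"สรร"}))
```

With the small dictionaries used here, the first call should give เริศ and the second สรร, since สรร is a dictionary word that really ends with a doubled ร.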
If this plan seems good, I can make a PR.