GlobalMaksimum / sadedegel

A General Purpose NLP library for Turkish
http://sadedegel.ai
MIT License

Adding Optional Text Preprocessing Steps #267

Open ertugrul-dmr opened 3 years ago

ertugrul-dmr commented 3 years ago

Adding extra text preprocessing options could come in handy for different use cases and might improve model performance on some datasets. These options can be implemented:

Since these steps are applied to raw strings before tokenization, another question arises: where should we implement them? bblock.util.py maybe?

husnusensoy commented 3 years ago

Can you send the snippet used for each? I believe they can be added as tokenizer exception rules.

ertugrul-dmr commented 3 years ago

Emoji Removal:

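import re

# Compiled once at import time instead of on every call.
EMOJI_PATTERN = re.compile(
    "["
    "\U0001F600-\U0001F64F"  # emoticons
    "\U0001F300-\U0001F5FF"  # symbols & pictographs
    "\U0001F680-\U0001F6FF"  # transport & map symbols
    "\U0001F1E0-\U0001F1FF"  # flags (regional indicator symbols)
    "\U00002702-\U000027B0"  # dingbats
    # Catch-all range: besides enclosed characters it also spans CJK and
    # Hangul, so it removes non-emoji text in those scripts as well.
    "\U000024C2-\U0001F251"
    "]+"
)


def remove_emoji(string):
    """Return `string` with every character matched by EMOJI_PATTERN removed."""
    return EMOJI_PATTERN.sub("", string)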

Text: remove_emoji("bugün forvetler 🔥🔥") Output: bugün forvetler

Text: remove_emoji("Komedi😂") Output: Komedi

URL Removal:

import re
def remove_urls(string):
    url_pattern = re.compile(r'https?://\S+|www\.\S+')
    return url_pattern.sub(r'', string)

Text: remove_urls('Şu adresten bulabilirsin: https://www.imdb.com/title/tt0050083/') Output: Şu adresten bulabilirsin:

HTML Tags Removal:

import re

def remove_html(string):
    html_pattern = re.compile('<.*?>')
    return html_pattern.sub(r'', string)

text = """                </span>
            </div>
        </div>                                   
    </div>
      </div>              
  <script>
    if ('csm' in window) {
      csm.measure('csm_TitleReviewsAndPopularityWidget_finished');
    }
  </script>
    </div>"""

remove_html(text) Output:


    if ('csm' in window) {
      csm.measure('csm_TitleReviewsAndPopularityWidget_finished');
    }
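As the output above shows, a `<.*?>` regex strips only the tags themselves, so the body of the `<script>` element survives as text. One possible alternative, sketched here with the standard library's `html.parser`, is to skip the contents of `<script>` and `<style>` entirely. The names `_TextExtractor` and `strip_html` are hypothetical and not part of sadedegel, and collapsing whitespace at the end is an assumption about what callers want.

```python
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collects text nodes, skipping the contents of <script> and <style>."""

    SKIPPED = {"script", "style"}

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self._chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep text only when we are not inside a skipped element.
        if self._skip_depth == 0:
            self._chunks.append(data)

    def text(self):
        # Collapse the whitespace left behind by the removed markup.
        return " ".join("".join(self._chunks).split())


def strip_html(string):
    # Hypothetical helper name, used here for illustration only.
    parser = _TextExtractor()
    parser.feed(string)
    parser.close()
    return parser.text()
```

With this sketch, `strip_html("<div>Merhaba <b>dünya</b><script>x();</script></div>")` keeps only the visible text and drops the script body.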

Digit Removal:

import re

def remove_digit(string):    
    return re.sub(r'\d+', '', string)

Text: remove_digit('20 kişi saat 12 gibi geldi') Output: ' kişi saat  gibi geldi' (the whitespace around the removed digits is left in place)

dafajon commented 2 years ago

@ertugrul-dmr Is there any work or PR on this? If not, let's prioritize this for the release after the next one.

askarbozcan commented 2 years ago

Note to self: Should be done in a general way, allowing users to add their own custom preprocessing steps if necessary.
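A minimal sketch of what such a general mechanism could look like, assuming every preprocessing step is a plain `str -> str` function applied before tokenization. The registry, `register_step` and `preprocess` are hypothetical names for illustration and do not reflect an existing sadedegel API.

```python
import re
from typing import Callable, Dict, Iterable

PreprocessStep = Callable[[str], str]

# Hypothetical registry mapping step names to functions (illustrative only).
_STEPS: Dict[str, PreprocessStep] = {}


def register_step(name: str) -> Callable[[PreprocessStep], PreprocessStep]:
    """Decorator that registers a str -> str function under `name`."""
    def decorator(func: PreprocessStep) -> PreprocessStep:
        _STEPS[name] = func
        return func
    return decorator


@register_step("remove_urls")
def remove_urls(text: str) -> str:
    return re.sub(r"https?://\S+|www\.\S+", "", text)


@register_step("remove_digit")
def remove_digit(text: str) -> str:
    return re.sub(r"\d+", "", text)


@register_step("normalize_whitespace")
def normalize_whitespace(text: str) -> str:
    # Collapse runs of whitespace and trim both ends.
    return " ".join(text.split())


def preprocess(text: str, steps: Iterable[str]) -> str:
    """Apply the named steps to `text` in the order given."""
    for name in steps:
        try:
            step = _STEPS[name]
        except KeyError:
            raise ValueError(f"Unknown preprocessing step: {name!r}") from None
        text = step(text)
    return text
```

A user could then add a custom step by decorating their own function with `@register_step("my_step")` and listing it by name, for example `preprocess(raw, ["remove_urls", "my_step", "normalize_whitespace"])`. Keeping the steps as ordered names makes the pipeline easy to configure per dataset.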