Open ertugrul-dmr opened 3 years ago
Can you send the snippet used for each? I believe they can be added as tokenizer exception rules.
Emoji Removal:
import re
def remove_emoji(string):
    emoji_pattern = re.compile("["
        u"\U0001F600-\U0001F64F"  # emoticons
        u"\U0001F300-\U0001F5FF"  # symbols & pictographs
        u"\U0001F680-\U0001F6FF"  # transport & map symbols
        u"\U0001F1E0-\U0001F1FF"  # flags (iOS)
        u"\U00002702-\U000027B0"  # dingbats
        u"\U000024C2-\U0001F251"  # enclosed characters (broad range: also matches CJK and other non-emoji text)
        "]+", flags=re.UNICODE)
    return emoji_pattern.sub(r'', string)
Text:
remove_emoji("bugün forvetler 🔥🔥")
Output:
bugün forvetler
Text:
remove_emoji("Komedi😂")
Output:
Komedi
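One caveat with the snippet above: the last range, U+24C2 to U+1F251, spans most of the Basic Multilingual Plane, so it also strips CJK characters and many other non-emoji symbols. A minimal sketch of a narrower pattern follows, assuming that range was only meant to cover the enclosed-character emoji. The names `NARROW_EMOJI_PATTERN` and `remove_emoji_narrow` are hypothetical, and the pattern is not an exhaustive emoji list.

```python
import re

# Assumption: the broad range U+24C2..U+1F251 was meant to cover only the
# enclosed-character emoji, so it is split into narrower pieces here.
NARROW_EMOJI_PATTERN = re.compile(
    "["
    "\U0001F600-\U0001F64F"  # emoticons
    "\U0001F300-\U0001F5FF"  # symbols & pictographs
    "\U0001F680-\U0001F6FF"  # transport & map symbols
    "\U0001F1E0-\U0001F1FF"  # regional indicators (flags)
    "\U00002702-\U000027B0"  # dingbats
    "\U000024C2"             # circled M
    "\U0001F200-\U0001F251"  # part of the enclosed ideographic supplement
    "]+"
)


def remove_emoji_narrow(string):
    # Remove every run of characters matched by the narrower pattern.
    return NARROW_EMOJI_PATTERN.sub("", string)
```

With this version, text such as "日本語" passes through unchanged, while the emoji in the two examples above are still removed.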
URL Removal:
import re
def remove_urls(string):
    url_pattern = re.compile(r'https?://\S+|www\.\S+')
    return url_pattern.sub(r'', string)
Text:
remove_urls('Şu adresten bulabilirsin: https://www.imdb.com/title/tt0050083/')
Output:
Şu adresten bulabilirsin:
HTML Tags Removal:
import re
def remove_html(string):
    html_pattern = re.compile('<.*?>')
    return html_pattern.sub(r'', string)
text = """ </span>
</div>
</div>
</div>
</div>
<script>
if ('csm' in window) {
csm.measure('csm_TitleReviewsAndPopularityWidget_finished');
}
</script>
</div>"""
remove_html(text)
Output:
if ('csm' in window) {
csm.measure('csm_TitleReviewsAndPopularityWidget_finished');
}
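The output above still contains the JavaScript body, because the regex strips only the tags themselves and not what sits between them. If the contents of script and style elements should be dropped too, one possible sketch uses the standard library's `html.parser`. The names `_TextExtractor` and `remove_html_and_scripts` are hypothetical and not part of this project.

```python
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collects text nodes, skipping anything inside <script> or <style>."""

    SKIPPED = {"script", "style"}

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self._skip_depth = 0  # > 0 while inside a skipped element
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep text only when we are not inside a skipped element.
        if self._skip_depth == 0:
            self.parts.append(data)


def remove_html_and_scripts(string):
    parser = _TextExtractor()
    parser.feed(string)
    parser.close()  # flush any buffered text
    return "".join(parser.parts)
```

Whitespace between tags is kept as-is, so a whitespace cleanup step may still be needed afterwards.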
Digit Removal:
import re
def remove_digit(string):
    return re.sub(r'\d+', '', string)
Text:
remove_digit('20 kişi saat 12 gibi geldi')
Output:
kişi saat gibi geldi
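Note that removing the digits leaves the surrounding spaces in place, so the raw return value has a leading space and a double space where the numbers used to be. A minimal sketch of a cleanup step, assuming collapsed whitespace is the desired result; `normalize_whitespace` is a hypothetical helper name.

```python
import re


def remove_digit(string):
    return re.sub(r'\d+', '', string)


def normalize_whitespace(string):
    # Collapse runs of whitespace into a single space and trim both ends.
    return re.sub(r'\s+', ' ', string).strip()


raw = remove_digit('20 kişi saat 12 gibi geldi')
clean = normalize_whitespace(raw)
```

The same helper could be applied after the emoji and URL removal steps, which leave trailing spaces in the same way.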
@ertugrul-dmr Is there any work or PR on this? If not, let's prioritize this for the release after the next one.
Note to self: Should be done in a general way, allowing users to add their own custom preprocessing steps if necessary.
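One way to sketch the general approach from this note is a small pipeline that applies an ordered list of string-to-string callables, so users can append their own steps. The names `Preprocessor` and `add_step` are hypothetical and do not refer to any existing API in this project; the two built-in steps reuse the regexes from the snippets in this thread.

```python
import re
from typing import Callable, List, Optional

# Hypothetical alias: a preprocessing step maps a raw string to a cleaned one.
Step = Callable[[str], str]


def remove_urls(text: str) -> str:
    return re.sub(r'https?://\S+|www\.\S+', '', text)


def remove_digit(text: str) -> str:
    return re.sub(r'\d+', '', text)


class Preprocessor:
    """Hypothetical container that applies steps in insertion order."""

    def __init__(self, steps: Optional[List[Step]] = None):
        self.steps: List[Step] = list(steps or [])

    def add_step(self, step: Step) -> "Preprocessor":
        # Users can register any callable that takes and returns a string.
        self.steps.append(step)
        return self

    def __call__(self, text: str) -> str:
        for step in self.steps:
            text = step(text)
        return text


pipeline = Preprocessor([remove_urls, remove_digit])
# A user-supplied custom step: collapse whitespace left behind by removals.
pipeline.add_step(lambda s: " ".join(s.split()))
result = pipeline("Saat 12 gibi https://www.imdb.com/title/tt0050083/ adresine BAK")
```

Since every step works on the raw string, such a pipeline would run before tokenization, and the order of steps stays under the user's control.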
Adding extra text preprocessing options might come in handy for different use cases and might improve our model performance on some datasets. Options such as emoji, URL, HTML tag, and digit removal can be implemented.
Since these steps are applied to raw strings before tokenization, another question arises: where should we implement them? In bblock.util.py, maybe?