Open quantoid opened 6 years ago
Hi, for this issue, and more generally for real-world text that is often crammed with punctuation marks, I tried various tokenizers and was satisfied with how nltk's TweetTokenizer works. I implemented it as follows:
from nltk.tokenize import TweetTokenizer, sent_tokenize

tokenizer_words = TweetTokenizer()

def _generate_phrases(self, sentences):
    phrase_list = set()
    for sentence in sentences:
        word_list = [word.lower() for word in tokenizer_words.tokenize(sentence)]
        phrase_list.update(self._get_phrase_list_from_words(word_list))
    return phrase_list
Not only does this keep www.google.com intact as a single token, it also preserves tokens such as #hashtag and @person.
@nsehwan: I am open to any extension to the package as long as the following are met:
Even though it meets requirement (1), I think we should first generalize your simple solution so that everyone can use it, and only then implement it.
Thanks @csurfer for the information, working on your suggestions
Sorry for disappearing! After trying various tokenizers, I thought it better to build a sanitizer/tokenizer based on your suggestions. It actually turned out better that way, i.e. more general.
get_sanitized_word_list is a function that takes an individual sentence (as split out by sent_tokenize) and returns a list of words similar to what wordpunct_tokenize(sentence) returned previously, but better sanitized.
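`import string

def get_sanitized_word_list(data):
    result = []
    word = ''
    # List of whatever could be within or at start/end of words.
    # Note: '0' is not in this set, so it is emitted as a separate token.
    word_chars = string.ascii_letters + "'.~`^:<>/-_%&@*#$123456789"
    for char in data:
        if char not in string.whitespace:
            if char not in word_chars:
                # Separator character: flush the current word, then emit the character itself.
                if word:
                    result.append(word)
                result.append(char)
                word = ''
            else:
                word += char
        else:
            # Whitespace ends the current word.
            if word:
                result.append(word)
                word = ''
    # Flush the last word if the sentence did not end with a separator.
    if word:
        result.append(word)
    return result`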
It works on most of the general cases I have tried so far, and yes, it does better than TweetTokenizer as well. Please let me know what you think about this.
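For anyone who wants to try the idea quickly, here is a rough regex sketch that should behave the same as get_sanitized_word_list above: a token is either a run of the allowed characters or a single other non-whitespace character. The names WORD_CHARS, TOKEN_RE and sanitize_with_regex are made up for this example and are not part of the package.

```python
import re
import string

# Characters allowed inside a token, copied from get_sanitized_word_list.
WORD_CHARS = string.ascii_letters + "'.~`^:<>/-_%&@*#$123456789"

# A token is a run of allowed characters, or one other non-whitespace character.
# re.ASCII makes \S treat only the characters in string.whitespace as whitespace.
TOKEN_RE = re.compile("[" + re.escape(WORD_CHARS) + "]+|\\S", re.ASCII)


def sanitize_with_regex(sentence):
    """Hypothetical regex equivalent of get_sanitized_word_list."""
    return TOKEN_RE.findall(sentence)


print(sanitize_with_regex("Visit www.google.com, then ping @person about #hashtag!"))
# ['Visit', 'www.google.com', ',', 'then', 'ping', '@person', 'about', '#hashtag', '!']
```

Two things to keep in mind with this character set: since "." is allowed inside words, a sentence-final period stays attached to the last word, and since "0" is not in the set, a number such as 101 comes out as the three tokens 1, 0, 1.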
If the text contains a domain name like www.google.com then the parts of that name are extracted as words, e.g. the word "com".
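The behaviour described here can be reproduced without installing anything. The sketch below assumes the pattern `\w+|[^\w\s]+`, which is the regular expression NLTK's wordpunct_tokenize is built on; the helper name wordpunct_like is made up for this example.

```python
import re

# Runs of word characters, or runs of characters that are neither
# word characters nor whitespace (the wordpunct-style split).
WORDPUNCT_RE = re.compile(r"\w+|[^\w\s]+")


def wordpunct_like(text):
    """Hypothetical stand-in for wordpunct_tokenize, for illustration only."""
    return WORDPUNCT_RE.findall(text)


print(wordpunct_like("Search on www.google.com today"))
# ['Search', 'on', 'www', '.', 'google', '.', 'com', 'today']
```

The dots split the domain into separate tokens, which is how "www", "google" and "com" end up as candidate words.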