Domain names treated as sentences

quantoid commented 6 years ago

If the text contains a domain name like www.google.com then the parts of that name are extracted as words, e.g. the word "com".
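A minimal sketch of the symptom, assuming words are split with the same regular expression as nltk's wordpunct_tokenize (the tokenizer named later in this thread). `wordpunct_split` is a hypothetical stand-in written with the standard library so that it runs without nltk installed:

```python
import re

# Same pattern that nltk documents for WordPunctTokenizer: runs of word
# characters, or runs of characters that are neither word nor whitespace.
WORDPUNCT_PATTERN = re.compile(r"\w+|[^\w\s]+")


def wordpunct_split(text):
    """Hypothetical stand-in for wordpunct_tokenize (an assumption, not package code)."""
    return WORDPUNCT_PATTERN.findall(text)


tokens = wordpunct_split("Visit www.google.com today")
print(tokens)
# ['Visit', 'www', '.', 'google', '.', 'com', 'today']
```

The dots are split off as their own tokens, so "www", "google" and "com" each end up as a separate word.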

ghost commented 6 years ago

Hi, for this issue, and for real-world text in general, which is often crammed with punctuation marks, I tried various tokenizers and was satisfied with the way nltk's TweetTokenizer works. I implemented it as follows:

    from nltk.tokenize import TweetTokenizer, sent_tokenize

    tokenizer_words = TweetTokenizer()

    def _generate_phrases(self, sentences):
        phrase_list = set()
        for sentence in sentences:
            word_list = [word.lower() for word in tokenizer_words.tokenize(sentence)]
            phrase_list.update(self._get_phrase_list_from_words(word_list))
        return phrase_list

Not only does this keep www.google.com intact, it also preserves meaningful tokens such as #hashtag and @person.

csurfer commented 6 years ago

@nsehwan: I am open to any extension to the package as long as the following are met:

1. It is a problem for the vast majority.
2. The solution to the problem can be made generic enough.

Even though it meets requirement (1), I think we should first turn your simple solution into a generic one that everyone can use before implementing it.

ghost commented 6 years ago

Thanks @csurfer for the information. I am working on your suggestions.

ghost commented 6 years ago

Sorry for disappearing! After trying various tokenizers, I thought it better to build a sanitizer/tokenizer based on your suggestions. It really did turn out better that way, i.e. more general.

get_sanitized_word_list takes an individual sentence, as split out by sent_tokenize, and returns a list of words similar to what wordpunct_tokenize(sentence) returned previously, but sanitized better.

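    import string

    # Characters that may appear inside or at the start/end of a word.
    # string.digits also covers "0", which the original list omitted.
    WORD_CHARS = string.ascii_letters + string.digits + "'.~`^:<>/-_%&@*#$"

    def get_sanitized_word_list(data):
        result = []
        word = ''
        for char in data:
            if char not in string.whitespace:
                if char not in WORD_CHARS:
                    # Any other character ends the current word and becomes its own token.
                    if word:
                        result.append(word)
                    result.append(char)
                    word = ''
                else:
                    word += char
            else:
                # Whitespace ends the current word.
                if word:
                    result.append(word)
                    word = ''
        # Flush the last word if the sentence did not end on a separator.
        if word:
            result.append(word)
        return result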

It works on most of the general cases I have tried so far, and yes, it does better than TweetTokenizer as well. Please let me know what you think about this.
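To make the behaviour concrete, here is a small check of the sanitizer on a sample sentence. The function is restated compactly so the block runs on its own. The sample sentences are made up for illustration, and using string.digits for the digit characters is an assumption on my part:

```python
import string

# Characters allowed inside a word, following the list in the post.
# Assumption: string.digits is used for the digits, so "0" is included.
WORD_CHARS = set(string.ascii_letters + string.digits + "'.~`^:<>/-_%&@*#$")


def get_sanitized_word_list(data):
    """Split one sentence into words, keeping URLs, #tags and @mentions whole."""
    result = []
    word = ''
    for char in data:
        if char in string.whitespace:
            # Whitespace ends the current word.
            if word:
                result.append(word)
                word = ''
        elif char in WORD_CHARS:
            word += char
        else:
            # Any other character ends the word and becomes its own token.
            if word:
                result.append(word)
                word = ''
            result.append(char)
    if word:
        result.append(word)
    return result


tokens = get_sanitized_word_list(
    "Visit www.google.com, then ping @person about #hashtag!"
)
print(tokens)
# ['Visit', 'www.google.com', ',', 'then', 'ping', '@person', 'about', '#hashtag', '!']
```

One thing the sketch makes visible: because "." is in the word-character list, a sentence-final period stays attached to the last word, so "It is done." comes back as `['It', 'is', 'done.']`. That may need separate handling before this can be called fully generic.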

csurfer / rake-nltk

Domain names treated as sentences #24