amueller / scipy-2016-sklearn

Scikit-learn tutorial at SciPy2016

Bigram & character-level tokenization #8

Closed: rasbt closed this issue 8 years ago

rasbt commented 8 years ago

Hi Andy, I am a bit confused about the bigram & character-level tokenization in sklearn (e.g., as shown in Nb 03.4). Say we have the following text, which we tokenize as follows:

>>> from sklearn.feature_extraction.text import CountVectorizer
>>> X = ['Some say the world will end in fire,', 'Some say in ice.']

>>> char_vectorizer = CountVectorizer(ngram_range=(2, 2), analyzer="char")
>>> char_vectorizer.fit(X)

>>> print(char_vectorizer.get_feature_names())
[' e', ' f', ' i', ' s', ' t', ' w', 'ay', 'ce', 'd ', 'e ', 'e,', 'e.', 'en', 'fi', 'he', 'ic', 'il', 'in', 'ir', 'l ', 'ld', 'll', 'me', 'n ', 'nd', 'om', 'or', 're', 'rl', 'sa', 'so', 'th', 'wi', 'wo', 'y ']

Why would we end up with these single characters when we set ngram_range=(2, 2)? I thought we'd only get those for, e.g., ngram_range=(1, x).

amueller commented 8 years ago

There is a space around the single characters ;). You can use analyzer="char_wb" to respect the word boundaries.
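
For reference, a minimal sketch illustrating both points. It reuses the X and get_feature_names() call from above (newer scikit-learn releases rename it to get_feature_names_out()); the 5-gram comparison at the end is just an illustrative check, not something from the notebook.

from sklearn.feature_extraction.text import CountVectorizer

X = ['Some say the world will end in fire,', 'Some say in ice.']

# repr() makes the whitespace visible: each apparent "single character"
# is really a two-character bigram containing a space.
char_vectorizer = CountVectorizer(ngram_range=(2, 2), analyzer="char").fit(X)
print([repr(f) for f in char_vectorizer.get_feature_names()[:6]])
# ["' e'", "' f'", "' i'", "' s'", "' t'", "' w'"]

# analyzer="char_wb" builds n-grams only inside word boundaries (each word
# is padded with one space on either side), so no n-gram ever spans two
# words. With bigrams the two vocabularies look similar; the difference is
# clearer for longer n-grams:
for analyzer in ("char", "char_wb"):
    cv = CountVectorizer(ngram_range=(5, 5), analyzer=analyzer).fit(X)
    crossing = [f for f in cv.get_feature_names() if " " in f.strip()]
    print(analyzer, crossing[:5])
# "char" produces 5-grams with interior spaces (i.e., spanning two words),
# while "char_wb" produces none.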

rasbt commented 8 years ago

haha, thanks, I think I need glasses ;).