commonsearch / cosr-back

Backend of Common Search. Analyses webpages and sends them to the index.
https://about.commonsearch.org
Apache License 2.0

Tokenizer improvements #40

Open · sylvinus opened this issue 8 years ago

sylvinus commented 8 years ago

Our current tokenizer is... rather simple :)

Let's discuss what would be reasonable short-term improvements, as well as some mid-term ideas.

We should take into account the way documents are indexed in Elasticsearch (currently a big list of words) and the tokenization we could do on search queries (currently none).
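
For context, here is a minimal, purely illustrative sketch (not the actual cosr-back code) of the kind of whitespace-split tokenizer being described, producing the flat list of words that gets indexed:

# Hypothetical sketch of a "rather simple" tokenizer: lowercase, then split
# on an ASCII \s+ pattern. Note that punctuation stays attached to the words.
import re

_RE_WHITESPACE = re.compile(r"\s+")

def simple_tokenize(text):
    return [w for w in _RE_WHITESPACE.split(text.lower()) if w]

print(simple_tokenize(u"Hello,   World!\nfoo"))
# e.g. ['hello,', 'world!', 'foo'] (u'...' prefixes under Python 2)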

Sentimentron commented 8 years ago

One thing that's occurred to me is that Python 2's re module isn't fully Unicode-aware. Taking some examples from Wikipedia's page on this:

>>> _RE_WHITESPACE.split(u']\u2029[')
[u']\u2029[']

Whereas in Python 3's interpreter:

>>> _RE_WHITESPACE.split(u']\u2029[')
[']', '[']

Back in Python 2, the split method actually works better:

>>> u']\u2029['.split()
[u']', u'[']
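
For reference, a possible workaround on the Python 2 side (assuming _RE_WHITESPACE is a plain \s+ pattern, as the snippets above suggest) is to compile it with the re.UNICODE flag, which makes \s follow the Unicode definition of whitespace:

# Python 2 sketch: with re.UNICODE, \s also matches Unicode whitespace
# such as U+2029 (PARAGRAPH SEPARATOR) when splitting unicode strings.
import re

_RE_WHITESPACE = re.compile(r"\s+", re.UNICODE)

print(_RE_WHITESPACE.split(u"]\u2029["))  # [u']', u'[']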
sylvinus commented 8 years ago

Right. One more reason not to use simple regexes for this :)

Sentimentron commented 8 years ago

I've just become aware of NLTK's nltk.tokenize.casual module, which might be appropriate for this job.
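
For anyone who wants to try it out, here is a minimal sketch of how that module is typically used via its TweetTokenizer class (the option values are illustrative, not a recommendation for cosr-back):

# Sketch of nltk.tokenize.casual usage; requires NLTK to be installed.
from nltk.tokenize.casual import TweetTokenizer

tokenizer = TweetTokenizer(preserve_case=False, reduce_len=True, strip_handles=True)
print(tokenizer.tokenize(u"Check this out: http://example.com :-) soooooo cool!!"))
# Roughly: ['check', 'this', 'out', ':', 'http://example.com', ':-)', 'sooo', 'cool', '!', '!']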