commonsearch / cosr-back

Backend of Common Search. Analyses webpages and sends them to the index.
https://about.commonsearch.org
Apache License 2.0

Clean titles and descriptions #8

Open sylvinus opened 8 years ago

sylvinus commented 8 years ago

Some titles contain characters like 🔥, which we probably do not want in the search results.

Is there a simple way (or existing Python module?) to clean all those characters without messing with international characters?

Sentimentron commented 8 years ago

U+1F525 belongs to the Miscellaneous Symbols and Pictographs Unicode block and can be filtered using Python's unicodedata module; on this interpreter the category in question is 'Cn' (unassigned), though newer Unicode databases classify emoji as 'So' (Symbol, other).

>>> import unicodedata
>>> print u'\U0001F42D'
🐭
>>> [unicodedata.category(c) for c in u'\U0001F42D']
['Cn']
sylvinus commented 8 years ago

@Sentimentron that looks like the right solution!

Should we whitelist or blacklist categories? https://en.wikipedia.org/wiki/Unicode_character_property

Should be straightforward to implement now!
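For illustration, a minimal sketch of the blacklist variant — the clean_title name and the chosen categories are assumptions here, not code from the repo:

import unicodedata

def clean_title(title):
    # Drop pictographic symbols ('So'), unassigned code points ('Cn', what older
    # Unicode databases report for emoji) and surrogates ('Cs', what narrow Python 2
    # builds report); letters, digits and punctuation in any script pass through.
    cleaned = u"".join(
        c for c in title
        if unicodedata.category(c) not in ("So", "Cn", "Cs")
    )
    # Collapse the whitespace left behind by the removed characters.
    return u" ".join(cleaned.split())

>>> clean_title(u"\U0001F525 Hot new releases \U0001F525")
u'Hot new releases'

Note that 'So' also covers legitimate symbols such as © and ™, so a stricter filter limited to the emoji code-point ranges might be preferable if those should be kept.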

Sentimentron commented 8 years ago

Example search demonstrating the behaviour: https://uidemo.commonsearch.org/?g=en&q=emoji

(screenshot: emoji appearing in the titles of the search results)

Sentimentron commented 8 years ago

Also an interesting discovery: cchardet cannot reliably determine the encoding of text containing these symbols, even though it normally detects UTF-8 just fine. So if a page is decoded using cchardet's wrong guess, the emoji bytes get mangled into ordinary characters and the category-based stripping won't catch them.

>>> import cchardet
>>> import urllib2
>>> ta_dic = urllib2.urlopen("http://www.tamildict.com/english.php").read()
>>> cchardet.detect(ta_dic)
{'confidence': 0.9900000095367432, 'encoding': u'UTF-8'}
>>> ta_em = u"😋  Super Emoji-Land.com"
>>> ta_em = ta_em.encode('utf8')
>>> cchardet.detect(ta_em)
{'confidence': 0.8154354095458984, 'encoding': u'ISO-8859-9'}
>>> print ta_em.decode('ISO-8859-9')
ğ  Super Emoji-Land.com
>>> ta_em = urllib2.urlopen("http://unicode.org/emoji/charts/full-emoji-list.html").read()
>>> cchardet.detect(ta_em)
{'confidence': 0.4998016357421875, 'encoding': u'WINDOWS-1252'}

The last example is particularly damning, since it's a page that consists of basically nothing except emoji and their UTF-8 encodings.
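To spell out why this breaks the stripping step (a quick check, not from the original thread): once the bytes are decoded with the wrong codec, the emoji becomes ordinary letters plus invisible control characters, so a category-based filter no longer sees 'So' or 'Cn' at all.

>>> import unicodedata
>>> unicodedata.category(u'\u011f')  # ğ: what the emoji's first byte (0xF0) becomes under ISO-8859-9
'Ll'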

sylvinus commented 8 years ago

Good find! I'm not sure how we could fix this. Maybe there are not enough emoji in the dataset cchardet was trained on?
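One possible mitigation, sketched below under the assumption that most affected pages are actually UTF-8 (the decode_body helper is hypothetical, not part of cosr-back): attempt a strict UTF-8 decode first and only fall back to cchardet's guess when that fails.

import cchardet

def decode_body(raw_bytes):
    # Prefer strict UTF-8, which preserves emoji correctly; only ask cchardet
    # when the bytes are genuinely not valid UTF-8.
    try:
        return raw_bytes.decode("utf-8")
    except UnicodeDecodeError:
        guess = cchardet.detect(raw_bytes)
        encoding = guess.get("encoding") or "utf-8"
        return raw_bytes.decode(encoding, "replace")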

sylvinus commented 8 years ago

@Sentimentron looking back at the patch, I think we should also remove emojis from descriptions, don't you?

Sentimentron commented 8 years ago

So I did some searching: Google does strip emoji from descriptions, but Bing doesn't. I think Bing's results for "pile of poop emoji" are actually more descriptive.

sylvinus commented 8 years ago

Interesting! My instinct would be to remove them, but maybe we can reconsider later once the results have evolved a bit!