commonsearch / cosr-back

Backend of Common Search. Analyses webpages and sends them to the index.
https://about.commonsearch.org
Apache License 2.0

Strip Unicode Emoji characters from page titles #42

Closed Sentimentron closed 8 years ago

Sentimentron commented 8 years ago

Performance seems mostly unaffected. Also contains additional logic to check that the underlying document is encoded correctly if it's Unicode. Fixes #8.
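For readers following along, here is a minimal sketch of what stripping emoji from a title could look like. It is not the patch from this PR: the `strip_emoji` name and the chosen code point ranges are assumptions made for illustration, and the ranges are deliberately broad, so they also remove some non-emoji symbols and dingbats.

```python
import re

# Hypothetical character class covering common emoji blocks.
# The ranges are an assumption for illustration, not an exhaustive emoji list.
_EMOJI_RE = re.compile(
    "["
    "\U0001F1E6-\U0001F1FF"  # regional indicator symbols (flag pairs)
    "\U0001F300-\U0001FAFF"  # pictographs, emoticons, transport, supplemental symbols
    "\u2600-\u27BF"          # miscellaneous symbols and dingbats
    "\uFE0F"                 # variation selector-16 (emoji presentation)
    "\u200D"                 # zero-width joiner used in emoji sequences
    "]+"
)


def strip_emoji(title: str) -> str:
    """Remove emoji from a title and collapse the whitespace left behind."""
    cleaned = _EMOJI_RE.sub("", title)
    # Removing an emoji between two words leaves a double space; normalise it.
    return " ".join(cleaned.split())
```

Compiling the pattern once at module level keeps the per-title cost to a single regex pass, which fits the observation that performance is mostly unaffected.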

sylvinus commented 8 years ago

Cool, thanks a lot!

The patch to formatting.py looks great, though it seems to break pylint: https://travis-ci.org/commonsearch/cosr-back/builds/117307266#L430

I'd like to avoid the need for the patch on htmlencoding.py, and try to keep the current behaviour, which is that HTMLDocument only accepts utf-8 strings (because that's what gumbo-parser accepts). I think you would just need to encode it properly in your test instead, right?

thanks!
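The suggestion above, encoding the input in the test rather than changing the parser, might look roughly like the sketch below. `extract_title` is a hypothetical stand-in for a component that only accepts UTF-8 encoded bytes; it is not the real `HTMLDocument` API, whose constructor and methods are not shown in this thread.

```python
import re


def extract_title(html: bytes) -> str:
    """Hypothetical stand-in for a parser that accepts only UTF-8 bytes."""
    if not isinstance(html, bytes):
        raise TypeError("expected UTF-8 encoded bytes")
    match = re.search(rb"<title>(.*?)</title>", html, re.DOTALL)
    return match.group(1).decode("utf-8") if match else ""


# The test builds the page as text, then encodes it once at the boundary,
# so the parser itself never has to handle unencoded Unicode strings.
page = "<html><head><title>Caf\u00e9 \U0001F600</title></head></html>"
title = extract_title(page.encode("utf-8"))
```

Keeping the encoding step in the test leaves the parser's contract unchanged, which is the behaviour the reviewer asks to preserve.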

Sentimentron commented 8 years ago

I'm sure that won't be a problem.

sylvinus commented 8 years ago

Thanks a lot!!! Can't wait to refresh the index with all these fixes ;)