commonsearch / cosr-back

Backend of Common Search. Analyses webpages and sends them to the index.
https://about.commonsearch.org
Apache License 2.0

Still some encoding issues #9

Open sylvinus opened 8 years ago

sylvinus commented 8 years ago

There seem to be a few cases left where we get � characters in search results:

Check to see if they can be fixed, and prefer ignoring the error completely over showing any �.
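As a small illustration of the difference between showing � and ignoring the error, here is a sketch using Python's built-in `bytes.decode` error handlers. The sample bytes are an invented example, not taken from the affected pages.

```python
# Sample input (hypothetical): "café au lait" encoded as windows-1252,
# which is not valid UTF-8 because of the lone 0xE9 byte.
raw = b"caf\xe9 au lait"

# errors="replace" substitutes U+FFFD (the "�" character) for the bad byte.
replaced = raw.decode("utf-8", errors="replace")

# errors="ignore" silently drops the bad byte instead.
ignored = raw.decode("utf-8", errors="ignore")

print(repr(replaced))
print(repr(ignored))
```

Ignoring loses the accented character, but no � ever reaches the search results.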

glebourgeois commented 8 years ago

I have not yet looked into these two examples specifically, but considering this:

if self.doc.source_headers.get("content-type"):
    header_encoding = get_encoding_from_content_type(self.doc.source_headers["content-type"])
    if header_encoding:
        return header_encoding

I can assure you that you cannot trust the encoding specified in headers; they are quite often badly declared.

sylvinus commented 8 years ago

Indeed!

Do you have another solution in mind? It would be great to avoid running chardet on all documents just to fix the <1% of cases where they are badly declared :(

glebourgeois commented 8 years ago

Nope, you just cannot trust your input data, that's the beauty of the web ;) And as you know, garbage in, garbage out.

By the way, on the "true web", i.e. most of the websites hand-written by novice devs, you'll have far more than 1% errors; they will have copy/pasted an example with a utf-8 declaration from the web while writing HTML code in Windows Notepad with windows-1252 encoding. And as browsers systematically detect the encoding before displaying a page, they will never know their mistake... (thanks to the legendary tolerance of browsers)
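One possible middle ground, sketched here under assumptions rather than taken from cosr-back: treat the declared encoding as a hint, accept it only if a strict decode succeeds, and fall back otherwise. The function name `decode_with_fallback` and the fallback order (utf-8, then windows-1252) are hypothetical choices for illustration. A strict decode is cheap, so full detection would only be needed for documents that fail it.

```python
from typing import Optional


def decode_with_fallback(raw: bytes, declared: Optional[str] = None) -> str:
    """Decode raw bytes, trusting the declared encoding only if it decodes cleanly.

    Hypothetical helper: the candidate order is an assumption, not project code.
    """
    candidates = []
    if declared:
        candidates.append(declared)
    # Assumed fallbacks: utf-8 first, then windows-1252, matching the
    # "declared utf-8, written in Notepad" scenario described above.
    candidates += ["utf-8", "windows-1252"]

    for encoding in candidates:
        try:
            # Strict decoding raises on the first invalid byte sequence.
            return raw.decode(encoding)
        except (UnicodeDecodeError, LookupError):
            # LookupError covers unknown or misspelled encoding labels.
            continue

    # Last resort: drop undecodable bytes rather than emit U+FFFD.
    return raw.decode("utf-8", errors="ignore")


# A page declared as utf-8 but actually saved as windows-1252.
page = "café".encode("windows-1252")
text = decode_with_fallback(page, declared="utf-8")
```

This only catches mismatches that make the strict decode fail. A single-byte encoding such as latin-1 accepts every byte sequence, so a wrong latin-1 declaration would pass unnoticed and would still need statistical detection.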