Tjatse / node-readability

Scrape/Crawl article from any site automatically. Make any web page readable, no matter Chinese or English.

Detect and use charset if not set #4

Closed midudev closed 8 years ago

midudev commented 8 years ago

Right now, utf8 is the default charset, which causes problems when the website uses a different charset. For example, this URL: http://www.elimparcial.com/EdicionEnlinea/Notas/Sonora/22092015/1010394-Firma-CPA-convenio-con-Cofemer.html

It would be great if the script detected the charset before parsing because, as you can see, it currently returns the wrong encoding for this kind of URL.

Also, and I suppose this is related, the charset option is currently unused in the code.
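
A minimal sketch of the kind of detection requested above, assuming the raw response body is available as a Buffer and the Content-Type header as a string. The helper names (`charsetFromContentType`, `charsetFromMeta`, `detectCharset`) are hypothetical and are not part of node-readability or req-fast; how those libraries actually detect the charset may differ.

```javascript
// Hypothetical helpers for illustration; not the node-readability or req-fast API.

// Extract the charset parameter from a Content-Type header value, if present.
function charsetFromContentType(headerValue) {
  const match = /charset\s*=\s*["']?([^\s;"']+)/i.exec(headerValue || '');
  return match ? match[1].toLowerCase() : null;
}

// Look for a charset declared in a <meta> tag near the start of the body.
function charsetFromMeta(bodyBuffer) {
  // latin1 maps every byte to exactly one character, so the ASCII markup
  // can be searched safely before the real encoding is known.
  const head = bodyBuffer.subarray(0, 1024).toString('latin1');
  const match = /<meta[^>]*?charset\s*=\s*["']?([^\s"'>;\/]+)/i.exec(head);
  return match ? match[1].toLowerCase() : null;
}

// Prefer the header, then the <meta> tag, then a fallback (utf-8 here).
function detectCharset(headerValue, bodyBuffer, fallback = 'utf-8') {
  return (
    charsetFromContentType(headerValue) ||
    charsetFromMeta(bodyBuffer) ||
    fallback
  );
}
```

Giving the header priority is only one possible design choice. If the server sends an incorrect header, this order returns the incorrect charset, so a caller-supplied charset option would still be needed as an override.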

Tjatse commented 8 years ago

The charset works for me, and req-fast detects the charset automatically. I'll test your URL tomorrow and figure it out.

midudev commented 8 years ago

Okay, as I can see:

So the problem, in fact, is that the charset detection is working like a charm, but the website declares the wrong charset in its header. :) Problem solved! Thanks!
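
To illustrate why a wrongly declared charset garbles the text even when detection works, here is a small sketch using only Node's built-in Buffer. The sample string and the direction of the mismatch are assumptions chosen for illustration, not the actual bytes or headers served by the site above.

```javascript
// Illustrative bytes only: "Edición" encoded as ISO-8859-1 (latin1).
const bytes = Buffer.from('Edici\u00f3n', 'latin1');

// Decoding with the charset the bytes were actually written in round-trips cleanly.
const asLatin1 = bytes.toString('latin1');

// Decoding the same bytes as UTF-8 hits an invalid sequence at the accented
// letter, which Node replaces with U+FFFD (the replacement character).
const asUtf8 = bytes.toString('utf8');

console.log(asLatin1);
console.log(asUtf8);
```

Any client that trusts the declared charset will produce the second result when the declaration does not match the bytes.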

Tjatse commented 8 years ago

Yes, you got it :)