amoilanen / js-crawler

Web crawler for Node.JS
MIT License
253 stars 55 forks

How to assign encoding of response content? #26

Open winglight opened 8 years ago

winglight commented 8 years ago

I am getting the wrong charset for response content from a non-UTF-8 web page. Here is an example URL: http://www.cartoomad.com/comic/276400012051002.html

amoilanen commented 8 years ago

Hi, thank you for filing the issue; I will take a look. Normally we read the encoding from the HTTP headers, but in this case that may not work, so we can think of alternatives.

winglight commented 8 years ago

I checked the response for this URL: its headers contain no encoding value, so the current code cannot determine the correct encoding. An alternative might be to check the meta tags in the response body, such as: <meta http-equiv="Content-Type" content="text/html; charset=big5">
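The header-then-meta fallback described here could look roughly like the sketch below. It is not js-crawler code: `charsetFromContentType` and `detectCharset` are hypothetical helper names, `headers` is assumed to be a plain object with lowercased keys (as Node's `http` module provides), and `body` is assumed to be a raw `Buffer`.

```javascript
// Hypothetical helpers, not part of js-crawler's API.

// Pull "charset=..." out of a Content-Type value, if present.
function charsetFromContentType(value) {
  const match = /charset\s*=\s*["']?([\w.:-]+)/i.exec(value || '');
  return match ? match[1].toLowerCase() : null;
}

// Prefer the HTTP header; otherwise look for a <meta> declaration in the body.
function detectCharset(headers, body) {
  const fromHeader = charsetFromContentType(headers['content-type']);
  if (fromHeader) return fromHeader;

  // Only the first 1024 bytes are inspected. Decoding them as latin1 maps
  // every byte to one character, so the ASCII markup survives regardless of
  // the page's real encoding.
  const head = body.subarray(0, 1024).toString('latin1');
  const meta = /<meta[^>]+charset\s*=\s*["']?([\w.:-]+)/i.exec(head);
  return meta ? meta[1].toLowerCase() : null;
}
```

A `null` result means neither source declared a charset, and the caller has to pick a default.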

tibetty commented 8 years ago

@winglight In this case you can use the indexOf function of Buffer (and other string-analysis functions) to extract the encoding from the body. Note that Node.js natively supports only a few character encodings, and big5 is not among them, so you will need a decoder/transcoder before processing big5-encoded content, since your code most likely works with UTF-8.
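As a rough illustration of this suggestion, the sketch below searches the raw bytes with `Buffer.indexOf` and then asks `Buffer.isEncoding` whether Node can decode the result natively. `sniffCharsetBytes` is a hypothetical name, and the search is deliberately simple: it is case-sensitive and stops at the first `charset=` it finds.

```javascript
// Hypothetical helper: find the first "charset=" in the raw bytes and read
// the token that follows it, without decoding the whole body.
function sniffCharsetBytes(body) {
  const marker = 'charset=';
  const start = body.indexOf(marker); // byte offset, or -1 if absent
  if (start === -1) return null;

  let pos = start + marker.length;
  // Skip an optional opening quote (" is 0x22, ' is 0x27).
  if (body[pos] === 0x22 || body[pos] === 0x27) pos++;

  // Advance while the byte is a plausible charset-name character.
  let end = pos;
  while (end < body.length && /[\w.:-]/.test(String.fromCharCode(body[end]))) {
    end++;
  }
  return end > pos ? body.toString('ascii', pos, end).toLowerCase() : null;
}

const sample = Buffer.from('<meta http-equiv="Content-Type" content="text/html; charset=big5">');
const charset = sniffCharsetBytes(sample);

// Buffer.isEncoding reports whether buf.toString(name) would work.
const nativelySupported = Buffer.isEncoding(charset);
console.log(charset, nativelySupported);
```

When `Buffer.isEncoding` returns `false`, `buf.toString(charset)` is not an option and a separate decoder is needed.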

ngouy commented 7 years ago

Same problem here with a page that declares charset=iso-8859-1.

aidik commented 5 years ago

+1

tibetty commented 5 years ago

When you don't set the encoding, the crawler does no decoding for you (Node.js's Buffer natively handles only a few encodings, such as UTF-8, UTF-16LE, Latin-1 and ASCII, so there is little choice). In this case you can treat the received body as a Buffer holding the raw bytes in the page's encoding, and use a third-party decoding tool such as node-iconv or iconv-lite to convert it to a Unicode string, which JavaScript supports natively. After that you can process the converted string as usual.
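A minimal sketch of that conversion step follows. To stay dependency-free it swaps in Node's built-in `TextDecoder` for the third-party tools named above; this assumes a Node build with full ICU data, without which labels such as `big5` are rejected. With iconv-lite the analogous call would be `iconv.decode(rawBody, 'big5')`. `decodeBody` and `rawBody` are hypothetical names, not js-crawler API.

```javascript
// Hypothetical helper: turn raw response bytes into a JavaScript string.
// Returns null when the runtime does not recognise the charset label.
function decodeBody(rawBody, charset) {
  try {
    // TextDecoder accepts any Uint8Array, which includes Buffer.
    return new TextDecoder(charset || 'utf-8').decode(rawBody);
  } catch (err) {
    // An unknown or unsupported label is reported as a RangeError.
    if (err instanceof RangeError) return null;
    throw err;
  }
}

// Raw bytes as they might arrive from a Big5 page and an ISO-8859-1 page.
const big5Bytes = Buffer.from([0xa4, 0xa4, 0xa4, 0xe5]);
const latin1Bytes = Buffer.from([0x63, 0x61, 0x66, 0xe9]);

console.log(decodeBody(big5Bytes, 'big5'));
console.log(decodeBody(latin1Bytes, 'iso-8859-1'));
```

Returning `null` for an unsupported label lets the caller fall back to another decoder, such as iconv-lite, instead of silently producing mojibake.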