winglight opened this issue 8 years ago
Hi, thank you for filing the issue, I will take a look. Normally we would read the encoding from the HTTP headers, but maybe in this case it does not quite work and we can think of alternatives.
I checked the response from this URL; it had no encoding value in the response headers, so the current code can't get the correct encoding. An alternative might be to check the meta tags in the response body, such as:
<meta http-equiv="Content-Type" content="text/html; charset=big5">
@winglight in this case, you can use Buffer's indexOf (and its other string-analysis functions) to extract the encoding from the body. Note that Node.js supports only a few character encodings by default, and big5 is not among them, so you may need a third-party decoder/transcoder before processing big5-encoded content, since your code is most likely working with utf-8.
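A minimal sketch of that sniffing approach, assuming the raw body is available as a Buffer; the helper name sniffCharset and the parsing details are illustrative only, not part of the crawler's API:

```js
// Locate the charset declaration in the raw bytes without decoding the
// whole body first. Meta markup is ASCII, so Buffer#indexOf on a plain
// string and a latin1 view of a small window are safe here.
function sniffCharset(body) {
  const at = body.indexOf('charset=');
  if (at === -1) return null;
  // Read a short ASCII-safe window after the match ('charset=' is 8 bytes).
  const tail = body.slice(at + 8, at + 40).toString('latin1');
  const m = tail.match(/^["']?([\w-]+)/); // trim optional quotes
  return m ? m[1].toLowerCase() : null;   // e.g. 'big5'
}
```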
Same problem here with a page that contains charset=iso-8859-1.
+1
When you don't set the encoding, the crawler will not do any decoding work for you (Node.js itself supports little beyond UTF-8/16 and ASCII, so there is no better option). In this case the received body can be treated as a Buffer containing the raw bytes in the page's original encoding, and what you can do is use third-party decoding tools like node-iconv or iconv-lite to convert it to the Unicode String type that JavaScript supports; after that you can process the converted string in the manner you are accustomed to.
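A minimal sketch of that Buffer-to-String conversion with iconv-lite (npm install iconv-lite); the sample bytes and variable names are illustrative, not taken from the crawler:

```js
const iconv = require('iconv-lite');

// `rawBody` stands in for the Buffer the crawler hands back when no
// encoding is set; these bytes are the big5 encoding of "中文".
const rawBody = Buffer.from([0xa4, 0xa4, 0xa4, 0xe5]);
const text = iconv.decode(rawBody, 'big5'); // -> '中文'
console.log(text); // process the decoded string as usual
```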
I got the wrong charset from the response content of a non-UTF-8 web page. Here's an example URL: http://www.cartoomad.com/comic/276400012051002.html