bda-research / node-crawler

Web Crawler/Spider for NodeJS + server-side jQuery ;-)
MIT License

Queue HTML code string directly with non-UTF-8 charset attribute #378

Closed: tautcony closed this issue 10 months ago

tautcony commented 3 years ago

When an HTML string is queued directly and that string declares a non-UTF-8 encoding (because the original HTML file used that encoding), self._parseCharset(response) returns the declared encoding and the crawler tries to decode the already-decoded string again, which garbles the text. To get correct output, you currently have to pass incomingEncoding or encoding in the options, or pass an undecoded Buffer to the html property (if the html option is mostly used for tests, passing a Buffer should be a rare case).
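
To illustrate why decoding under the wrong charset garbles text, here is a minimal sketch that uses only Node's built-in Buffer. It substitutes Latin-1 for EUC-JP (an assumption, since Buffer has no EUC-JP support), so the garbled characters differ from the ones in the POC, but the failure is of the same kind: bytes are interpreted under a charset that does not match how they were produced.

```javascript
const text = 'かきくけこ';

// Bytes as they would appear in a UTF-8 encoded file (3 bytes per character).
const utf8Bytes = Buffer.from(text, 'utf8');

// Decoding with the matching charset restores the original text.
const correct = utf8Bytes.toString('utf8');

// Decoding the same bytes as Latin-1 turns every byte into its own character.
const garbled = utf8Bytes.toString('latin1');

console.log(correct === text); // true
console.log(garbled === text); // false
console.log(garbled.length);   // 15
```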

Would it be more appropriate to skip the re-decoding step when a string is passed to the html property?
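
The proposal could look roughly like the sketch below. The names bodyToString and defaultDecode are hypothetical and are not part of node-crawler; the default decoder uses Node's global TextDecoder purely for illustration, whereas the crawler itself would presumably keep using its own decoder.

```javascript
// Hypothetical helper for illustration only; not part of node-crawler's API.
function defaultDecode(buffer, charset) {
    // Assumption: the global TextDecoder stands in for the crawler's decoder.
    return new TextDecoder(charset).decode(buffer);
}

function bodyToString(body, declaredCharset, decode = defaultDecode) {
    if (typeof body === 'string') {
        // Already decoded: the charset declared in the markup describes the
        // original file's bytes, not this JavaScript string, so return it as-is.
        return body;
    }
    // Raw bytes: decode them with the declared charset.
    return decode(body, declaredCharset);
}

const sampleHtml = '<title>かきくけこ</title><meta charset="EUC-JP">';

// A string is returned untouched, whatever charset the markup declares.
const fromString = bodyToString(sampleHtml, 'EUC-JP');

// A Buffer is decoded with the charset it was actually encoded in.
const fromBuffer = bodyToString(Buffer.from(sampleHtml, 'utf16le'), 'utf-16le');

console.log(fromString === sampleHtml); // true
console.log(fromBuffer === sampleHtml); // true
```

With a check like this, the first c.queue({ html }) call in the POC below would no longer need incomingEncoding or encoding as a workaround, while Buffer input would still be decoded.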

POC:

const Crawler = require('crawler');
const Iconv = require('iconv-lite');

const html = `<html>
<head>
    <title>アイウエオ・かきくけこ</title>
    <meta http-equiv="Content-Type" content="text/html; charset=EUC-JP">
</head>
<body></body>
</html>`;

const htmlBuffer = Iconv.encode(html, 'EUC-JP');

const c = new Crawler({
    callback: (error, res, done) => {
        console.log(res.$('title').text());
        done();
    }
});

// output: △Θ��KMOQS
c.queue({ html });
// output: アイウエオ・かきくけこ
c.queue({ html, incomingEncoding: 'utf-8' });
// output: アイウエオ・かきくけこ
c.queue({ html, encoding: null });
// output: アイウエオ・かきくけこ
c.queue({ html: htmlBuffer });

output:

Iconv-lite warning: decode()-ing strings is deprecated. Refer to https://github.com/ashtuchkin/iconv-lite/wiki/Use-Buffers-when-decoding
△Θ��KMOQS
アイウエオ・かきくけこ
アイウエオ・かきくけこ
アイウエオ・かきくけこ

mike442144 commented 10 months ago

Passing a string when queuing is a rare case, so I would not spend more effort on this.