Chinese text in the page comes out garbled when extending BreadthCrawler

CrawlScript / WebCollector

WebCollector is an open source web crawler framework based on Java. It provides some simple interfaces for crawling the Web; you can set up a multi-threaded web crawler in less than 5 minutes.

https://github.com/CrawlScript/WebCollector

GNU General Public License v3.0

3.07k stars 1.45k forks

Chinese text in the page comes out garbled when extending BreadthCrawler #108

Open linye271709915 opened 5 years ago

linye271709915 commented 5 years ago

Inside visit:

String name = page.select("h1").text();
String content = page.select("h2").html();

System.out.println("名称" + name);
System.out.println("内容" + content);

Console output: 名称姝ｈ��瑁��ㄨ��ㄧО��瑁��虹�-DXDK110 内容姝ｈ��瑁��ㄨ��ㄧО��瑁��虹�-DXDK110浜у��绠�浠

hujunxianligong commented 5 years ago

You can call page.charset("utf-8") to set the page's actual encoding first, and then run the code above.
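To illustrate why setting the charset matters, here is a minimal sketch using only the JDK, not WebCollector itself. It assumes the page body is UTF-8 and was decoded with a GBK-family charset, which would match the shape of the garbled output above but has not been verified against the actual page. The class name MojibakeDemo and the sample text are made up for the example.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class MojibakeDemo {
    // Decode a raw response body with the named charset.
    static String decode(byte[] body, String charsetName) {
        return new String(body, Charset.forName(charsetName));
    }

    public static void main(String[] args) {
        // Hypothetical page text (U+4E2D U+6587), encoded as UTF-8 as a server might send it.
        byte[] body = "\u4e2d\u6587".getBytes(StandardCharsets.UTF_8);

        // Wrong charset: the same bytes turn into unrelated characters.
        System.out.println(decode(body, "GBK"));
        // Matching charset: the original text is recovered.
        System.out.println(decode(body, "UTF-8"));
    }
}
```

The bytes are identical in both calls; only the charset used for decoding differs, which is the setting that page.charset("utf-8") overrides.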

xiejx618 commented 5 years ago

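@hujunxianligong Could cn.edu.hfut.dmic.webcollector.util.CharsetDetector#guessEncoding be changed so that when it guesses gb2312, it switches straight to GB18030? GB18030 is compatible with both GBK and GB2312. Take this page, for example: http://www.stats.gov.cn/tjsj/tjbz/tjyqhdmhcxhfdm/2018/44/4419.html The page is clearly declared as gb2312, yet cn.edu.hfut.dmic.webcollector.model.Page#html() returns garbled text, while a browser renders it correctly. Calling page.charset("GB18030") also fixes it, but I would rather not have to set it for every page.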
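The requested behavior could look roughly like the sketch below. It is not WebCollector's code: CharsetNormalizer, normalize and decode are hypothetical names, and the mapping would have to be wired into the result of guessEncoding by the maintainers. It also assumes that pages labelled gb2312 can contain characters that need the wider GBK or GB18030 range, which has not been checked against the linked page.

```java
import java.nio.charset.Charset;
import java.util.Locale;

class CharsetNormalizer {
    // Widen a detected GB2312 or GBK label to GB18030; pass other names through unchanged.
    static String normalize(String detected) {
        String name = detected.trim().toLowerCase(Locale.ROOT);
        if (name.equals("gb2312") || name.equals("gbk")) {
            return "GB18030";
        }
        return detected;
    }

    // Decode the body with the normalized charset instead of the detected one.
    static String decode(byte[] body, String detected) {
        return new String(body, Charset.forName(normalize(detected)));
    }

    public static void main(String[] args) {
        // Sample text encoded as GBK, standing in for a page whose label says gb2312.
        byte[] body = "\u4e2d\u6587\u9555".getBytes(Charset.forName("GBK"));
        System.out.println(decode(body, "gb2312"));
    }
}
```

Because GB18030 decodes GB2312 and GBK two-byte sequences to the same characters, widening the label should not change pages that really are GB2312, while it avoids setting the charset by hand on each page.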