CrawlScript / WebCollector

WebCollector is an open source web crawler framework based on Java. It provides some simple interfaces for crawling the Web; you can set up a multi-threaded web crawler in less than 5 minutes.
https://github.com/CrawlScript/WebCollector
GNU General Public License v3.0
3.07k stars 1.45k forks

Extending BreadthCrawler: Chinese text from fetched pages prints as garbled characters #108

Open linye271709915 opened 5 years ago

linye271709915 commented 5 years ago

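Inside visit():

String name = page.select("h1").text();
String content = page.select("h2").html();

System.out.println("名称" + name);
System.out.println("内容" + content);

Console output (the labels 名称 and 内容 mean "name" and "content"; they print correctly, only the page text is garbled):

名称姝h���瑁��ㄨ���ㄧО����瑁��虹�-DXDK110
内容姝h���瑁��ㄨ���ㄧО����瑁��虹�-DXDK110浜у��绠�浠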

hujunxianligong commented 5 years ago

You can call page.charset("utf-8") to set the page's actual encoding, and then run the code above.
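To illustrate why setting the charset helps, here is a minimal JDK-only sketch. It does not use WebCollector; the class name CharsetMismatchDemo and the helper decode are hypothetical. It assumes the garbling in the report comes from UTF-8 bytes being decoded with a GBK-family charset, which the output pattern suggests but the thread does not confirm.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class CharsetMismatchDemo {

    // Hypothetical helper: decode raw response bytes with an explicitly chosen charset.
    static String decode(byte[] raw, String charsetName) {
        return new String(raw, Charset.forName(charsetName));
    }

    public static void main(String[] args) {
        // "\u540d\u79f0" is 名称, written with escapes so that the
        // encoding of this source file cannot affect the demo.
        String original = "\u540d\u79f0";

        // Stand-in for the bytes a server would send for a UTF-8 page.
        byte[] raw = original.getBytes(StandardCharsets.UTF_8);

        // Wrong charset (assumed cause of the report): the text comes out garbled.
        System.out.println("decoded as GBK:   " + decode(raw, "GBK"));

        // Matching charset: the original text is recovered.
        System.out.println("decoded as UTF-8: " + decode(raw, "UTF-8"));
        System.out.println("round trip ok:    " + original.equals(decode(raw, "UTF-8")));
    }
}
```

The bytes are the same in both calls; only the charset used to decode them differs, which is the setting page.charset("utf-8") is meant to correct.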

xiejx618 commented 5 years ago

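@hujunxianligong Could cn.edu.hfut.dmic.webcollector.util.CharsetDetector#guessEncoding be changed so that when it guesses gb2312, it uses GB18030 instead? GB18030 is compatible with both GBK and GB2312. Take this page as an example: http://www.stats.gov.cn/tjsj/tjbz/tjyqhdmhcxhfdm/2018/44/4419.html It is clearly declared as gb2312, yet cn.edu.hfut.dmic.webcollector.model.Page#html() returns garbled text. A browser shows it without garbling. Calling page.charset("GB18030") also fixes it, but I would rather not set that on every page.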
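The proposed mapping could look roughly like the sketch below. It uses only the JDK and is not WebCollector's actual CharsetDetector code; the class Gb2312Fallback and the helpers normalizeCharset and decode are hypothetical names. It assumes the garbling comes from characters that exist in GBK or GB18030 but not in strict GB2312, using 镕 (U+9555) as a sample.

```java
import java.nio.charset.Charset;
import java.util.Locale;

public class Gb2312Fallback {

    // Hypothetical helper: widen a guessed "gb2312" to its superset GB18030.
    // Any other charset name is returned unchanged.
    static String normalizeCharset(String guessed) {
        if (guessed != null
                && guessed.trim().toLowerCase(Locale.ROOT).equals("gb2312")) {
            return "GB18030";
        }
        return guessed;
    }

    // Decode raw bytes with the given charset name.
    static String decode(byte[] raw, String charsetName) {
        return new String(raw, Charset.forName(charsetName));
    }

    public static void main(String[] args) {
        // U+9555 (镕) is encodable in GBK and GB18030 but absent from GB2312.
        String rare = "\u9555";
        byte[] raw = rare.getBytes(Charset.forName("GBK"));

        String guessed = "gb2312"; // what a page might declare or a detector might guess

        // Strict GB2312 decoding cannot recover the character.
        System.out.println("strict GB2312 ok: " + rare.equals(decode(raw, guessed)));

        // Decoding with the widened charset recovers it.
        System.out.println("GB18030 ok:       "
                + rare.equals(decode(raw, normalizeCharset(guessed))));
    }
}
```

Because GB18030 is a superset of GB2312, a page that really is pure GB2312 decodes the same way under either name, so the substitution should only change the result for bytes that strict GB2312 cannot decode.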