entertainyou closed this issue 8 years ago
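There was indeed a bug here: the custom scoreRule was not being executed. It is fixed now, thanks. Score generation depends entirely on the standard document flow. When the DOM tree looks like this: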
div#0
  div#1
    div#1.1
      div#1.1.1
        p#1.2.1
        p#1.2.2
        p#1.2.3
        p#1.2.4
  div#2
    div#2.1
      p#2.1.1
      p#2.1.2
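With an unusual structure like this, div#1.1.1 should have the highest score, and each parent's score decays to 1/2 of its child's. This keeps main-content extraction accurate; otherwise every run would pick a top-level node such as body or article, which would pull in a lot of redundant content.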
In my real-world crawlers I define some special scoreRule functions to handle particular sites individually, for example:
readability-rules/sohu.com.js:
module.exports = function(node){
  // Boost the node that sohu.com marks as the article body.
  if (node.attr('itemprop') == 'articleBody') {
    return 100;
  }
  return 0;
};
spider.js:
var urijs = require('urijs');
var rules = require('./readability-rules');
// ...
var uri = urijs(data.url);
read(data.url, {
  // Pick the rule registered for this URL's domain, if any.
  scoreRule: rules[uri.domain()]
}, function(){});
That is, with the rule inlined:
read('http://news.sohu.com/20151228/n432833902.shtml', {
  timeout : 15000,
  minTextLength: 0,
  scoreRule: function(node){
    if (node.attr('itemprop') == 'articleBody') {
      return 100;
    }
    return 0;
  }
}, function(err, art, options, resp){
  if (err) {
    console.log('[ERROR]', err.message);
    return;
  }
  if (!art) {
    console.log('[WARNING] article not exist');
    return;
  }
  console.log('[INFO]', 'title:', art.title);
  console.log('[INFO]', 'content:', art.content);
});
This raises the score of the parent node you need and makes the extracted content more precise.
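For the per-domain lookup used in spider.js, one possible shape for the rules index is sketched below. The `rules` map, `ruleFor`, and `fakeNode` are all hypothetical names for illustration; the real readability-rules module is not shown in this thread.

```javascript
// Hypothetical stand-in for readability-rules/index.js: domain -> scoreRule.
var rules = {
  'sohu.com': function (node) {
    return node.attr('itemprop') == 'articleBody' ? 100 : 0;
  }
};

// Returns the rule registered for a domain, or undefined when there is none,
// so that no custom rule is passed for unknown sites.
function ruleFor(domain) {
  return Object.prototype.hasOwnProperty.call(rules, domain)
    ? rules[domain]
    : undefined;
}

// Minimal fake node exposing only attr(), for demonstration purposes.
function fakeNode(attrs) {
  return { attr: function (name) { return attrs[name]; } };
}

var rule = ruleFor('sohu.com');
console.log(rule(fakeNode({ itemprop: 'articleBody' }))); // 100
console.log(rule(fakeNode({})));                          // 0
console.log(ruleFor('example.com'));                      // undefined
```

Keeping one rule per domain means a site redesign only requires touching that site's entry.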
Thanks for the quick reply.
Adding a score rule will work, but it will need updating whenever a site's HTML structure changes.
Do you think it's reasonable to add an option to tweak the 1/2 value?
"In my real-world crawlers I define some special scoreRule functions to handle particular sites individually"
Is the spider publicly available?
BTW, this is a nice project.
That's a marvelous idea. I'll add an option such as damping in the next release:
var dampedScore = score / (isFinite(options.damping) ? options.damping : 2);
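To see how that expression behaves for different option values, here is a small sketch. `dampScore` is a hypothetical wrapper around the quoted line, not a function exported by the library.

```javascript
// Hypothetical wrapper around the expression quoted above.
function dampScore(score, options) {
  options = options || {};
  // Falls back to 2 when damping is missing or not a finite number.
  return score / (isFinite(options.damping) ? options.damping : 2);
}

console.log(dampScore(100, {}));                // 50 (default halving)
console.log(dampScore(100, { damping: 1.25 })); // 80 (gentler decay)
console.log(dampScore(100, { damping: 'x' }));  // 50 (non-numeric falls back)
```

Note that a damping of 0 passes the `isFinite` check and divides by zero, so callers may want to validate the value before passing it in.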
Our spiders serve enterprise customers only, but there is a search engine available for trying it out.
I appreciate your advice!
Looking forward to the new release :)
Fixed in v0.4.3-rc1!
Some parts are missing (for example, "12月23日,在塞尔维亚诺维萨德,塞尔维亚总理武契奇在匈塞铁路塞尔维亚段启动仪式上致辞。"); the issue seems to be related to score distribution.
https://github.com/Tjatse/node-readability/blob/master/lib/reader.js#L539
Assume the DOM looks like the following:
The second div may get a higher score than the root div.
Why is the score halved when it is assigned to the grandparent (can we use a higher rate)?