Tjatse / node-readability

Scrape/Crawl article from any site automatically. Make any web page readable, no matter Chinese or English.

The result is not perfect on http://news.sohu.com/20151228/n432833902.shtml #12

Closed entertainyou closed 8 years ago

entertainyou commented 8 years ago

Some parts are missing (for example "12月23日,在塞尔维亚诺维萨德,塞尔维亚总理武契奇在匈塞铁路塞尔维亚段启动仪式上致辞。"); the issue seems to be related to how scores are distributed.

https://github.com/Tjatse/node-readability/blob/master/lib/reader.js#L539

Assume the DOM looks like this:

div
  div
    p 
    p
  div
    p

The second div may end up with a higher score than the root div.

Why is the score halved when it is assigned to the grandparent? Could we use a higher rate?

Tjatse commented 8 years ago

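There was indeed a bug here: the custom scoreRule was not being executed. It has been fixed, thanks. Score generation depends entirely on the standard document flow. When the DOM tree is structured like this: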

div#0
  div#1
    div#1.1
      div#1.1.1
        p#1.2.1
        p#1.2.2
        p#1.2.3
        p#1.2.4
  div#2
    div#2.1
      p#2.1.1
      p#2.1.2

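With an irregular structure like this, div#1.1.1 should have the highest score, and each ancestor's score decays in turn to 1/2 of its child's. This keeps the article-body extraction accurate; otherwise a top-level node such as body or article would be picked every time, which would pull in a lot of redundant content.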
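The decay described here can be illustrated with a small standalone sketch. It is not the code in lib/reader.js: the `el` and `distribute` helpers and the base score of 1 per paragraph are assumptions made for illustration, and the rule modelled is simply "each p gives its base score to its parent, and every ancestor above that receives half of what the level below received".

```javascript
// Hypothetical helper: builds a plain tree node and wires up parent links.
function el(tag, id, children) {
  var n = { tag: tag, id: id, score: 0, parent: null, children: children || [] };
  n.children.forEach(function (c) { c.parent = n; });
  return n;
}

// Assumed rule: every <p> adds `base` to its parent, and the share is
// halved for each further ancestor on the way up to the root.
function distribute(root, base) {
  var all = [];
  (function walk(n) {
    all.push(n);
    if (n.tag === 'p') {
      var share = base;
      for (var a = n.parent; a; a = a.parent) {
        a.score += share;
        share /= 2;
      }
    }
    n.children.forEach(walk);
  })(root);
  return all;
}

// The tree from the comment above, with the same ids.
var tree = el('div', '0', [
  el('div', '1', [
    el('div', '1.1', [
      el('div', '1.1.1', [
        el('p', '1.2.1'), el('p', '1.2.2'), el('p', '1.2.3'), el('p', '1.2.4')
      ])
    ])
  ]),
  el('div', '2', [
    el('div', '2.1', [el('p', '2.1.1'), el('p', '2.1.2')])
  ])
]);

var scores = {};
distribute(tree, 1).forEach(function (n) {
  if (n.tag === 'div') scores['div#' + n.id] = n.score;
});
console.log(scores);
```

Under these assumptions div#1.1.1 collects 4, div#1.1 and div#2.1 collect 2 each, and the root div#0 collects only 1, so the innermost container wins rather than the top-level node.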

In my actual crawlers I define special scoreRules to handle such cases individually, for example:

readability-rules/sohu.com.js:

module.exports = function(node){
    if (node.attr('itemprop') == 'articleBody') {
      return 100;
    }
    return 0;
};

spider.js:

var urijs = require('urijs');
var rules = require('./readability-rules');
// ...
var uri = urijs(data.url);
read(data.url, {
  scoreRule: rules[uri.domain()]
}, function(){});

That is:

read('http://news.sohu.com/20151228/n432833902.shtml', {
  timeout  : 15000,
  minTextLength: 0,
  scoreRule: function(node){
    // Boost the node that the page marks as the article body.
    if (node.attr('itemprop') == 'articleBody') {
      return 100;
    }
    return 0;
  }
}, function(err, art, options, resp){
  if (err) {
    console.log('[ERROR]', err.message);
    return;
  }
  if (!art) {
    console.log('[WARNING] article not exist');
    return;
  }

  console.log('[INFO]', 'title:', art.title);
  console.log('[INFO]', 'content:', art.content);
});

This raises the score of the parent node you need and makes the extracted article body more precise.
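The per-domain lookup used in spider.js could be organised roughly as follows. This is only a sketch: the `rules` table, `domainOf`, `ruleFor` and `defaultRule` are hypothetical names, the domain extraction is a naive "last two host labels" stand-in for `uri.domain()`, and falling back to a rule that returns 0 is an assumption rather than documented library behaviour.

```javascript
// Hypothetical rule table keyed by domain; the sohu.com rule is the one
// shown above.
var rules = {
  'sohu.com': function (node) {
    return node.attr('itemprop') == 'articleBody' ? 100 : 0;
  }
};

// Assumed fallback for sites without a dedicated rule.
function defaultRule() {
  return 0;
}

// Naive domain extraction: keeps the last two labels of the host name.
// This is wrong for suffixes such as .com.cn and is for illustration only.
function domainOf(url) {
  return new URL(url).hostname.split('.').slice(-2).join('.');
}

// Picks the rule for a URL, or the fallback when none is registered.
function ruleFor(url) {
  var domain = domainOf(url);
  return Object.prototype.hasOwnProperty.call(rules, domain)
    ? rules[domain]
    : defaultRule;
}

var sohuRule = ruleFor('http://news.sohu.com/20151228/n432833902.shtml');
var otherRule = ruleFor('http://example.org/some/article');
```

Keeping one small rule per site confines a layout change on that site to a single file, at the cost of having to maintain those files.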

entertainyou commented 8 years ago

Thanks for the quick reply.

Adding a score rule will work, but it will need updating whenever a site's HTML structure changes.

Do you think it's reasonable to add an option to tweak the 1/2 value?

"In my actual crawlers I define special scoreRules to handle such cases individually"

Is the spider publicly available?

BTW, this is a nice project.

Tjatse commented 8 years ago

That's a marvelous idea, and I'll add an option such as damping in the next release:

var dampedScore = score / (isFinite(options.damping) ? options.damping : 2); 
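To make the behaviour of that expression concrete, here is a small sketch that wraps it in a function and compares it with a stricter variant. The wrapper names `dampedScore` and `dampedScoreStrict` and the extra validation are assumptions for illustration, not the released implementation.

```javascript
// The expression from the comment above, wrapped in a function.
function dampedScore(score, options) {
  return score / (isFinite(options.damping) ? options.damping : 2);
}

// Hypothetical stricter variant: only accepts a positive finite number,
// otherwise falls back to the default divisor of 2.
function dampedScoreStrict(score, options) {
  var d = options && options.damping;
  var valid = typeof d === 'number' && isFinite(d) && d > 0;
  return score / (valid ? d : 2);
}

var byDefault = dampedScore(8, {});               // damping missing, divisor 2
var custom = dampedScore(8, { damping: 4 });      // divisor 4
var withNull = dampedScore(8, { damping: null }); // global isFinite(null) is true
var strictNull = dampedScoreStrict(8, { damping: null });

console.log(byDefault, custom, withNull, strictNull);
```

Because the global `isFinite` coerces its argument, a `damping` of `null` or `0` passes the check and the division yields `Infinity`; the stricter variant falls back to 2 in those cases. A divisor closer to 1 lets ancestors keep more of their children's score, which is the adjustment requested above.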

Our spiders serve enterprise customers only, but there is a search engine available if you want to try the user experience.

I appreciate your advice!

entertainyou commented 8 years ago

Looking forward to the new release :)

Tjatse commented 8 years ago

Fixed in v0.4.3-rc1!