Douban extraction is not very accurate - Githubissues

rupertqin commented 7 years ago

https://runkit.com/rupertqin/588244f4e95a5d001403d7f6

Tjatse commented 7 years ago

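Thanks for the feedback. First, read-art's automatic article extraction only handles roughly 80% of article links on the web (based on sampling from the 1.5 million+ articles I currently process per day). The rest need advanced rules customized for each domain, for example: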

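After hundreds of millions of real-world runs and the tuning that came with them, read-art excludes tags whose attributes commonly mark non-article content, for example an id or class containing: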

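link: commonly used in the class or id of links or lists
related: commonly used in the class or id of related-news blocks

The main-content tags on Douban event pages happen to match these rules (<div id="related_info">... and <div id="link-report">...). To keep automatic extraction from treating them as unlikely content, you can work around it with the following code: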
```
var read = require('read-art')
read.use(function () {
  // Treat these ids as possible article content instead of "unlikely" nodes
  this.regexps.maybe(/link-report|related_info/)
})
```
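
To illustrate why the override helps, here is a minimal sketch of a Readability-style filter. It assumes read-art skips a node when its id or class matches an "unlikely" pattern and no "maybe" pattern rescues it. The function `isSkipped` and the `unlikely` pattern are hypothetical and simplified, not read-art's actual internals.

```javascript
// Hypothetical, simplified patterns; read-art's real "unlikely" list is longer.
const unlikely = /link|related/i
// The override from the snippet above.
const maybe = /link-report|related_info/i

// A node is skipped when its id/class looks unlikely
// and no "maybe" pattern marks it as a possible candidate.
function isSkipped (idOrClass, unlikelyRe, maybeRe) {
  return unlikelyRe.test(idOrClass) && !(maybeRe && maybeRe.test(idOrClass))
}

console.log(isSkipped('related_info', unlikely, null))  // true: dropped without the override
console.log(isSkipped('related_info', unlikely, maybe)) // false: kept with the override
console.log(isSkipped('link-report', unlikely, maybe))  // false: kept with the override
console.log(isSkipped('related-news', unlikely, maybe)) // true: still dropped
```

The override is deliberately narrow: only the two Douban ids are rescued, so ordinary related-news blocks are still filtered out.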

Tjatse commented 7 years ago

RunKit test case:

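```
const read = require("read-art")

// Keep Douban's main-content ids from being discarded as "unlikely" nodes
read.use(function () {
  this.regexps.maybe(/link-report|related_info/)
})

// Text that starts with "related articles/links/news/reading/search/recommendations"
const regRelated = /^相关(文章|链接|新闻|阅读|搜索|推荐)/
// Sentence delimiters; the `g` flag makes match() return every occurrence
const regTextDelimiter = /[。.]/g
// Decode common HTML entities, then collapse and trim whitespace
let trimBlank = (str) => {
  return str.replace(/&quot;/g, '"')
    .replace(/&amp;/g, '&')
    .replace(/&lt;/g, '<')
    .replace(/&gt;/g, '>')
    .replace(/&nbsp;/g, ' ')
    .replace(/(\u00A0| |&nbsp;)+/g, ' ')
    .replace(/^(\u00A0| |&nbsp;)+/, '')
    .replace(/(\u00A0| |&nbsp;)+$/, '')
}

// Top-level await works in RunKit; elsewhere, wrap this in an async function
const Article = await read({
  uri: 'https://www.douban.com/event/28091869/',
  output: {
    type: 'json',
    stripSpaces: true
  },
  minTextLength: 0,
  minParagraphs: 1,
  scoreRule: (node) => {
    let txt = trimBlank(node.text() || '')
    // Heavily penalize "related ..." blocks
    if (txt && regRelated.test(txt)) {
      return -100
    }
    // Otherwise award 5 points per sentence delimiter
    let matchedPeriods = txt.match(regTextDelimiter)
    if (matchedPeriods) {
      return matchedPeriods.length * 5
    }
    return 0
  }
});

// Join the text fragments of the extracted content, one per line
Article.content.map((content) => content.type === 'text' ? content.value : '').join('\n')
```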

The options above are also my usual configuration when extracting article text with read-art (automatic extraction, not customized).
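
The scoring idea in `scoreRule` can be tried in isolation. The sketch below assumes the intent is to award 5 points per sentence delimiter, so the delimiter regex carries the `g` flag; without it, `match()` returns only the first occurrence and the score would never exceed 5. The function name `scoreText` and the sample strings are hypothetical.

```javascript
// Text that starts with "related articles/links/news/reading/search/recommendations"
const regRelated = /^相关(文章|链接|新闻|阅读|搜索|推荐)/
// Assumption: count every delimiter, hence the `g` flag
const regTextDelimiter = /[。.]/g

// Score plain text the way the scoreRule callback scores node text.
function scoreText (txt) {
  if (txt && regRelated.test(txt)) {
    return -100 // looks like a "related ..." block
  }
  const matched = txt.match(regTextDelimiter)
  return matched ? matched.length * 5 : 0
}

console.log(scoreText('相关阅读：更多同城活动。'))       // -100
console.log(scoreText('活动将于周六举行。请提前报名。')) // 10
console.log(scoreText('Douban event'))                   // 0
```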

Tjatse / node-readability

Douban extraction is not very accurate #41