Tjatse / node-readability

Scrape/Crawl article from any site automatically. Make any web page readable, no matter Chinese or English.

Scraping Douban is not very accurate #41

Closed rupertqin closed 6 years ago

rupertqin commented 7 years ago

https://runkit.com/rupertqin/588244f4e95a5d001403d7f6

Tjatse commented 7 years ago

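Thanks for the feedback. First, read-art's automatic article extraction fits only about 80% of the article links on the web (based on sampling the 1.5M+ articles I currently process per day). The rest need advanced rules customized for the individual domains, for example: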

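During automatic extraction, and after more than a hundred million real-world uses and rounds of learning, read-art excludes the marker attributes of some common non-article tags, for example an id or class containing: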

Tjatse commented 7 years ago

runkit test case:

const read = require("read-art")

read.use(function () {
  this.regexps.maybe(/link-report|related_info/)
})

const regRelated = /^相关(文章|链接|新闻|阅读|搜索|推荐)/
const regTextDelimiter = /[。.]/g // global flag so match() returns every delimiter, not only the first
// Decode common HTML entities, then collapse and trim runs of spaces
let trimBlank = (str) => {
  return str.replace(/&quot;/g, '"')
    .replace(/&amp;/g, '&')
    .replace(/&lt;/g, '<')
    .replace(/&gt;/g, '>')
    .replace(/&nbsp;/g, ' ')
    .replace(/(\u00A0| |&nbsp;)+/g, ' ')
    .replace(/^(\u00A0| |&nbsp;)+/, '')
    .replace(/(\u00A0| |&nbsp;)+$/, '')
}

const Article = await read({
  uri: 'https://www.douban.com/event/28091869/',
  output: {
    type: 'json',
    stripSpaces: true
  },
  minTextLength: 0,
  minParagraphs: 1,
  scoreRule: (node) => {
    let txt = trimBlank(node.text() || '')
    if (txt && regRelated.test(txt)) {
      return -100
    }
    let matchedPeriods = txt.match(regTextDelimiter)
    if (matchedPeriods) {
      return matchedPeriods.length * 5
    }
    return 0
  }
});

Article.content.map((content) => content.type === 'text' ? content.value : '').join('\n')

The options above are also my usual configuration when crawling articles with read-art (automatic extraction, not customized).
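
To try the scoring heuristic without fetching a page, here is a minimal standalone sketch. `scoreText` is a hypothetical helper, not part of read-art: it takes a plain string instead of a node, and it uses a regex with the `g` flag on the assumption that every sentence delimiter is meant to count toward the score.

```javascript
// Hypothetical standalone version of the scoreRule logic (not a read-art API).
const regRelated = /^相关(文章|链接|新闻|阅读|搜索|推荐)/
// Assumption: the `g` flag is intended, so match() returns all delimiters.
const regTextDelimiter = /[。.]/g

function scoreText(text) {
  const txt = (text || '').trim()
  // Blocks that start with a "related articles/links/..." heading are penalized.
  if (txt && regRelated.test(txt)) {
    return -100
  }
  // Otherwise, each sentence delimiter (Chinese or Latin full stop) adds 5 points.
  const matched = txt.match(regTextDelimiter)
  return matched ? matched.length * 5 : 0
}

console.log(scoreText('相关阅读:其他活动'))
console.log(scoreText('第一句。第二句。'))
console.log(scoreText('no delimiter here'))
```

A block that opens with a "related" heading is pushed far down, while ordinary prose gains score with each sentence it contains.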