Closed rupertqin closed 6 years ago
Thanks for the feedback.
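First, read-art's smart content extraction only handles roughly 80% of article links on the web (based on sampling the 1.5 million+ articles I currently process per day). The rest need advanced rules customized for individual domains, for example: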
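Over the course of smart extraction, which has now been through hundreds of millions of real-world runs and tuning, read-art has come to exclude tags carrying common non-content markers, for example an id or class containing certain keywords (here: <div id="related_info">... & <div id="link-report">...). To keep these from being classified as unlikely content, you can work around it with the following code: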
var read = require('read-art')
read.use(function () {
  // Treat these id/class values as possible content instead of "unlikely"
  this.regexps.maybe(/link-report|related_info/)
})
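To make the effect of that call easier to picture, here is a minimal standalone sketch of the usual readability-style check: a node is dropped when its id/class looks "unlikely" and no "maybe" pattern rescues it. The pattern lists and the helper names `extendMaybe` and `isDropped` are assumptions made up for illustration; they are not read-art's actual internals.

```javascript
// Hypothetical patterns for illustration only; not read-art's real lists.
const unlikely = /comment|related|link|sidebar|footer/i;
let maybe = /article|body|content|main/i;

// Merge extra alternatives into the "maybe" pattern, similar in spirit
// to what this.regexps.maybe(...) is used for in the snippet above.
function extendMaybe(extra) {
  maybe = new RegExp(maybe.source + '|' + extra.source, 'i');
}

// A node is dropped only if it looks unlikely and nothing rescues it.
function isDropped(idAndClass) {
  return unlikely.test(idAndClass) && !maybe.test(idAndClass);
}

const before = isDropped('link-report');
extendMaybe(/link-report|related_info/);
const after = isDropped('link-report');

console.log(before, after); // true false
```

Under these assumptions, a purely "unlikely" name such as sidebar is still dropped after the extension, so the rescue stays limited to the two markers that were added.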
runkit test case:
const read = require("read-art")
read.use(function () {
  this.regexps.maybe(/link-report|related_info/)
})
const regRelated = /^相关(文章|链接|新闻|阅读|搜索|推荐)/
const regTextDelimiter = /[。.]/g // g flag so match() returns every delimiter, not only the first
// Decode common HTML entities, then collapse and trim blank characters
let trimBlank = (str) => {
  return str.replace(/&quot;/g, '"')
    .replace(/&amp;/g, '&')
    .replace(/&lt;/g, '<')
    .replace(/&gt;/g, '>')
    .replace(/&nbsp;/g, ' ')
    .replace(/(\u00A0| | )+/g, ' ')
    .replace(/^(\u00A0| | )+/, '')
    .replace(/(\u00A0| | )+$/, '')
}
const Article = await read({
  uri: 'https://www.douban.com/event/28091869/',
  output: {
    type: 'json',
    stripSpaces: true
  },
  minTextLength: 0,
  minParagraphs: 1,
  scoreRule: (node) => {
    let txt = trimBlank(node.text() || '')
    // Penalize "related articles/links/..." blocks
    if (txt && regRelated.test(txt)) {
      return -100
    }
    // Reward text that contains sentence delimiters
    let matchedPeriods = txt.match(regTextDelimiter)
    if (matchedPeriods) {
      return matchedPeriods.length * 5
    }
    return 0
  }
});
// Keep only the text parts of the extracted content
Article.content.map((content) => content.type === 'text' ? content.value : '').join('\n')
The options above are also my usual configuration when crawling article content with read-art (smart extraction, not customized).
https://runkit.com/rupertqin/588244f4e95a5d001403d7f6
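For anyone who wants to try the scoring idea without fetching a page, here is a small standalone sketch of the same heuristic applied to plain strings: penalize text that starts like a "related ..." heading, and reward text by the number of sentence delimiters it contains. The function name `scoreText` is hypothetical, `trim()` stands in for the fuller blank handling, and the `g` flag is used so that `match` returns every delimiter.

```javascript
const regRelated = /^相关(文章|链接|新闻|阅读|搜索|推荐)/;
const regTextDelimiter = /[。.]/g;

// Hypothetical helper: scores plain text instead of a DOM node.
function scoreText(rawText) {
  const txt = (rawText || '').trim();
  // Headings such as "相关阅读" mark non-article blocks.
  if (txt && regRelated.test(txt)) {
    return -100;
  }
  // Each sentence delimiter adds 5 points.
  const matchedPeriods = txt.match(regTextDelimiter);
  if (matchedPeriods) {
    return matchedPeriods.length * 5;
  }
  return 0;
}

console.log(scoreText('相关阅读:更多活动'));   // -100
console.log(scoreText('第一句。第二句。'));     // 10
console.log(scoreText('no delimiter here'));  // 0
```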