Closed by Leidanya 7 years ago
Hey:
Actually, my spiders have lots of definitions for the non-general sites (aka spider policies), which look like:
So, here is my test case for you:
'use strict'

var $ = require('cheerio')
var read = require('read-art')

read({
  uri: 'http://www.breitbart.com/big-government/2016/11/28/somali-ohio-state-student-attack/',
  output: {
    type: 'text'
  },
  selectors: {
    content: {
      // Use the article body as the content node.
      selector: '.entry-content',
      extract: (node, options) => {
        // Strip scripts, styles, the email opt-in widgets, and ads before extracting.
        node.find('script,style,#EmailOptin,#EmailOptinM,.ad').remove()
        return read.Reader.extractProp($, node, options.output.type, options)
      }
    }
  }
}, (err, art) => {
  if (err) {
    return console.error('[ERROR]', err)
  }
  console.log(art.content)
})
Ah, I see now: scoreRule and the like are used to identify the topCandidate, and you are supposed to use a custom extract function to manipulate the child nodes within it.
This has been a great help. I'd recommend adding this to the selectors section of the README.
In addition, could you provide some more details on these policies? I understand if you cannot, since this is a private project.
Are you just passing body.reader of, say, 163.com.js into the options of the spider? I am trying to write a spider using read-art, and any design decisions and insights you have would be very helpful.
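To spell out the two steps as I understand them, here is a toy sketch. It is not read-art's actual code: plain objects stand in for DOM nodes, and `candidates`, `baseScore`, `pickTopCandidate`, and this `extract` are all hypothetical names made up for illustration.

```javascript
'use strict'

// Toy model only: plain objects stand in for DOM nodes.
// None of these names come from read-art itself.
const candidates = [
  { id: 'sidebar', baseScore: 10, children: ['links'] },
  { id: 'entry-content', baseScore: 40, children: ['p1', 'EmailOptin', 'p2', 'EmailOptinM'] }
]

// Hypothetical score rule: returns a bonus or penalty added to a node's base score.
function scoreRule (node) {
  return node.id === 'sidebar' ? -100 : 0
}

// Step 1: scoring only decides WHICH node becomes the top candidate.
function pickTopCandidate (nodes) {
  return nodes
    .map(node => ({ node, score: node.baseScore + scoreRule(node) }))
    .sort((a, b) => b.score - a.score)[0].node
}

// Step 2: a custom extract step cleans up INSIDE the chosen node.
function extract (node) {
  return node.children.filter(child => !/^EmailOptin/.test(child))
}

const top = pickTopCandidate(candidates)
const content = extract(top)
console.log(top.id, content)
```

The point of the sketch is that a score penalty can only make a whole node lose the ranking; it cannot remove an unwanted child from the node that wins, which is why the cleanup belongs in the extract step.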
Thanks again.
url: http://www.breitbart.com/big-government/2016/11/28/somali-ohio-state-student-attack/
As you can see, the EmailOptin and EmailOptinM elements are in the way of the content. I've tried a few ways to remove them with no luck; maybe I'm doing something wrong. I hope you can look at it with a fresh pair of eyes.
First, I tried calling read.use() before read(url) with the option
this.regexps.negative(/EmailOptin|EmailOptinM/);
Then I tried to use
scoreRule: function (node) {
  if (node.hasClass('desktop')) {
    return -100;
  }
}
and finally
var selectors = {
  content: {
    // selector: '.entry-content',
    skipTags: '.EmailOptin'
  },
};
None of them seems to work when requesting the URL directly. However, when I use read(html) with the HTML from content-entity, the filter/scoreRule/regex works fine.
Debug output is giving me very little information.
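A side note on the regex from the first attempt: because the pattern is unanchored, the `EmailOptinM` alternative is redundant, so that part of the pattern is unlikely to be the problem. A minimal check in plain JavaScript follows; the `ids` array is just sample input, and whether read-art tests the pattern against ids, class names, or both is an assumption here.

```javascript
'use strict'

// The pattern as written in the first attempt.
const original = /EmailOptin|EmailOptinM/
// Shorter equivalent: 'EmailOptin' already matches inside 'EmailOptinM'.
const shorter = /EmailOptin/

// Sample strings only; what read-art actually matches against is an assumption.
const ids = ['EmailOptin', 'EmailOptinM', 'entry-content']

const results = ids.map(id => ({
  id,
  original: original.test(id),
  shorter: shorter.test(id)
}))

console.log(results)
```

Both patterns agree on every sample, so `/EmailOptin/` alone would behave the same way.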