regex ignored? - Githubissues

Tjatse / node-readability

Scrape/Crawl article from any site automatically. Make any web page readable, no matter Chinese or English.


regex ignored? #40

Leidanya closed this issue 7 years ago

Leidanya commented 7 years ago

url: http://www.breitbart.com/big-government/2016/11/28/somali-ohio-state-student-attack/

As you can see, the EmailOptin and EmailOptinM blocks end up in the middle of the extracted content. I've tried a few ways to remove them with no luck; maybe I'm doing something wrong. I hope you can look at it with a fresh pair of eyes.

First, I tried calling read.use() before read(url) with the option this.regexps.negative(/EmailOptin|EmailOptinM/);

Then I tried a scoreRule: function(node){ if (node.hasClass('desktop')) { return -100; } }

And finally, a selectors option:

var selectors = {
  content: {
    // selector: '.entry-content',
    skipTags: '.EmailOptin'
  }
};

None of them seems to work when the URL is requested directly. However, when I call read(html) with the HTML from content-entity, the filter, scoreRule, and regex all work fine.

The debug output gives me very little information.
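A side note on the pattern in the first attempt, sketched in plain JavaScript with no read-art calls: because the regex is unanchored, `EmailOptin` already matches any string containing `EmailOptinM`, so the second alternative adds nothing. The sample strings below are made up for illustration, and this does not explain why the option appeared to be ignored.

```javascript
'use strict'

// The pattern from the first attempt, and a shorter equivalent.
const original = /EmailOptin|EmailOptinM/
const shorter = /EmailOptin/

// Hypothetical id/class strings, for illustration only.
const samples = ['EmailOptin', 'EmailOptinM', 'entry-content', 'desktop EmailOptinM']

// Both patterns agree on every sample, so the second alternative is redundant.
const results = samples.map((s) => ({
  value: s,
  original: original.test(s),
  shorter: shorter.test(s)
}))

console.log(results)
```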

Tjatse commented 7 years ago

Hey, actually there are lots of definitions for non-general sites in my spiders (aka spider policies), which look like:

So, here is my test case for you:

'use strict'

var $ = require('cheerio')
var read = require('read-art')

read({
  uri: 'http://www.breitbart.com/big-government/2016/11/28/somali-ohio-state-student-attack/',
  output: {
    type: 'text'
  },
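  selectors: {
    content: {
      // Limit extraction to the article body.
      selector: '.entry-content',
      // Custom extract: strip unwanted child nodes first, then extract the
      // cleaned node as the requested output type.
      extract: (node, options) => {
        node.find('script,style,#EmailOptin,#EmailOptinM,.ad').remove()
        return read.Reader.extractProp($, node, options.output.type, options)
      }
    }
  }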
}, (err, art) => {
  if (err) {
    return console.error('[ERROR]', err)
  }
  console.log(art.content)
})

Leidanya commented 7 years ago

Ah, I see now: the scoreRule and similar options are used to identify the topCandidate, and you are supposed to use a custom extract function to manipulate the child nodes within it.

This has been a great help. I'd recommend adding this to the selectors section of the README.
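The division of labor described above (scoring picks the top candidate, a custom extract function cleans its children) can be sketched in plain JavaScript. This is a toy model on plain objects, not read-art's implementation; `textOf`, `scoreNode`, `pickTopCandidate`, `extractText`, and the node shape are all hypothetical.

```javascript
'use strict'

// Toy DOM: each node has an id, optional text, and optional children.
// The shape is an assumption for illustration, not read-art's node type.
const page = {
  id: 'body',
  children: [
    { id: 'sidebar', children: [{ id: 'ad', text: 'Buy now' }] },
    {
      id: 'entry-content',
      children: [
        { id: 'p1', text: 'First paragraph.' },
        { id: 'EmailOptin', text: 'Sign up for our newsletter' },
        { id: 'p2', text: 'Second paragraph.' }
      ]
    }
  ]
}

// Collect all text under a node.
function textOf (node) {
  const own = node.text ? [node.text] : []
  const nested = (node.children || []).map(textOf).filter(Boolean)
  return own.concat(nested).join(' ')
}

// Phase 1: score whole containers. A penalty here only lowers a container's
// chance of being chosen; it does not remove anything inside the winner.
function scoreNode (node) {
  return textOf(node).length
}

function pickTopCandidate (root) {
  return root.children.reduce((best, n) => (scoreNode(n) > scoreNode(best) ? n : best))
}

// Phase 2: drop unwanted children of the chosen container, then extract text.
function extractText (node, unwanted) {
  const kept = node.children.filter((child) => !unwanted.test(child.id))
  return kept.map(textOf).join(' ')
}

const top = pickTopCandidate(page)
const content = extractText(top, /EmailOptin/)
console.log(top.id)
console.log(content)
```

In this model the newsletter block sits inside the winning container, so only the second phase can remove it.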

In addition, could you provide some more details on these policies? I understand if you cannot, since this is a private project.

Are you just passing body.reader of, say, 163.com.js into the options of the spider? I am trying to write a spider using read-art, and any design decisions and insights you have would be very helpful.

Thanks again.
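One possible way to organize per-site policies is sketched below. It is an assumption, not a description of Tjatse's private spiders. It keeps a table of option overrides keyed by hostname and merges the matching entry into the default options before calling read(). The `policies` table, the `optionsFor` helper, and the example.com URL are hypothetical; only the `uri`, `output.type`, and `selectors.content.selector` option shapes come from the test case in this thread.

```javascript
'use strict'

// Hypothetical policy table: hostname -> option overrides.
const policies = {
  'www.breitbart.com': {
    selectors: { content: { selector: '.entry-content' } }
  }
}

// Defaults applied to every site (the output option from the test case).
const defaults = { output: { type: 'text' } }

// Build the options object for a URL: defaults, then any site policy, then the uri.
// Object.assign merges shallowly, so a policy replaces a top-level key wholesale.
function optionsFor (uri) {
  const host = new URL(uri).hostname
  return Object.assign({}, defaults, policies[host] || {}, { uri })
}

const known = optionsFor('http://www.breitbart.com/big-government/2016/11/28/somali-ohio-state-student-attack/')
const unknown = optionsFor('http://example.com/some-article')
console.log(known.selectors.content.selector)
console.log(unknown.selectors)
```

The resulting object would then be passed as the first argument to read(), as in the test case, with unknown hosts falling back to the generic extraction.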