DocNow / diffengine

track changes to the news, where news is anything with an RSS feed
MIT License

configure text element #28

Open edsu opened 7 years ago

edsu commented 7 years ago

It might be useful to be able to configure a feed with a CSS selector to specify which element to extract text from with readability. For example, the Washington Post currently uses

<article itemprop="articleBody">...</article>

to enclose the text of the article, following the https://schema.org/NewsArticle microdata vocabulary. Perhaps the config could look like:

- name: Washington Post - Politics
  url: http://feeds.washingtonpost.com/rss/politics
  css_selector: article[itemprop="articleBody"]
  twitter:
    access_token: foo
    access_token_secret: bar
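
To make the idea concrete, here is a rough sketch of what honoring a `css_selector` like the one above could involve. It is not diffengine's code and does not use readability; it uses only the Python standard library and understands just the simple `tag[attr="value"]` selector form. The names `SelectorTextExtractor` and `extract_text` are made up for illustration.

```python
import re
from html.parser import HTMLParser

# Assumption: only the simple form tag[attr="value"] is supported here.
SELECTOR_RE = re.compile(r'^(\w+)\[([\w-]+)="([^"]*)"\]$')


class SelectorTextExtractor(HTMLParser):
    """Collect the text inside the first element matching tag[attr="value"]."""

    def __init__(self, selector):
        super().__init__()
        match = SELECTOR_RE.match(selector)
        if match is None:
            raise ValueError(f"unsupported selector: {selector!r}")
        self.tag, self.attr, self.value = match.groups()
        self.depth = 0      # nesting depth of self.tag inside the match
        self.found = False  # True once a matching element has been seen
        self.done = False   # True once the matching element has closed
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if self.done:
            return
        if self.depth > 0:
            # Track nested elements with the same tag name.
            if tag == self.tag:
                self.depth += 1
        elif tag == self.tag and dict(attrs).get(self.attr) == self.value:
            self.depth = 1
            self.found = True

    def handle_endtag(self, tag):
        if self.depth > 0 and tag == self.tag:
            self.depth -= 1
            if self.depth == 0:
                self.done = True

    def handle_data(self, data):
        if self.depth > 0:
            self.chunks.append(data)


def extract_text(html, selector):
    """Return whitespace-normalized text of the first match, or None."""
    parser = SelectorTextExtractor(selector)
    parser.feed(html)
    parser.close()
    if not parser.found:
        return None
    return " ".join("".join(parser.chunks).split())
```

Returning `None` when nothing matches, rather than an empty string, keeps "the selector no longer matches" distinguishable from "the article is empty".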

I guess the downside to this is that sites change, so unless you are watching a feed closely you may not notice when the site's markup changes and the selector stops matching, and your diffengine instance would quietly stop working.
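
One way to soften that downside might be to count consecutive entries where the selector yields no text and log a warning past some threshold. The sketch below is only an assumption about how that could look; `record_extraction`, the logger name, and the default threshold of 3 are all hypothetical and not part of diffengine.

```python
import logging

logger = logging.getLogger("selector_watch")  # hypothetical logger name


def record_extraction(misses, feed_name, text, threshold=3):
    """Update the per-feed miss count; return True when a warning is logged.

    misses: dict mapping feed name -> consecutive empty extractions.
    text:   extracted article text, or None/empty if nothing matched.
    """
    if text and text.strip():
        # A successful extraction resets the streak.
        misses[feed_name] = 0
        return False
    misses[feed_name] = misses.get(feed_name, 0) + 1
    if misses[feed_name] >= threshold:
        logger.warning(
            "css_selector for %r matched no text in %d consecutive entries; "
            "the site's markup may have changed",
            feed_name,
            misses[feed_name],
        )
        return True
    return False
```

Requiring several misses in a row, rather than warning on the first one, avoids noise from the occasional entry that legitimately lacks the element.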