Closed briandfoy closed 8 years ago
Are you thinking about non-RSS sources? Because every source is RSS or Atom, we usually know this. Maybe a "Source" class? We could start with Atom and RSS, and expand from there.
No, I'm thinking about looking at the URLs that the feeds point to, like a blog post. Ignore all the stuff around a blogs.perl.org post and look only in the entry-body div to judge the content.
Ohhhh gotcha. So fetch and parse the article HTML
OK, this is implemented in master. We have a per-domain regex that extracts the text to test from the post.
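For readers who want a picture of the per-domain regex idea, here is a minimal Python sketch. It is not the code in master: the `EXTRACTORS` table, the `extract_text` function, and the pattern itself are hypothetical, and the only detail taken from this thread is that blogs.perl.org posts keep their content in an entry-body div.

```python
import re
from urllib.parse import urlparse

# Hypothetical table: one pattern per domain, capturing the part of the
# page worth testing. The real patterns are not shown in this thread.
EXTRACTORS = {
    "blogs.perl.org": re.compile(
        r'<div[^>]*class="[^"]*\bentry-body\b[^"]*"[^>]*>(.*?)</div>',
        re.DOTALL | re.IGNORECASE,
    ),
}

def extract_text(url, html):
    """Return the interesting part of the page, or the whole page if no rule applies."""
    host = urlparse(url).hostname or ""
    pattern = EXTRACTORS.get(host)
    if pattern is None:
        return html  # unknown domain: fall back to the full HTML
    match = pattern.search(html)
    return match.group(1) if match else html
```

The non-greedy match stops at the first closing `</div>`, so this sketch would cut the content short if the entry body contains nested divs. That trade-off comes with using a regex rather than an HTML parser.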
When we fetch the main URL, know how to get its interesting content (e.g. look in the right div) while ignoring the stuff wrapped around it. Similar sources, such as WordPress, probably have the same id or name for that section.
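As an illustration of "look in the right div", here is a hedged Python sketch that uses the standard library HTML parser rather than a regex, so nested divs are handled. The `SectionText` class and `section_text` function are made-up names, and which attribute and value identify the content section on a given platform is an assumption the caller has to supply.

```python
from html.parser import HTMLParser

class SectionText(HTMLParser):
    """Collect the text inside the first div whose attribute contains a given value."""

    def __init__(self, attr, value):
        super().__init__()
        self.attr = attr      # e.g. "class" or "id" (caller's assumption)
        self.value = value    # e.g. "entry-body"
        self.depth = 0        # > 0 while inside the target div
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if self.depth:
            if tag == "div":
                self.depth += 1  # track nested divs so we know when the target closes
        elif tag == "div" and self.value in (dict(attrs).get(self.attr) or "").split():
            self.depth = 1

    def handle_endtag(self, tag):
        if self.depth and tag == "div":
            self.depth -= 1

    def handle_data(self, data):
        if self.depth:
            self.chunks.append(data)

def section_text(html, attr, value):
    parser = SectionText(attr, value)
    parser.feed(html)
    parser.close()
    # Join the pieces and collapse runs of whitespace.
    return " ".join(" ".join(parser.chunks).split())
```

If similar platforms do share an id or class for the content section, one attribute and value pair per platform could cover many domains at once.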