dnmfarrell / Perly-Bot

An RSS trawler and social media broadcaster
BSD 2-Clause "Simplified" License

Create per-source content filters #5

Closed briandfoy closed 8 years ago

briandfoy commented 9 years ago

When we fetch the main URL, know how to get its interesting content (e.g. look in the right div) while ignoring the stuff wrapped around it. Similar sources, such as WordPress, probably have the same id or name for that section.

dnmfarrell commented 9 years ago

Are you thinking about non-RSS sources? Because every source is RSS or Atom, we usually know this. Maybe a "Source" class? We could start with Atom and RSS, and expand from there.

briandfoy commented 9 years ago

No, I'm thinking about looking at the URLs that the feeds point to, like a blog post. Ignore all the stuff around a blogs.perl.org post and only look in the entry-body div to judge the content.
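
To illustrate the idea of judging only the entry-body div, here is a minimal sketch in Python using the standard library's `html.parser`. It is not code from Perly-Bot. The names `DivTextExtractor` and `extract_text` are hypothetical, and matching `entry-body` against either an `id` or a `class` is an assumption, since the thread does not say which attribute blogs.perl.org uses. The depth counting assumes reasonably well-formed HTML (every non-void tag is closed).

```python
from html.parser import HTMLParser

# Elements that never have a closing tag, so they must not affect nesting depth.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


class DivTextExtractor(HTMLParser):
    """Collect the text inside the first element whose id or class matches `name`."""

    def __init__(self, name):
        super().__init__()
        self.name = name
        self.depth = 0       # nesting depth inside the target element; 0 means outside
        self.done = False    # set once the target element has been closed
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS or self.done:
            return
        if self.depth:
            self.depth += 1  # a nested element inside the target
            return
        attrs = dict(attrs)
        classes = (attrs.get("class") or "").split()
        if attrs.get("id") == self.name or self.name in classes:
            self.depth = 1   # entered the target element

    def handle_endtag(self, tag):
        if tag in VOID_TAGS or not self.depth:
            return
        self.depth -= 1
        if self.depth == 0:
            self.done = True

    def handle_data(self, data):
        if self.depth:
            self.chunks.append(data)


def extract_text(html, name="entry-body"):
    """Return the whitespace-normalized text of the matching element, or ''."""
    parser = DivTextExtractor(name)
    parser.feed(html)
    parser.close()
    return " ".join("".join(parser.chunks).split())
```

With a page that wraps the post in a sidebar and footer, `extract_text(page)` would return only the post text, which is what the content filter would then judge.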

dnmfarrell commented 9 years ago

Ohhhh gotcha. So fetch and parse the article HTML

dnmfarrell commented 8 years ago

OK, this is implemented in master. We have a per-domain regex that extracts the text to test from the post.
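
A per-domain regex lookup of the kind described above could look roughly like the following Python sketch. This is an illustration only, not the code in master. The table `EXTRACTORS`, the fallback `DEFAULT`, the helper `text_to_test`, and the specific patterns are all assumptions made for the example. The non-greedy pattern stops at the first `</div>`, so a post body containing nested divs would be cut short.

```python
import re
from urllib.parse import urlparse

# Hypothetical table mapping a feed item's domain to the regex that
# captures the part of the page worth testing.
EXTRACTORS = {
    "blogs.perl.org": re.compile(
        r'<div[^>]*class="[^"]*\bentry-body\b[^"]*"[^>]*>(.*?)</div>',
        re.S | re.I,
    ),
}

# Assumed fallback for domains without an entry: take the whole body.
DEFAULT = re.compile(r"<body[^>]*>(.*)</body>", re.S | re.I)

# Crude tag stripper, good enough for keyword matching on the remaining text.
TAG = re.compile(r"<[^>]+>")


def text_to_test(url, html):
    """Pick the regex for the URL's domain and return the extracted plain text."""
    host = (urlparse(url).hostname or "").lower()
    pattern = EXTRACTORS.get(host, DEFAULT)
    match = pattern.search(html)
    fragment = match.group(1) if match else html
    return " ".join(TAG.sub(" ", fragment).split())
```

Adding support for another source would then mean adding one entry to the table rather than changing the fetch logic.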