andrew-thox / pb-journalist

Responsible for scraping sites
Eclipse Public License 1.0

Record acquisition method #12

Open andrew-thox opened 8 years ago

andrew-thox commented 8 years ago

The Independent have a lot of RSS feeds, and these are pretty dynamic: stick /rss on the end of almost any URL and you get an RSS feed. This is useful but also complicates things. We don't know how they curate feeds such as /news/rss, which may or may not be a superset of /news/uk/rss, /news/world/rss and so on.

Any analytics we do to record the frequency of articles would also help us work out how these RSS feeds relate to each other, but only if we record which feed each article came from. The more granular RSS feeds could also help categorise articles.
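A rough sketch of what recording the source feed could look like, assuming we already have the raw RSS 2.0 XML for each feed path. The function names (`article_links`, `record_acquisition`, `missing_from_parent`) are made up for illustration and are not part of pb-journalist.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict


def article_links(rss_xml: str) -> set[str]:
    """Return the set of <item><link> URLs found in an RSS 2.0 document."""
    root = ET.fromstring(rss_xml)
    return {
        link.text.strip()
        for link in root.iterfind("channel/item/link")
        if link.text and link.text.strip()
    }


def record_acquisition(feeds: dict[str, str]) -> dict[str, set[str]]:
    """Map each article URL to the set of feed paths it was seen in.

    `feeds` maps a feed path such as "/news/uk/rss" to that feed's XML text.
    """
    seen = defaultdict(set)
    for feed_path, rss_xml in feeds.items():
        for url in article_links(rss_xml):
            seen[url].add(feed_path)
    return dict(seen)


def missing_from_parent(
    acquired: dict[str, set[str]], parent: str, children: list[str]
) -> set[str]:
    """Articles seen in at least one child feed but never in the parent feed.

    An empty result is consistent with the parent being a superset of the
    children for this snapshot; it does not prove it in general.
    """
    child_set = set(children)
    return {
        url
        for url, sources in acquired.items()
        if sources & child_set and parent not in sources
    }
```

Running `missing_from_parent(acquired, "/news/rss", ["/news/uk/rss", "/news/world/rss"])` over repeated snapshots would give some evidence either way on the superset question, and the per-article feed sets could double as rough categories.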

NOTE: Some of these RSS feeds are also entirely useless. If you put /rss on the end of a contributor URL you get an arbitrary number of articles, not a complete list, and the selection does not seem to depend on date of publication.

andrew-thox commented 8 years ago

Acquisition method should also be summarized in influx.