devongovett / reader

An API Compatible Replacement for Google Reader

Feed fetching #4

Open devongovett opened 11 years ago

devongovett commented 11 years ago

The feed fetcher should run as a separate background process and refresh each feed at a configurable interval. We should also support pubsubhubbub for push updates, but that can come later. When updates are found, they should be stored in MongoDB.

Some interesting modules to look at:

  1. https://github.com/danmactough/node-feedparser
  2. https://github.com/fent/node-feedsub
  3. https://github.com/andris9/pubsubhubbub
  4. https://github.com/technoweenie/nubnub
  5. https://github.com/superfeedr/superfeedr-node

Note that we need to be able to support a wide variety of feed types, so the more robust the modules we use, the better.

TODO

Ideally, the feed fetcher/parser itself would be a reusable module that allows subscription to a feed and emits events when new items are found. Then we'd have an application-specific fetcher that uses that module to receive posts and store them in the database. The API (i.e. not the fetcher) will also need access to the module that loads the actual feeds, so it can add a feed when someone subscribes to one that we're not already crawling. Separating those things out is therefore important.
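To make the separation concrete, here is a minimal sketch in which one feed-loading function is shared by the background fetcher and the API. Every name in it (`createSubscriptions`, `loadFeed`, `store`) is hypothetical, the loader is synchronous only to keep the example short, and a plain `Map` stands in for the database.

```javascript
// Hypothetical sketch: `loadFeed` and `store` are stand-ins, not project code.
function createSubscriptions(loadFeed, store) {
  return {
    // Called by the API when a user subscribes to a feed.
    subscribe(url) {
      if (!store.has(url)) {
        store.set(url, loadFeed(url)); // first crawl happens on demand
        return 'added';
      }
      return 'existing';
    },
    // Called by the background fetcher to refresh every known feed.
    refreshAll() {
      for (const url of store.keys()) store.set(url, loadFeed(url));
      return store.size;
    },
  };
}

// Usage with a fake loader that records which URLs it was asked to load.
const loadCalls = [];
const fakeLoad = (url) => {
  loadCalls.push(url);
  return { url, posts: [] };
};
const subs = createSubscriptions(fakeLoad, new Map());
const results = [
  subs.subscribe('http://example.com/feed.xml'),
  subs.subscribe('http://example.com/feed.xml'),
];
const refreshedCount = subs.refreshAll();
console.log(results, refreshedCount, loadCalls.length);
```

Because both entry points go through the same loader, a feed added via the API is crawled with the same code the fetcher uses later.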

optikfluffel commented 11 years ago

Maybe we should also think about using some sort of worker queue like Kue for that.

treasonx commented 11 years ago

Glad I found this project. I just started working on something similar to the feedfetcher last night. I think I'll take a stab at this one. Is there an IRC channel or anything related to this project? How many people are currently working on it?

optikfluffel commented 11 years ago

Just read the readme.md :smile:

treasonx commented 11 years ago

Oh great, I'm one of those people who don't RTFM! :) Sorry about that.

optikfluffel commented 11 years ago

No offense ^^

devongovett commented 11 years ago

I just pushed the beginnings of Feeder, a module that subscribes to a feed and emits events when posts are added or updated. Simple polling is supported for now, but pubsubhubbub and rsscloud will be supported for more realtime updates later on. This will probably be extracted into its own npm module at some point.

Here's an example of the module's API (also in src/feeder/test.js for now):

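```javascript
// Subscribe to a feed and log every event it emits
feeder.subscribe('http://rss.badassjs.com/')
    .on('meta', console.log)
    .on('post', console.log)    // a new post was added to the feed
    .on('update', console.log); // an existing post was updated
```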

The polling interval is configurable via a second parameter to `feeder.subscribe`, or to `new Feeder` if you prefer. `feeder.subscribe` automatically starts listening for updates. Right now that means polling, but once pubsubhubbub and rsscloud are supported, we'll use those instead for feeds that offer them. `post` events are emitted when a new post is added to the feed, and `update` events are emitted when an existing post is updated.
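As a rough illustration of how `post` and `update` events could be told apart, here is a minimal sketch that compares each polled snapshot with what was seen before. It is not the actual Feeder code: `FeedDiffer`, `ingest`, and the `{ id, title, content }` post shape are all assumptions made for the example.

```javascript
const { EventEmitter } = require('events');

// Hypothetical sketch: diffs successive snapshots of a feed and emits
// 'post' for unseen ids and 'update' for ids whose content changed.
class FeedDiffer extends EventEmitter {
  constructor() {
    super();
    this.seen = new Map(); // post id -> fingerprint of last seen content
  }

  // `posts` is an array of { id, title, content } objects (assumed shape),
  // e.g. the parsed result of one polling request.
  ingest(posts) {
    for (const post of posts) {
      const fingerprint = JSON.stringify([post.title, post.content]);
      if (!this.seen.has(post.id)) {
        this.emit('post', post);
      } else if (this.seen.get(post.id) !== fingerprint) {
        this.emit('update', post);
      }
      this.seen.set(post.id, fingerprint);
    }
  }
}

// Usage: feed two successive snapshots through the differ.
const differ = new FeedDiffer();
const events = [];
differ.on('post', (p) => events.push(['post', p.id]));
differ.on('update', (p) => events.push(['update', p.id]));

differ.ingest([{ id: 'a', title: 'First', content: 'Hello' }]);
differ.ingest([
  { id: 'a', title: 'First', content: 'Hello, edited' },
  { id: 'b', title: 'Second', content: 'World' },
]);
console.log(events);
```

In a polling setup, something like `ingest` would be called from a timer after each fetch and parse. Real feeds often lack stable ids, so the comparison key is the part most likely to need work.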

Hopefully I haven't duplicated work that someone else had already started; sorry if so! This should be a decent starting point, so feel free to work on adding pubsubhubbub/rsscloud support or improving this.

devongovett commented 11 years ago

TODO by me or someone else in regards to feeder:

devongovett commented 11 years ago

As it turns out, fetching and parsing feeds is not an easy problem. Some of the dirtier sides of the problem are documented here. If you're interested in a challenge and in helping out this project, this is the one to tackle. :)

voxpelli commented 11 years ago

@devongovett Yeah, fetching and parsing feeds can be hell. Perhaps use Superfeedr's new free plan for all fetching until the rest of the app is done? As far as I understand, fetching isn't the core problem this project is meant to fix, so it can wait until everything else works fine. :)

devongovett commented 11 years ago

@voxpelli I would be happy if someone did that for the time being, since it would speed up testing of the other parts of the API. However, I think it's important for this to be self-reliant (that's the problem we're trying to solve), so I would encourage continued work on the fetcher as well. I'm not personally going to implement the Superfeedr stuff, but if someone wants to, I'm not opposed to it as a stopgap solution.

devongovett commented 11 years ago

I've decided to go ahead and use Redis and the task queue module Kue as suggested by @optikfluffel. I thought that having a module with a nice API to subscribe to a feed and emit events when posts were added or updated was a good idea, but as I thought about it more, I realized that it's not terribly efficient. We would have to have a feeder object for each feed in memory all the time, constantly setInterval'ing.

Using Redis as a task queue means we can distribute processing of feeds over multiple processes, which will be important for scaling later on. Nothing gets stored in memory for each feed, and we can schedule the tasks to occur at any time we'd like. The code I've pushed so far isn't done; it's just a preview of the general idea I'd like to pursue. I've left feeder up in the repo for now, but I think dropping it in favor of the task queue approach is a better idea.
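The scheduling idea can be sketched without Redis at all. The following is an in-memory stand-in, not Kue's API and not the code in the repo: `FeedTaskQueue`, `runWorker`, and the example URLs are hypothetical, and the clock is passed in explicitly so the behaviour is easy to follow.

```javascript
// In-memory stand-in for a Redis-backed task queue. This is NOT Kue's API;
// all names here are hypothetical and only illustrate the scheduling idea.
class FeedTaskQueue {
  constructor() {
    this.tasks = []; // kept sorted by runAt; would live in Redis in practice
  }

  // Schedule a refresh of `url` at time `runAt` (milliseconds).
  schedule(url, runAt) {
    this.tasks.push({ url, runAt });
    this.tasks.sort((a, b) => a.runAt - b.runAt);
  }

  // Remove and return the next task that is due at `now`, or null.
  take(now) {
    if (this.tasks.length === 0 || this.tasks[0].runAt > now) return null;
    return this.tasks.shift();
  }
}

// A worker drains the tasks that are due, refreshes each feed, and
// reschedules it. Assumes intervalMs > 0, otherwise a task would be due
// again immediately. The worker itself holds no per-feed state.
function runWorker(queue, now, intervalMs, refresh) {
  const processed = [];
  let task;
  while ((task = queue.take(now)) !== null) {
    refresh(task.url);
    queue.schedule(task.url, now + intervalMs);
    processed.push(task.url);
  }
  return processed;
}

// Usage: two feeds with different due times, one worker run at t=1000
// and another at t=6000.
const queue = new FeedTaskQueue();
queue.schedule('http://example.com/a.xml', 0);
queue.schedule('http://example.com/b.xml', 5000);
const refreshed = [];
const first = runWorker(queue, 1000, 60000, (url) => refreshed.push(url));
const second = runWorker(queue, 6000, 60000, (url) => refreshed.push(url));
console.log(first, second, queue.tasks.length);
```

Since the only shared state is the queue, several worker processes could pull from it in parallel, which is the property that makes this approach scale better than one in-memory timer per feed.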

Let me know what you think!