Open devongovett opened 11 years ago
Maybe we should also think about using some sort of worker queue like Kue for that.
Glad I found this project. I just started working on something similar to the feedfetcher last night. I think I'll take a stab at this one. Is there an IRC channel or anything related to this project? How many people are currently working on it?
Just read the readme.md :smile:
Oh great, I'm one of those people who don't RTFM!!!! :) Sorry about that.
No offense ^^
I just pushed the beginnings of Feeder, a module that subscribes to a feed and emits events when posts are added or updated. Simple polling is supported for now, but pubsubhubbub and rsscloud will be supported for more realtime updates later on. This will probably be extracted into its own npm module at some point.
Here's an example of the module's API (also in src/feeder/test.js for now):
feeder.subscribe('http://rss.badassjs.com/')
.on('meta', console.log)
.on('post', console.log)
.on('update', console.log);
The polling interval is configurable via a second parameter to `feeder.subscribe` or `new Feeder` if you like. `feeder.subscribe` automatically starts listening for updates, right now through polling, but if the feed supports pubsubhubbub or rsscloud, we'll use that instead. `post` events are emitted when a new post is added to the feed, and `update` events are emitted when a post is updated.
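To make the `post`/`update` distinction concrete, here is a minimal sketch of how a polling subscriber could tell new posts from changed ones. It is not the actual Feeder source: the class name `PollingFeeder`, the `ingest` method, the injected `fetchPosts` function, and the assumption that every post carries a stable `guid` are all hypothetical.

```javascript
const { EventEmitter } = require('events');

// Hypothetical sketch, not the real Feeder module.
// `fetchPosts` is an assumed async function that resolves to an array of
// already-parsed posts, each with a stable `guid`.
class PollingFeeder extends EventEmitter {
  constructor(fetchPosts, interval = 60000) {
    super();
    this.fetchPosts = fetchPosts;
    this.interval = interval; // polling interval in milliseconds
    this.seen = new Map();    // guid -> fingerprint of the last version seen
    this.timer = null;
  }

  // Compare a freshly fetched list against what we have seen so far.
  // Emits 'post' for unknown guids and 'update' for changed content.
  ingest(posts) {
    for (const post of posts) {
      const fingerprint = JSON.stringify(post);
      const previous = this.seen.get(post.guid);
      if (previous === undefined) this.emit('post', post);
      else if (previous !== fingerprint) this.emit('update', post);
      this.seen.set(post.guid, fingerprint);
    }
  }

  // Fetch once and feed the result through ingest().
  async poll() {
    this.ingest(await this.fetchPosts());
  }

  // Poll immediately, then once per interval. Callers should attach an
  // 'error' listener, since fetch failures are re-emitted as 'error'.
  start() {
    const tick = () => this.poll().catch((err) => this.emit('error', err));
    tick();
    this.timer = setInterval(tick, this.interval);
    return this;
  }

  stop() {
    clearInterval(this.timer);
    this.timer = null;
  }
}
```

Comparing a JSON fingerprint is the crudest possible change check; a real implementation would more likely compare an updated date or a content hash, and would need a fallback for feeds that have no usable guid.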
Hopefully no one was already working on this and I haven't just duplicated your work; sorry if so! This should be a decent starting point, so feel free to work on adding pubsubhubbub/rsscloud support or improving this.
TODO by me or someone else in regards to `feeder`:
- `meta` event only fire when it actually changes, rather than every time the feed is reloaded.
- `feeder.posts` also...
- ...`feedparser` if needed). We need to be reliable since there are lots of not so good feeds out there to deal with.

As it turns out, fetching and parsing feeds is not an easy problem. Some of the dirtier sides of the problem are documented here. If you're interested in a challenge and in helping out this project, this is the one to tackle. :)
@devongovett Yeah, fetching and parsing feeds can be hell. Perhaps use Superfeedr's new free plan for all fetching until the rest of the app is done? As far as I understand, fetching isn't the core problem this project is meant to fix, so it can wait until everything else works fine. :)
@voxpelli I would be happy if someone went and did that for the time being, since it would speed up testing of the other parts of the API. However, I think it's important for this to be self-reliant (that's the problem we're trying to solve), so I would encourage continued work on the fetcher as well. I'm not personally going to implement the Superfeedr stuff, but if someone wants to, I'm not opposed to it as a stopgap solution.
I've decided to go ahead and use Redis and the task queue module Kue as suggested by @optikfluffel. I thought that having a module with a nice API to subscribe to a feed and emit events when posts were added or updated was a good idea, but as I thought about it more, I realized that it's not terribly efficient. We would have to have a `feeder` object for each feed in memory all the time, constantly `setInterval`'ing.
Using Redis as a task queue means we can distribute processing of feeds over multiple processes, which will be important for scaling later on. Nothing gets stored in memory for each feed, and we can schedule the tasks to occur at any time we'd like. The code I've pushed so far isn't done; it's just a preview of the general idea I think I'd like to pursue. I've left `feeder` up in the repo for now, but I think dropping that in favor of the task queue approach is a better idea.
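For anyone following along, here is a rough sketch of the scheduling idea only. It deliberately does not use Kue or Redis: the `FeedQueue` class and `workOnce` function are hypothetical in-memory stand-ins, so the array below plays the role Redis would play in the real thing. The point is that a feed is just a scheduled job, and any worker that asks can pick up the next one that is due.

```javascript
// In-memory stand-in for a Redis-backed task queue (illustration only).
class FeedQueue {
  constructor() {
    this.jobs = []; // { url, runAt }, kept sorted by runAt ascending
  }

  // Schedule a feed to be fetched at `runAt` (a millisecond timestamp).
  schedule(url, runAt) {
    const job = { url, runAt };
    const index = this.jobs.findIndex((j) => j.runAt > runAt);
    if (index === -1) this.jobs.push(job);
    else this.jobs.splice(index, 0, job);
  }

  // Hand the next due job to whichever worker asks, or null if none is due.
  take(now) {
    if (this.jobs.length === 0 || this.jobs[0].runAt > now) return null;
    return this.jobs.shift();
  }
}

// One unit of work: take a due job, fetch the feed, reschedule it.
// `fetchFeed` is an assumed callback supplied by the caller.
function workOnce(queue, now, interval, fetchFeed) {
  const job = queue.take(now);
  if (job === null) return false;
  fetchFeed(job.url);
  queue.schedule(job.url, now + interval);
  return true;
}
```

With a shared queue in Redis instead of a local array, several worker processes could each run this loop, and no process would need a long-lived object or timer per feed.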
Let me know what you think!
The feed fetcher should run as a separate process in the background and refresh each feed at a given interval. We should also support pubsubhubbub for push updates, but that can come later. When updates are found, they should be stored in MongoDB.
Some interesting modules to look at:
Note that we need to be able to support a wide variety of feed types, so the more robust the modules we use, the better.
TODO
Ideally, the feed fetcher/parser itself would be a reusable module that allows subscription to a feed, and then provides events when new items are found. Then we'd have an application specific fetcher that would use that module to receive and store posts in the database. The API (i.e. not the fetcher) will also need to access the module that loads actual feeds in order to add them when someone subscribes to a new feed that we're not already crawling, so separating those things out is important.
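A small sketch of that split, under some assumptions: the reusable module is anything that emits `post` and `update` events, and the application-specific part is a function that listens and writes to a store. The name `storePosts`, the key format, and the `guid` field are hypothetical, and a plain `Map` stands in for MongoDB here.

```javascript
// Application-specific half: subscribe to a feed emitter and persist posts.
// `feed` is any EventEmitter that fires 'post' and 'update' with a post
// object; `store` is a Map standing in for a MongoDB collection.
function storePosts(feed, feedUrl, store) {
  // Upsert keyed by feed URL plus guid, so an update overwrites the
  // earlier version instead of creating a duplicate.
  const save = (post) => {
    store.set(`${feedUrl}#${post.guid}`, { feedUrl, ...post });
  };
  feed.on('post', save);
  feed.on('update', save);
  return feed;
}
```

Because `storePosts` only depends on the event names, the API could reuse the same feed-loading module when someone subscribes to a feed that is not being crawled yet, without pulling in any storage code.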