Consuming paged feeds - Githubissues

RFC 5005, "Feed Paging and Archiving", is a standard for Atom and RSS feeds that was published in 2007. I'd like Aperture to support consuming feeds using at least section 2, and preferably also section 4, of this standard.

Section 2, "Complete Feeds", should be easy, I hope. If a feed contains an empty `<fh:complete/>` element, then Aperture should discard any saved entry whose GUID (the Atom `id` or RSS `guid`) is no longer present in the feed document. In other words, the entries currently in the document should be treated as the complete set of entries for this feed. In this mode, if an entry was present in the past but is missing now, that's because it was deleted upstream, not because it scrolled out of the most recent 10 entries.
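Here's a rough sketch of the section 2 behavior, in Python purely for illustration. The names `prune_saved` and `saved` (a dict of stored entries keyed by GUID) are made up for this example and don't correspond to anything in Aperture; the namespace URI is the one RFC 5005 defines for `fh:`.

```python
import xml.etree.ElementTree as ET

ATOM = "{http://www.w3.org/2005/Atom}"
FH = "{http://purl.org/syndication/history/1.0}"  # namespace defined by RFC 5005


def is_complete(root):
    # Atom: fh:complete is a child of <feed>; RSS 2.0: a child of <channel>.
    return (root.find(f"{FH}complete") is not None
            or root.find(f"channel/{FH}complete") is not None)


def current_ids(root):
    # Collect Atom <id> values and RSS <guid> values from the document.
    ids = {e.findtext(f"{ATOM}id") for e in root.iter(f"{ATOM}entry")}
    ids |= {i.findtext("guid") for i in root.iter("item")}
    ids.discard(None)
    return ids


def prune_saved(saved, feed_xml):
    """Return the saved entries that should survive this fetch (hypothetical helper)."""
    root = ET.fromstring(feed_xml)
    if not is_complete(root):
        # Ordinary feed: missing entries may simply have scrolled off, so keep them.
        return dict(saved)
    keep = current_ids(root)
    return {guid: entry for guid, entry in saved.items() if guid in keep}
```

The only branch that deletes anything is the one where the publisher has explicitly marked the feed as complete, so ordinary feeds keep their current behavior.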

It might be nice if Microsub also signaled to clients that this feed is complete, allowing them to offer a different UI if they want to. I'm not sure they strictly need it, though. Doing this purely in the Microsub server should go a long way.

Support for section 4, "Archived Feeds", would also be great. In that mode, a feed should not have the `<fh:complete/>` element, but instead should have a `<link rel="prev-archive">` element whose `href` attribute points to another feed document. If you successfully walk the `prev-archive` links all the way back until there aren't any more, then you should have a complete feed, as in the first case.
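Walking the chain could look something like the sketch below. It only handles Atom documents, ignores `xml:base`, and takes a `fetch` callable (an assumption of this example, standing in for whatever HTTP client the server uses) so that it stays self-contained. `walk_archives` and `max_docs` are illustrative names, not anything from the spec or from Aperture.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urljoin

ATOM_NS = "{http://www.w3.org/2005/Atom}"


def prev_archive(root, base_url):
    # Look only at feed-level <link> elements, and resolve relative hrefs.
    for link in root.findall(f"{ATOM_NS}link"):
        if link.get("rel") == "prev-archive" and link.get("href"):
            return urljoin(base_url, link.get("href"))
    return None


def walk_archives(feed_url, fetch, max_docs=500):
    """Return [(url, parsed_root), ...], newest document first.

    `fetch(url)` must return the document body as a string or bytes.
    The `seen` set and `max_docs` cap guard against loops and runaway chains.
    """
    docs, seen, url = [], set(), feed_url
    while url is not None and url not in seen and len(docs) < max_docs:
        seen.add(url)
        root = ET.fromstring(fetch(url))
        docs.append((url, root))
        url = prev_archive(root, url)
    return docs
```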

For efficiency, however, you're permitted to treat the archived feed documents as if they have far-future Expires headers. If the publisher needs to change any of the archived feed documents, it needs to generate a new URL for them, so the next time you fetch the main feed you'll see that its prev-archive link has changed. This means you need to keep enough information to notice that some archived feed documents you've fetched before aren't part of the history any more, and also to discard any entries that were in those feed documents but haven't been copied into the new feed documents.
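To make that bookkeeping concrete, here's a minimal sketch that assumes the server keeps a cache mapping each archive URL to its `prev-archive` link and the set of entry IDs it contained. The cache shape, `refresh_archives`, and `fetch_archive` are all hypothetical names for this example.

```python
def refresh_archives(cache, newest_archive_url, fetch_archive, current_ids):
    """Re-walk the archive chain, fetching only URLs not seen before.

    cache:              dict of url -> (prev_url_or_None, set_of_entry_ids)
    newest_archive_url: prev-archive link from the freshly fetched main feed
    fetch_archive(url): returns (prev_url_or_None, set_of_entry_ids)
    current_ids:        entry IDs present in the main feed document itself

    Returns (chain, orphaned_ids), where orphaned_ids are entries that lived
    only in archive documents that are no longer part of the history.
    """
    chain = []
    url = newest_archive_url
    while url is not None and url not in chain:
        if url not in cache:
            # Archive documents are treated as immutable, so fetch each URL once.
            cache[url] = fetch_archive(url)
        chain.append(url)
        url = cache[url][0]

    # Anything cached but no longer reachable has been replaced upstream.
    stale = [u for u in cache if u not in chain]
    gone = set()
    for u in stale:
        gone |= cache.pop(u)[1]

    live = set(current_ids).union(*(cache[u][1] for u in chain))
    return chain, gone - live
```

When nothing upstream has changed, the walk is answered entirely from the cache and no archive document is fetched again.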

In addition, the spec says that an entry with the same GUID may appear in multiple feed documents, and if so, you should only use the version from the most recent feed document. This is a trade-off the publisher can choose to make: if an old entry is changed, it can avoid forcing existing clients to redownload the old archives, at the cost of making new clients download a larger total number of entries.
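The "most recent document wins" rule is simple to express if the documents are processed newest first. In this sketch each document is just a list of `(guid, entry)` pairs, which is an assumption made to keep the example short; `merge_entries` is an illustrative name.

```python
def merge_entries(docs_newest_first):
    """Combine entries from several feed documents into one dict keyed by GUID.

    docs_newest_first: the main feed document first, then each archive in the
    order the prev-archive links are followed. The first version of a GUID
    that is seen comes from the most recent document, so later ones are ignored.
    """
    merged = {}
    for doc in docs_newest_first:
        for guid, entry in doc:
            merged.setdefault(guid, entry)
    return merged
```

For example, `merge_entries([[("a", "edited")], [("a", "original"), ("b", "old")]])` keeps the edited version of `a` from the newer document and still picks up `b` from the archive.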

I hope all of that is easier to do than it is to explain. :sweat_smile: I think the simplest thing is to recompute the entire set of entries whenever the feed changes. But since the common case is that only recent entries get updated, and since a large feed may have a lot of archived pages, it'd be nice to avoid most of the recomputation when possible.

It also might be nice to load archived pages lazily, in response to Microsub clients using the paged API. That strikes me as more complicated state to keep in the server but it's probably worth doing eventually.
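One way to picture lazy loading is a generator that fetches the next archive only when the consumer asks for another page. This is just a sketch: `lazy_pages` and `fetch_page` are hypothetical, and a real server would need to persist this position between requests rather than hold a generator in memory.

```python
def lazy_pages(first_url, fetch_page):
    """Yield one list of entries per feed document, fetching on demand.

    fetch_page(url) returns (entries, prev_archive_url_or_None).
    Nothing older is fetched until the caller asks for the next page.
    """
    seen, url = set(), first_url
    while url is not None and url not in seen:
        seen.add(url)
        entries, url = fetch_page(url)
        yield entries
```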

Adding support for section 4 means there are some new error modes that might need to be reported to Microsub clients. In particular, fetching one of the archive pages might fail even though newer ones succeeded, and I'm not sure how you'd inform the user of that failure mode.
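I don't have an answer for the client-facing part, but on the server side the partial failure could at least be recorded alongside the entries that were retrieved. The sketch below assumes a `fetch_page` callable that raises `OSError` (the base class of Python's network and timeout errors) on failure; `WalkResult` and `walk_with_errors` are made-up names.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class WalkResult:
    entries: list = field(default_factory=list)
    failed_url: Optional[str] = None  # first archive URL that could not be fetched
    error: Optional[str] = None

    @property
    def complete(self):
        # True only if the whole chain was walked without a fetch failure.
        return self.failed_url is None


def walk_with_errors(first_url, fetch_page):
    """Walk the chain, keeping whatever was fetched before the first failure.

    fetch_page(url) returns (entries, prev_archive_url_or_None) or raises OSError.
    """
    result = WalkResult()
    seen, url = set(), first_url
    while url is not None and url not in seen:
        seen.add(url)
        try:
            entries, next_url = fetch_page(url)
        except OSError as exc:
            result.failed_url = url
            result.error = str(exc)
            break
        result.entries.extend(entries)
        url = next_url
    return result
```

A result with `complete` set to false means the history is known to be truncated at `failed_url`, which is the piece of information a client would need in order to show any kind of warning.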

In short: I think supporting RFC 5005 section 2 is easy and I'd love to see that done first; section 4 is more useful but opens up some additional questions that may need more thought.

aaronpk / Aperture

Consuming paged feeds #45