evdb opened this issue 10 years ago
Hi @evdb,
Happy to discuss these limits with you.
The 60 day limit is how long unread items are stored for. After 60 days they are automatically marked as read. There is also a limit of 500 entries per feed.
The reasons for these limits are performance and cost.
There are two separate scaling issues that both of these limits are meant to address.
The obvious solution to those scaling issues would be to buy more hardware, but then we run into the second factor: cost.
The primary database server costs $1,808/month. Adding more machines of this type is prohibitively expensive: Feedbin's only source of revenue is paying customers, and right now another $1,808/month would be a significant expense.
Maybe you could explain a bit more about your use case. What issues are you running into with the current limits?
Interesting constraints. I'll give this more thought, and look through your existing schema and see if I can come up with an approach that lets me have more unread and you have lower DB bills :)
I tend to subscribe to many feeds, and then read each entire feed in oldest-first order. However, there are many feeds that I've not got round to reading or skimming for more than 60 days. The behaviour that prompted this issue was that I'd look at feedbin.me in the morning and see something that looked interesting in the 'Unread' list (which for me is sorted oldest first), and then when I returned in the afternoon it was gone.
The way I read feeds is definitely oldest entries first, and only rarely do I leave an entry unread (I either read it or skip it; either way it should be marked as read). The only time I'll have an unread entry with read neighbors is if I've decided to leave it as unread to prompt me to do something with it later (I read both using Reeder on my phone and using a browser on the feedbin.me website).
For my use case a data model like this might be more efficient than your current one, at least in the amount of data stored, if not in the computation needed to work with it (presented as a JSON document, but easily represented using tables and rows):
```json
{
  "feeds": {
    "foo.blog.com/feed.rss": {
      "all_read_before": "2013-04-17 12:34:56",
      "all_unread_after": "2013-04-20 12:34:56",
      "unread_entries": [ "foo12", "foo13", "foo23" ]
    },
    ...
  }
}
```
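To make the tables-and-rows version concrete, here is a minimal Python sketch of the same per-subscriber state. All names here are hypothetical, just the JSON document above restated as a row type, not Feedbin's actual schema:

```python
from dataclasses import dataclass, field
from datetime import datetime

# Hypothetical row type for the JSON document above; field names are
# illustrative, not Feedbin's actual schema.
@dataclass
class SubscriptionState:
    feed_url: str
    all_read_before: datetime    # every entry at or before this is read
    all_unread_after: datetime   # every entry after this is unread
    unread_entries: list[str] = field(default_factory=list)  # entry IDs between the markers
```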
The timestamps for `all_read_before` and `all_unread_after`, and the pseudo-IDs in `unread_entries`, would need changing to something more compatible with your current data model.
Here is a pretty picture (each purple horizontal line is a feed, being read in a particular way):
The advantages are that:

- Most read/unread state is captured implicitly by the `all_read_before` and `all_unread_after` markers.
- The unread count for a feed can be calculated by counting the entries newer than `all_unread_after` and adding the number of entries in `unread_entries`. If the value used in `all_unread_after` is snapped to the timestamp of an entry then this becomes cacheable up until the point a new entry is added to the feed (in which case the cache could be trivially updated, or just re-calculated).
- Marking a whole feed as read is trivial (set `all_read_before` and `all_unread_after` to the latest entry timestamp, clear `unread_entries`).

Disadvantages are that:

- Building the unread list is more work: you would walk entries newer than `all_read_before` in timestamp order and gather up entries as needed to populate the list.
- Calculating the total unread count is no longer a single `select count(*)` query on one table; it would be at best the sum of a series of cache hits plus a count(*) operation, at worst two queries per feed plus the read count.

But perhaps the biggest win is that this model would completely separate the feed entries from the subscriber data, as there is no longer any need to update the unread data when new entries are scraped.
As such, the two datastores could be separated. The subscriber data could possibly even be archived to a file on disk and deleted from the database after some period of inactivity, then loaded back into the database when the user returns to the site. I know that I usually read in small chunks a couple of times a day, so my data could certainly be ditched from the database for large periods of time. A document database like MongoDB would be well suited to this (one document per user) and has querying methods that would be suitable.
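A rough sketch of that archive/restore cycle, under the same assumptions as the earlier code (hypothetical names and paths; a real version would need locking and error handling):

```python
import json
from pathlib import Path

def archive_user(user_id: int, states: list[SubscriptionState], archive_dir: Path) -> None:
    """Serialise a dormant user's per-feed state to disk, after which
    the corresponding rows could be deleted from the hot database."""
    doc = {
        s.feed_url: {
            "all_read_before": s.all_read_before.isoformat(),
            "all_unread_after": s.all_unread_after.isoformat(),
            "unread_entries": s.unread_entries,
        }
        for s in states
    }
    (archive_dir / f"{user_id}.json").write_text(json.dumps(doc))

def restore_user(user_id: int, archive_dir: Path) -> dict:
    """Reload the archived document when the user next shows up,
    ready to be re-inserted into the database."""
    return json.loads((archive_dir / f"{user_id}.json").read_text())
```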
Sorry to be so long-winded, and to go so far off-topic from the title of this ticket. I've also probably been very simplistic; there are lots of edge cases I've not considered. It is an interesting problem though :)
Unread entries do get deleted as people mark items as read, but the vast majority of items stay unread.
Hmm, if this is the case then perhaps `unread_entries` should be changed to `read_entries` for items between the markers. The way of calculating the unread count would need to change a little, but would use the same number of queries, I believe.
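For instance, a hedged sketch of the inverted calculation, assuming a variant of the earlier `SubscriptionState` where a `read_entries` field replaces `unread_entries`:

```python
def unread_count_inverted(state, entry_times):
    """Unread count when read_entries is stored between the markers.

    Entries between all_read_before and all_unread_after are unread
    unless listed in read_entries; entries after all_unread_after are
    always unread. Still two per-feed counts, so the query count is
    unchanged; only the subtraction is new.
    """
    between = sum(1 for t in entry_times
                  if state.all_read_before < t <= state.all_unread_after)
    newer = sum(1 for t in entry_times if t > state.all_unread_after)
    return (between - len(state.read_entries)) + newer
```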
Hi @evdb,
Thanks for taking the time to flesh out this idea, it definitely looks interesting.
Before going much further with considering a new model, I was wondering: how many unread items do you typically have?
I've currently got 1050 unread, but it has been as high as 1500. I know that I should cull my reading list :)
Not sure if this is exactly the same request (I think I'm asking for the converse), but I would love to have Feedbin automagically mark as read items in certain feeds that have been sitting longer than, e.g., 2 or 3 days. I have several feeds that are voluminous but the info in them quickly becomes stale. Culling out the older posts automatically would speed things up for me considerably.
> I would love to have Feedbin automagically mark as read items in certain feeds that have been sitting longer than, e.g., 2 or 3 days. I have several feeds that are voluminous but the info in them quickly becomes stale
This is a few years old, so wanted to give it a bump. This would be super-valuable.
I believe that entries in feeds expire after 60 days. It would be great if this timescale could be configured in the settings - I'd want it much longer. The limit could also be displayed more visibly on the site (I only found reference to it in other sections).