exercism / discussions

For discussing things like future features, roadmap, priorities, and other things that are not directly action-oriented (yet).

Rearchitect "read/unread" status in the activity streams #15

Closed by kytrinyx 7 years ago

kytrinyx commented 8 years ago

We have two activity streams, a track activity stream, and a team activity stream.

The track stream is what everyone gets access to once they have submitted a solution to an exercise in that track. It can filter by "mine" (everything the user has submitted in that track) or "everything" or by problem, so that you can look at all the "gigasecond" submissions, or whatever.

The team activity stream is something that all team members have access to. It can filter on language, language/problem, user, or user/language.

It's important to show whether or not something in that stream has new activity since you last looked at it.

The way that we are currently doing that is to have a "views" table which compares the "last viewed at" with a "last updated at" timestamp on the solution.

                                       Table "public.views"
     Column     |            Type             |                     Modifiers                      
----------------+-----------------------------+----------------------------------------------------
 id             | integer                     | not null default nextval('views_id_seq'::regclass)
 user_id        | integer                     | not null
 exercise_id    | integer                     | not null
 last_viewed_at | timestamp without time zone | not null
 created_at     | timestamp without time zone | not null
 updated_at     | timestamp without time zone | not null
Indexes:
    "views_pkey" PRIMARY KEY, btree (id)
    "index_views_on_user_id_and_exercise_id" UNIQUE, btree (user_id, exercise_id)

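For context, the unread check is conceptually something like this (simplified; the "exercises" table and its columns here are a sketch, only the views table above is the real schema):

    -- Hypothetical: an exercise (solution) is "unread" for the viewer if it
    -- was updated after they last viewed it, or if they have never viewed it.
    SELECT e.id
      FROM exercises e
      LEFT JOIN views v
        ON v.exercise_id = e.id
       AND v.user_id = 42              -- the viewing user
     WHERE v.id IS NULL
        OR e.last_updated_at > v.last_viewed_at;
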
This table is growing fast, and we're going to hit the limit on how many rows we're allowed to have on Heroku. I'm currently paying $9/month, which caps us at 10MM rows across the whole database. The views table alone is over 6MM rows at the moment.

Any ideas about how to tackle this in a more reasonable way?

trueheart78 commented 8 years ago

Is data in the table pruned at any point? If not, would it make any sense to do so?

drteeth commented 8 years ago

That was my thought also. Make the view states only valid for a certain window (2 years?).

Otherwise, does anyone know if the row count limit could be sidestepped by using a pg array or json type?

bernardoamc commented 8 years ago

A temporary "solution" would be removing entries from this table based on the last time the user logged in; it is not unreasonable to erase data for inactive users. We could even archive this information and restore it when the user comes back, using something like a background job. Unfortunately this doesn't solve things in the long run.

Another option would be remodeling with jsonb: the users table (or something similar, I don't know the current modeling) could have a column called solved_problems of type jsonb. This column would contain something like {problem_id: { last_viewed: 'xyz'}, problem_id: { last_viewed: 'xyz'}, ...}. This would be slow, since we would need to check every user to see if they have solved the problem, so we could also add a users column of type array on the problems table. With this we could drop the views table; the problem is that this is a major change.

We could also use a mix of both strategies: put a flag in the users table to mark each user active or inactive, and only update the jsonb of the active users. When an inactive user comes back we can regenerate their json and flag them as active.
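
A rough sketch of what that could look like (all table and column names here are guesses, not the actual schema):

    -- Hypothetical jsonb remodel: one row per user, per-problem view
    -- timestamps in a single column (keys are problem ids as strings).
    ALTER TABLE users ADD COLUMN solved_problems jsonb NOT NULL DEFAULT '{}'::jsonb;

    -- Record that user 42 viewed problem 123 (jsonb_set needs Postgres 9.5+):
    UPDATE users
       SET solved_problems = jsonb_set(solved_problems, '{123}',
                                       '{"last_viewed": "2016-05-20T12:00:00Z"}')
     WHERE id = 42;

    -- Look up when user 42 last viewed problem 123:
    SELECT solved_problems #>> '{123,last_viewed}'
      FROM users
     WHERE id = 42;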

kytrinyx commented 8 years ago

I think we could prune some old data, but... there's not really any old data in there yet... the oldest record is from March 2015 (a little over a year ago).

We never have to check viewed status for more than one user at a time, so that's something. I am worried about how big such a json blob would be. For example, my current number of viewed exercises is 35067, and it's only going to get bigger.

trueheart78 commented 8 years ago

That makes sense, @kytrinyx. It's good to know what some of the limitations may be. Is your number of viewed exercises the highest, or does someone else have more than 35k?

kytrinyx commented 8 years ago

I would guess that someone has a higher number, let me check.

kytrinyx commented 8 years ago

The person with the highest number of views as of today has 102042 of them. So definitely higher.

trueheart78 commented 8 years ago

Thanks for checking. The fact that one user can generate 100k records just by perusing the site definitely seems like a data bloat issue. I will throw this to a DBA tomorrow and see what they say.

kytrinyx commented 8 years ago

Yeah, this is a dire, dire data bloat issue. I feel like we need to attack it from a different angle, but I'm not seeing anything obvious just yet.

cforain commented 8 years ago

@kytrinyx correct me if I'm wrong, because I don't know exactly your modelling, but doesn't this user have over 100k rows on the relationship table as well? And if so, is this table properly (double) indexed? Let's remember that indexes also take space. Also, to my understanding, this notification could be asynchronous, right? If it can be, the jsonb parsing could happen in a background job, so being slower wouldn't be such an issue.

I think the best way to see if the jsonb solution is feasible would be to simulate a json of that size (100k entries) in the database and compare it with the space that the 100k rows take (if that's the case), plus a benchmark of querying the database vs. parsing the json. Isn't it worth at least testing? Sorry if I'm talking nonsense; I'm not familiar with the project and haven't checked the source code yet. Regards.

cforain commented 8 years ago

Also, I have worked with zip compression and decompression in a Rails app, and pure text files (such as json) tend to become very small, like 90% compression rate. Obviously it's a trade-off between disk space and the processing power used to compress/decompress files on the fly, but in my case it was worth it.

I worked with the zip files on the file system so I could compress/decompress them on a ramdisk (to smooth out the I/O blocking), but it seems there is already a Postgres extension to compress jsonb. I think it's worth checking out:

https://www.citusdata.com/blog/14-marco/156-compressing-jsonb-using-cstore-fdw

bernardoamc commented 8 years ago

As far as I can see this is a "nice to have" kind of feature, but we can work without it. The exercises are already sorted by last_activity_at, so we see the latest submissions first. We could keep the activity stream without the viewed part and add an option to track a specific type of problem (something like the watch option on GitHub, but similar to what we have now, without emails). Or add the viewed part on the client side using localStorage? I know it would not be as good as the current feature, but it is a possibility.

We also could find the top reviewers of each track and ask if removing this feature would compromise their workflow and if they have suggestions, since they will be the most impacted.

trueheart78 commented 8 years ago

@cforain The limitation, right now, is not disk space, but actual number of rows in the database. The Heroku plan that we're currently utilizing has a 10MM row limit, and bumping up to the next plan is 4.5x more expensive ($9 per month to $50 per month).

cjestel commented 8 years ago

A co-worker asked me if I'd mind taking a look at this from a DBA perspective. I'm curious to know more about a person who has 102042 views. Based on the description of what a view is, does that sound reasonable for a person to achieve? In short, I'm curious if you might be seeing crawlers or things of that nature that are logging in and hitting everything they can. You might consider setting your robots.txt up if you haven't already.

To the root of your question itself though, you might look into table partitioning. I haven't used it in Postgres myself but have in other RDBMSs. Partitioning normally has some tradeoffs around FK enforcement (if you want to FK to a partitioned table), and may have a few other gotchas; you can read more about it in the Postgres documentation on partitioning. Basically it allows you to create a master table plus a bunch of smaller child tables associated with it. You insert into the master table and, based on rules you set up, it will insert/update/delete in the correct child table. Normally this type of approach is used in analytics with hundreds of millions of rows. The other benefit of partitioning is that if you implement a retention policy and your tables are structured around it (for instance created_at by month), you can drop a child table and not take the performance hit you would see if you deleted a lot of records and then had vacuum run.
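
Roughly, a sketch of the inheritance-based approach (table names, month boundaries, and trigger logic are illustrative only):

    -- Hypothetical: one child table per month of created_at.
    CREATE TABLE views_2016_05 (
        CHECK (created_at >= '2016-05-01' AND created_at < '2016-06-01')
    ) INHERITS (views);

    -- Route inserts on the master table to the right child.
    CREATE FUNCTION views_insert_router() RETURNS trigger AS $$
    BEGIN
        IF NEW.created_at >= '2016-05-01' AND NEW.created_at < '2016-06-01' THEN
            INSERT INTO views_2016_05 VALUES (NEW.*);
        ELSE
            RAISE EXCEPTION 'no partition for created_at = %', NEW.created_at;
        END IF;
        RETURN NULL;
    END;
    $$ LANGUAGE plpgsql;

    CREATE TRIGGER views_insert_trigger
    BEFORE INSERT ON views
    FOR EACH ROW EXECUTE PROCEDURE views_insert_router();

    -- With a monthly retention policy, expiring old data is then a cheap
    -- DROP TABLE on an old child rather than a huge DELETE plus vacuum.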

Lastly, as a side note: based on your description of how you use this table, it sounds like you would benefit from additional indexes on the following fields: user_id, exercise_id, last_viewed_at. Since you can filter on user_id by itself or exercise_id by itself, those should definitely help with regular query performance. If you also use last_viewed_at in those same queries, you may see even better performance from composite indexes on (user_id, last_viewed_at) and (exercise_id, last_viewed_at).
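
For example, something along these lines (index names are arbitrary):

    -- Hypothetical index DDL for the suggestions above. Note that the existing
    -- unique index on (user_id, exercise_id) already serves user_id-leading
    -- lookups, so a separate single-column user_id index may be redundant.
    CREATE INDEX index_views_on_exercise_id ON views (exercise_id);
    CREATE INDEX index_views_on_user_id_and_last_viewed_at ON views (user_id, last_viewed_at);
    CREATE INDEX index_views_on_exercise_id_and_last_viewed_at ON views (exercise_id, last_viewed_at);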

bernardoamc commented 8 years ago

@cjestel Unfortunately the table partitioning solution doesn't seem to work, since the constraint is on the total number of rows in the database rather than per table (correct me if I'm wrong). If the constraint were per table, Postgres deals really well with partitioning. :)

kytrinyx commented 8 years ago

I don't know exactly your modelling, but doesn't this user have over 100k rows on the relationship table as well?

Sorry -- that was unclear. This person has looked at 100k solutions, they have only submitted about 200 themselves.

To have a row in the "views" table means that you have looked at a given solution at some point.

kytrinyx commented 8 years ago

this notification could be asynchronous, right?

This is not currently implemented as a notification; it's a list of solutions in a given category (by a person, in a language, or a specific exercise in that language). It's realtime in the sense that when you go to the activity stream, then go look at a solution, then go back to the activity stream, the solution you just looked at will no longer be displayed as "unread".

[screenshot: the Go track's "leap" exercise activity stream, showing read/unread status]

As far as I can see this is a "nice to have" kind of feature, but we can work without it.

No, I can't work without this. When I actively review exercises, I look at these lists every day, and review anything from 20 to 50 exercises. I know when to stop when I get to the ones where I've previously viewed them.

I think the best way to see if the jsonb solution is feasible would be to simulate a json of that size (100k entries) in the database and compare it with the space that the 100k rows take (if that's the case), plus a benchmark of querying the database vs. parsing the json. Isn't it worth at least testing?

Yes, absolutely. I will be testing that, but it seemed relevant to mention that people would be likely to have data that grows at an astonishing rate.

kytrinyx commented 8 years ago

pure text files (such as json) tend to become very small, like 90% compression rate

Yeah, I love working with pure text files. In this case we're running on heroku, so I would have to look into whether or not there would be a way of doing that, if we wanted to experiment with something in that direction (text files, when the person logs in, suck that into redis, use that while they're on the site, and then have a background job to dump redis data back to text, for example).

The jsonb (with the compression) would be simpler though.

kytrinyx commented 8 years ago

Based on the description of what a view is, does that sound reasonable for a person to achieve? In short, I'm curious if you might be seeing crawlers or things of that nature that are logging in and hitting everything they can.

It's unlikely that it's crawlers--to browse the solutions freely you need to be logged in. 100k is absolutely doable. I wouldn't be surprised if I had 100k views in reality, because we only started tracking this a few months ago. The older views are ones that I migrated based on whether or not someone had commented on a solution.

@cjestel I will look at the queries I'm running and add indices--thanks for the tip!

kytrinyx commented 8 years ago

Thanks to everyone who has weighed in so far! This is really helpful.

I also want to apologize for sounding dismissive--I forgot to say "yes, definitely want to look into that, and here's the thing that I worry about" instead of just "here's what worries me".

bernardoamc commented 8 years ago

No, I can't work without this. When I actively review exercises, I look at these lists every day, and review anything from 20 to 50 exercises. I know when to stop when I get to the ones where I've previously viewed them.

Good to know, thanks! 👍

So, for now the only idea is to use jsonb, the possible downside being the size of the field. Maybe we can split this field by track? Then we would find unread problems by checking user_id, or user_id and track_id if we are in a specific track. In the views table, a member with 3 tracks would then have just three rows. I don't know if this would ease the problem, or whether the person with 100k views has just one or two tracks.

drteeth commented 8 years ago

Could the "mark all read" feature be leveraged somehow? The idea being keeping deltas from the last time a track is fully read?

Creating the stream would involve combining the two tables.

This wouldn't address the issue completely, just sort of curb the growth in some cases. For example, I have 6300 unread elixir solutions in my stream; if I mark them all read, that's 6300 new rows, if I'm reading correctly. Using deltas for tracks would make that 1 row instead.

It doesn't address the case where people have a more swiss-cheese usage pattern, marking many solutions viewed but never the whole track and it also introduces a lot of complexity.

I'm shooting from the hip a bit here; would something like this even work?

cjestel commented 8 years ago

Ah, sorry, I missed the "altogether" part of the 10MM rows. It sounds like the rows have real value on their own. From a database perspective I'd probably want to keep it the way it is, but obviously that doesn't solve your issue.

I've never worked with Heroku before and don't know what their limitations are. To buy yourself some more space without the 4.5x jump in cost, could you maybe make another app with its own database and look into offloading the views to that app/database, with lookup calls through an API? This would definitely break any FK dependencies from the table, but you could then put logic in the app to handle sharding based on something like user_id ranges and slowly stand up more app/database servers as you grow, until you reach the point where it makes sense financially to bring the data back together (based on how their pricing tiers work). If you go this route, I'd build it to split on user_id ranges from the start, so that when you add a third or fourth app server you don't have to change the app again outside of a simple block of code that points to the correct API service.

cjestel commented 8 years ago

I'd be hesitant to use array or jsonb types to try and reduce row counts as you currently search on these fields. You'll see serious load spikes on certain types of filters if you can't access the data directly.

cforain commented 8 years ago

@kytrinyx

This is not currently implemented as a notification; it's a list of solutions in a given category (by a person, in a language, or a specific exercise in that language). It's realtime in the sense that when you go to the activity stream, then go look at a solution, then go back to the activity stream, the solution you just looked at will no longer be displayed as "unread".

Yes, but from what I understood, it doesn't have to be "right away", right? For instance, if it took 30 seconds to parse the json and generate the notifications, would that be a problem?

@cforain The limitation, right now, is not disk space, but actual number of rows in the database. The Heroku plan that we're currently utilizing has a 10MM row limit, and bumping up to the next plan is 4.5x more expensive ($9 per month to $50 per month).

@trueheart78 I understand. I'm not too used to having this kind of "artificial" limit, so I thought disk space was what mattered. But if that's the case, I think it's even more likely that the jsonb approach would be a good fit in your case, provided the performance is acceptable.

@kytrinyx I don't know much about Heroku constraints, but even if they don't support jsonb and the compression extension, you could do it manually: store a field with the file location in a varchar (or similar), store the json in a structured way on your filesystem, and use that field to uncompress, open, parse, and recompress the json on demand, using tmp files.

@bernardoamc and I have done this in the past and it turned out to be a pretty good solution in our case. You just have to take extra care with the security of how you process and deliver the files, as well as with your backups, and exclude these files from your git repository.

cforain commented 8 years ago

I'd be hesitant to use array or jsonb types to try and reduce row counts as you currently search on these fields. You'll see serious load spikes on certain types of filters if you can't access the data directly.

@cjestel while I do partially agree with you, I don't know if the load spike is a problem in their case (if the performance is good), since the only bottleneck right now is the number of rows. Also, from what I understand, you would only search for the current user, so maybe it isn't such a problem after all.

cjestel commented 8 years ago

@cforain

The track stream is what everyone gets access to once they have submitted a solution to an exercise in that track. It can filter by "mine" (everything the user has submitted in that track) or "everything" or by problem, so that you can look at all the "gigasecond" submissions, or whatever.

It looks like they have the ability to filter by problem (I assume this is related to exercise_id), which is where I could see the performance penalty. Sharding definitely has its own set of issues and requires a decent amount of application coding, but it gives you a scalable approach to this problem with a slightly more linear growth in expense.

Ultimately business decisions such as cost do drive things. An array or jsonb type could help with the current issue, but I don't think it would scale well. Sometimes sub-optimal decisions are necessary to grow something to the point where it can pay for itself. With the amount of data that has come in over a year's time, I'd recommend either investing in more horsepower for the current server (which has already been stated as undesirable) or sharding, to achieve a better growth direction without compromising performance.

cforain commented 8 years ago

It looks like they have the ability to filter by problem (I assume this is related to exercise_id), which is where I could see the performance penalty. Sharding definitely has its own set of issues and requires a decent amount of application coding, but it gives you a scalable approach to this problem with a slightly more linear growth in expense. Ultimately business decisions such as cost do drive things. An array or jsonb type could help with the current issue, but I don't think it would scale well. Sometimes sub-optimal decisions are necessary to grow something to the point where it can pay for itself. With the amount of data that has come in over a year's time, I'd recommend either investing in more horsepower for the current server (which has already been stated as undesirable) or sharding, to achieve a better growth direction without compromising performance.

@cjestel I couldn't agree more.

@kytrinyx As for the sharding approach, is it possible to deploy some kind of master-slave replication, such as Slony, on Heroku? Maybe you could set up a partial one-direction replication, prune old data from the master, keep it on the slave(s), and serve reads not found on the master from the slaves.

kytrinyx commented 8 years ago

Could the "mark all read" feature be leveraged somehow?

Yeah, that's an excellent question. I think we could have a separate table where we could keep a record of watermarks--everything older than timestamp X has been seen. That way we could take a little bit of load off of the views table. It wouldn't solve the problem entirely, but it might buy us some time. I'll do some digging to see how many rows it would save us.
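
Something like this, perhaps (completely hypothetical table and column names):

    -- Hypothetical watermark table: everything in a track updated before
    -- seen_until counts as already seen by that user.
    CREATE TABLE watermarks (
        id         serial PRIMARY KEY,
        user_id    integer   NOT NULL,
        track_id   varchar   NOT NULL,
        seen_until timestamp NOT NULL,
        UNIQUE (user_id, track_id)
    );

    -- "Mark all read" becomes one upsert instead of thousands of view rows
    -- (ON CONFLICT needs Postgres 9.5+):
    INSERT INTO watermarks (user_id, track_id, seen_until)
    VALUES (42, 'elixir', now())
    ON CONFLICT (user_id, track_id) DO UPDATE SET seen_until = EXCLUDED.seen_until;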

could you maybe make another app with its own database and look into offloading the views to that app/database, with lookup calls through an API?

Offloading into a separate app is definitely an option, especially with some sharding by user_id. The foreign key thing doesn't have to be a big problem--I could ask for the view status of the actual exercise_ids that the person is looking at (these are paginated 50 at a time).
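
In other words, roughly this, per page of 50 (ids are purely illustrative):

    -- Hypothetical lookup against the split-out views store: given the
    -- exercise ids on the current page, return which ones have been viewed.
    SELECT exercise_id, last_viewed_at
      FROM views
     WHERE user_id = 42
       AND exercise_id IN (101, 102, 103 /* ... the 50 ids on the page */);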

Yes, but from what I understood, it doesn't have to be "right away", right? For instance, if it took 30 seconds to parse the json and generate the notifications, would that be a problem?

My impression is that it would make the site feel kind of broken if you get back to the main stream page and it shows "unread" for the thing that you were just looking at. The views are for the current user, not the person who wrote the solution (and might not be logged in at the time). On exercises that I've reviewed a lot it sometimes only takes me five or six seconds to see a typical code smell and copy/paste the right comment that I've prewritten.

That said, if it wasn't 30 seconds but a bit less, it could work. I'll think about how we could do it.

but even if [Heroku] doesn't support jsonb and the compression extension, you could do it manually: store a field with the file location in a varchar (or similar)

They do support jsonb. I need to look into the compression extension, but my guess is that they would support it.

It looks like they have the ability to filter by problem (I assume this is related to exercise_id), which is where I could see the performance penalty.

Right. The data goes like this:

Ultimately business decisions such as cost do drive things.

Right, especially since I'm basically running this project out of my own pocket, and until I get some form of sponsorship (or a business model, but that's a different story) I have personal constraints.

As for the sharding approach, is it possible to deploy some kind of master-slave replication, such as Slony, on Heroku? Maybe you could set up a partial one-direction replication, prune old data from the master, keep it on the slave(s), and serve reads not found on the master from the slaves.

Yeah, I think Slony is available--I'd have to look at the cost structure for that, but this is definitely an option.

scottynomad commented 8 years ago

@kytrinyx If splitting this data out of postgres into another data source is an option then redis could be well suited to the task. I think it would look like an ordered set of viewed solution ids (ordered most recent first), for each tuple of (user, exercise).

Showing the unread count becomes a matter of subtracting the cardinality of the set from the total number of solutions. And to render the list of solutions, you just pull the first bunch of items off the set in redis and mark those solutions as read.

Would be fun to try loading the data into redis to see how big it ends up being. Would be glad to help.

kytrinyx commented 7 years ago

I've upgraded to a larger database which doesn't have the 10MM row limit.

kytrinyx commented 7 years ago

Thanks so much, everyone, for helping think this through! :sparkles:

kotp commented 7 years ago

@kytrinyx Still out of pocket? Or did we find a sponsor?

kytrinyx commented 7 years ago

Heroku is giving us platform credits! (so: found a sponsor)