Closed joe1chen closed 12 years ago
Thanks Joe,
The only thing i'm concerned about is say you have a few thousand followers, it will return way too many activity objects which will take up a lot of memory. There has to be a better way to insert activity, maybe bypassing mongoid all together and just use the driver to iterate over the receiver list? That would also improve insert speeds dramatically as it won't need to construct all those objects through mongoid.
Thanks for contributing to this project, make sure you add your name to the contributors list before I pull in any changes.
Christos Pappas
On Tuesday, 12 July 2011 at 3:32 AM, joe1chen wrote:
Hi Christos, I think the changes for sharding support are ready now. The API is almost entirely backwards compatible with the original version. The major differences are: 1) The Activity.publish function returns an array of activities, rather than a single activity. 2) The options now support a :receiver argument in addition to the :receivers argument. The :receiver argument was added to support passing in just a single receiver (useful for publishing to self). 3) There is no longer an easy way to re-publish, so I've commented out those test methods (see activity_spec.rb).
All of the specs should be passing now with the exception of the republish test I mentioned above.
Joe
Reply to this email directly or view it on GitHub: https://github.com/christospappas/streama/pull/3
Hi Christos, Good point regarding the memory usage. The easiest way to address that issue for now is to not return the array of activities from Activity.publish. I made that change and updated the specs to pass. I also added my name to the credits section of the README (thanks for including me). I agree that we can probably drop into the driver to iterate through the receiver list. I'll take a quick look to see if there's an easy way to do this.
I think we will need to drop to the driver to do these inserts. At the moment every time you create an activity it will load instances of the actor, object and target to cache. We should really only be doing this once per publish.
I'll try make some time next week to work through this.
Looks like you can batch insert by calling Activity.collection.insert(json)
where json is the json for multiple documents.
I did a couple of benchmarks using batch insert method. Here's are the tests I added to activity_spec.rb:
describe 'batch insertion performance test' do
max_batch_size = 500
num_followers = 50000
it "no batch insert" do
(1..num_followers).each do |n|
activity = Activity.new({:verb => :enquiry, :receiver => user, :actor => user, :object => enquiry, :target => listing})
end
end
it "batch insert" do
batch = []
(1..num_followers).each do |n|
activity = {}
activity["actor"] = {}
activity["actor"]["type"] = user.class.to_s
activity["actor"]["id"] = user._id
activity["actor"]["full_name"] = user.full_name
activity["object"] = {}
activity["object"]["type"] = enquiry.class.to_s
activity["object"]["id"] = enquiry._id
activity["object"]["comment"] = enquiry.comment
activity["target"] = {}
activity["target"]["type"] = listing.class.to_s
activity["target"]["id"] = listing._id
activity["target"]["title"] = listing.title
activity["receiver"] = {}
activity["receiver"]["id"] = receiver._id
activity["receiver"]["type"] = receiver.class.to_s
batch << activity
if batch.size % max_batch_size == 0
Activity.collection.insert(batch)
batch = []
end
end
Activity.collection.insert(batch)
end
end
And the results:
$ rspec spec --profile
..................................
Top 10 slowest examples:
Activity batch insertion performance test no batch insert
11.69 seconds ./spec/lib/activity_spec.rb:176
Activity batch insertion performance test batch insert
3.71 seconds ./spec/lib/activity_spec.rb:183
So a big improvement with the batch insert. I also played around with the batch size. Seems like 500 items was the sweet spot. When I set it to 50000 items, the insertion time jumped to about 7 seconds. When I decreased it to 200 items, the time also increased to around 5.6 seconds.
I haven't looked into the DSL code closely enough to figure out how to generate the JSON (including the cached fields) based on the activity's verb but once we figure that out then I think we're set.
I'm also assuming that the Ruby MongoDB driver's batch insert will properly work in a sharded environment. It looks like it should but I haven't tested it yet.
Hi Chris,
The batch insert code has been checked in. When you get a chance, take a look at batch_insert
in the activity.rb file. Just as an experiment, I also added the faststep gem (https://github.com/joshuaclayton/faststep), which is a C-extension on top of the C Mongo driver, and did the batch insert using faststep.
Top 10 slowest examples:
Activity batch insertion performance test no batch insert
15.42 seconds ./spec/lib/activity_spec.rb:237
Activity batch insertion performance test batch insert
6.35 seconds ./spec/lib/activity_spec.rb:244
Activity batch insertion performance test batch insert with faststep
5.19 seconds ./spec/lib/activity_spec.rb:282
So now inserting 50000 items, the batch insert is about 5 seconds with faststep, 6 seconds with the Mongo Ruby driver, and ~15 seconds using Mongoid (without batch).
Faststep looks promising but is very new, so I will probably stay away from it for now.
Nice work! will take a look soon...
Any chance of these changes being merged?
@neiltron @joe1chen Commenting is essential for many activity feeds. It seems like this generates one activity object per user. How would you support commenting on activity feed items with this schema change? In the current streama implementation it's trivial to just add an embedded collection for comments and re-publish. [Edited to include @joe1chen]
Not sure which commit you're referring to. The only one I actually authored was adding the actor_activity_stream method. Maybe you meant to mention @joe1chen?
@digitalplaywright As with any change to denormalize data, it becomes more challenging to update data because data has been duplicated around in the system.
It is definitely not as trivial as adding an embedded collection for comments, although that can be the start:
class Activity
include Streama::Activity
embeds_many :comments
activity :some_action do
actor :user, :cache => [:username, :avatar]
object :some_object, :cache => []
receiver :user, :cache => []
end
end
class Comment
include Mongoid::Document
include Mongoid::Timestamps
field :name
field :body
embedded_in :activity
end
Now when a user adds a comment to an activity, only that one activity for that user has been updated. We'd have to add a mechanism for updating all of the other activities for the followers. There are 2 parts to this:
1) We should have a unique identifier of some kind that allows us to select all the activities that need updating. I'm thinking that we could use a timestamp for this. So looking at the code right now, to support the mechanism above, we can change the :created_at timestamp to be the same when an activity is published (currently it is different for each user's activity).
2) We would add a method that selects all the activities based on this timestamp, and then performs the update. In the case of an embedded comments object, we would need to push the comment into each activity.
A rough example of how this would look if using the Ruby driver might look like:
Activity.collection.update({ "created_at" => timestamp }, {"$push"=>{"comments"=>{ "name" : "Joe", "body" : "This is my comment."}}, { :multi => true })
or
Activity.collection.update({ "created_at" => timestamp }, {"$push"=>new_comment.to_json}, { :multi => true })
I haven't tested the code above yet, but hopefully you get the idea.
In the meantime, I will make the timestamp change so that the above can be supported.
:clap:
Hi Joe/Neil,
So i've been looking through these changes and i'm not totally sure I want to pull them in yet, here are my reasons:
I agree that sharding is a good idea/necessary for large scale apps, however for most apps the existing structure works well...
I don't really know how others are using this gem so it would be good to get some more feedback from others before moving in this direction as it changes things a lot.
Hi Christos, I think the changes for sharding support are ready now. The API is almost entirely backwards compatible with the original version. The major differences are: 1) The Activity.publish function returns an array of activities, rather than a single activity. 2) The options now support a :receiver argument in addition to the :receivers argument. The :receiver argument was added to support passing in just a single receiver (useful for publishing to self). 3) There is no longer an easy way to re-publish, so I've commented out those test methods (see activity_spec.rb).
All of the specs should be passing now with the exception of the republish test I mentioned above.
Joe