christospappas / streama

Streama is a simple activity stream gem for use with the Mongoid ODM framework.
MIT License
132 stars 30 forks source link

Schema changes to support MongoDB sharding by recipient #3

Closed joe1chen closed 12 years ago

joe1chen commented 13 years ago

Hi Christos, I think the changes for sharding support are ready now. The API is almost entirely backwards compatible with the original version. The major differences are: 1) The Activity.publish function returns an array of activities, rather than a single activity. 2) The options now support a :receiver argument in addition to the :receivers argument. The :receiver argument was added to support passing in just a single receiver (useful for publishing to self). 3) There is no longer an easy way to re-publish, so I've commented out those test methods (see activity_spec.rb).

All of the specs should be passing now with the exception of the republish test I mentioned above.

Joe

christospappas commented 13 years ago

Thanks Joe,

The only thing i'm concerned about is say you have a few thousand followers, it will return way too many activity objects which will take up a lot of memory. There has to be a better way to insert activity, maybe bypassing mongoid all together and just use the driver to iterate over the receiver list? That would also improve insert speeds dramatically as it won't need to construct all those objects through mongoid.

Thanks for contributing to this project, make sure you add your name to the contributors list before I pull in any changes.

Christos Pappas

On Tuesday, 12 July 2011 at 3:32 AM, joe1chen wrote:

Hi Christos, I think the changes for sharding support are ready now. The API is almost entirely backwards compatible with the original version. The major differences are: 1) The Activity.publish function returns an array of activities, rather than a single activity. 2) The options now support a :receiver argument in addition to the :receivers argument. The :receiver argument was added to support passing in just a single receiver (useful for publishing to self). 3) There is no longer an easy way to re-publish, so I've commented out those test methods (see activity_spec.rb).

All of the specs should be passing now with the exception of the republish test I mentioned above.

Joe

Reply to this email directly or view it on GitHub: https://github.com/christospappas/streama/pull/3

joe1chen commented 13 years ago

Hi Christos, Good point regarding the memory usage. The easiest way to address that issue for now is to not return the array of activities from Activity.publish. I made that change and updated the specs to pass. I also added my name to the credits section of the README (thanks for including me). I agree that we can probably drop into the driver to iterate through the receiver list. I'll take a quick look to see if there's an easy way to do this.

christospappas commented 13 years ago

I think we will need to drop to the driver to do these inserts. At the moment every time you create an activity it will load instances of the actor, object and target to cache. We should really only be doing this once per publish.

I'll try make some time next week to work through this.

joe1chen commented 13 years ago

Looks like you can batch insert by calling Activity.collection.insert(json) where json is the json for multiple documents.

I did a couple of benchmarks using batch insert method. Here's are the tests I added to activity_spec.rb:

describe 'batch insertion performance test' do

    max_batch_size = 500
    num_followers = 50000

    it "no batch insert" do

      (1..num_followers).each do |n|
        activity = Activity.new({:verb => :enquiry, :receiver => user, :actor => user, :object => enquiry, :target => listing})
      end
    end

    it "batch insert" do

      batch = []
      (1..num_followers).each do |n|
        activity = {}
        activity["actor"] = {}
        activity["actor"]["type"] = user.class.to_s
        activity["actor"]["id"] = user._id
        activity["actor"]["full_name"] = user.full_name

        activity["object"] = {}
        activity["object"]["type"] = enquiry.class.to_s
        activity["object"]["id"] = enquiry._id
        activity["object"]["comment"] = enquiry.comment

        activity["target"] = {}
        activity["target"]["type"] = listing.class.to_s
        activity["target"]["id"] = listing._id
        activity["target"]["title"] = listing.title

        activity["receiver"] = {}
        activity["receiver"]["id"] = receiver._id
        activity["receiver"]["type"] = receiver.class.to_s

        batch << activity

        if batch.size % max_batch_size == 0
          Activity.collection.insert(batch)
          batch = []
        end
      end

      Activity.collection.insert(batch)
    end

  end

And the results:

$ rspec spec --profile
..................................

Top 10 slowest examples:
  Activity batch insertion performance test no batch insert
    11.69 seconds ./spec/lib/activity_spec.rb:176
  Activity batch insertion performance test batch insert
    3.71 seconds ./spec/lib/activity_spec.rb:183

So a big improvement with the batch insert. I also played around with the batch size. Seems like 500 items was the sweet spot. When I set it to 50000 items, the insertion time jumped to about 7 seconds. When I decreased it to 200 items, the time also increased to around 5.6 seconds.

I haven't looked into the DSL code closely enough to figure out how to generate the JSON (including the cached fields) based on the activity's verb but once we figure that out then I think we're set.

I'm also assuming that the Ruby MongoDB driver's batch insert will properly work in a sharded environment. It looks like it should but I haven't tested it yet.

joe1chen commented 13 years ago

Hi Chris, The batch insert code has been checked in. When you get a chance, take a look at batch_insert in the activity.rb file. Just as an experiment, I also added the faststep gem (https://github.com/joshuaclayton/faststep), which is a C-extension on top of the C Mongo driver, and did the batch insert using faststep.

Top 10 slowest examples:
  Activity batch insertion performance test no batch insert
    15.42 seconds ./spec/lib/activity_spec.rb:237
  Activity batch insertion performance test batch insert
    6.35 seconds ./spec/lib/activity_spec.rb:244
  Activity batch insertion performance test batch insert with faststep
    5.19 seconds ./spec/lib/activity_spec.rb:282

So now inserting 50000 items, the batch insert is about 5 seconds with faststep, 6 seconds with the Mongo Ruby driver, and ~15 seconds using Mongoid (without batch).

Faststep looks promising but is very new, so I will probably stay away from it for now.

christospappas commented 13 years ago

Nice work! will take a look soon...

neiltron commented 12 years ago

Any chance of these changes being merged?

digitalplaywright commented 12 years ago

@neiltron @joe1chen Commenting is essential for many activity feeds. It seems like this generates one activity object per user. How would you support commenting on activity feed items with this schema change? In the current streama implementation it's trivial to just add an embedded collection for comments and re-publish. [Edited to include @joe1chen]

neiltron commented 12 years ago

Not sure which commit you're referring to. The only one I actually authored was adding the actor_activity_stream method. Maybe you meant to mention @joe1chen?

joe1chen commented 12 years ago

@digitalplaywright As with any change to denormalize data, it becomes more challenging to update data because data has been duplicated around in the system.

It is definitely not as trivial as adding an embedded collection for comments, although that can be the start:

class Activity
  include Streama::Activity
  embeds_many :comments

  activity :some_action do
    actor :user, :cache => [:username, :avatar]
    object :some_object, :cache => []
    receiver :user, :cache => []
  end
end

class Comment
  include Mongoid::Document
  include Mongoid::Timestamps

  field :name
  field :body 

  embedded_in :activity
end

Now when a user adds a comment to an activity, only that one activity for that user has been updated. We'd have to add a mechanism for updating all of the other activities for the followers. There are 2 parts to this:

1) We should have a unique identifier of some kind that allows us to select all the activities that need updating. I'm thinking that we could use a timestamp for this. So looking at the code right now, to support the mechanism above, we can change the :created_at timestamp to be the same when an activity is published (currently it is different for each user's activity).

2) We would add a method that selects all the activities based on this timestamp, and then performs the update. In the case of an embedded comments object, we would need to push the comment into each activity.

A rough example of how this would look if using the Ruby driver might look like:

Activity.collection.update({ "created_at" => timestamp }, {"$push"=>{"comments"=>{ "name" : "Joe", "body" : "This is my comment."}}, { :multi => true })

or

Activity.collection.update({ "created_at" => timestamp }, {"$push"=>new_comment.to_json}, { :multi => true })

I haven't tested the code above yet, but hopefully you get the idea.

In the meantime, I will make the timestamp change so that the above can be supported.

neiltron commented 12 years ago

:clap:

christospappas commented 12 years ago

Hi Joe/Neil,

So i've been looking through these changes and i'm not totally sure I want to pull them in yet, here are my reasons:

I agree that sharding is a good idea/necessary for large scale apps, however for most apps the existing structure works well...

I don't really know how others are using this gem so it would be good to get some more feedback from others before moving in this direction as it changes things a lot.