ginatrapani closed this issue 13 years ago.
I am going to start playing around with this (which certainly doesn't preclude others doing the same-- just let me know so we can coordinate).
[Just catching up] - How's the Site Streams experiment going? Dying to hear about it...
It's working great! It is so much fun to have running. I think I might have sent the following while you were on vacation: http://groups.google.com/group/thinkupapp/browse_thread/thread/1f1076fd3eb23f39/d7ee292bb7f93ce3#d7ee292bb7f93ce3
Since then, I have had the script running pretty much all the time my machine is awake :). It has only dropped the connection once; it seems quite solid. The script covers the same semantics as the regular ThinkUp crawler script, with (I believe) only one exception (see below). By 'same semantics', I mean that it adds the same post info to the database, updates the retweet and reply cache counts, adds the links to the links table, checks for and expands image URLs, checks for 'old-style' retweet syntax, etc.
The script does some additional stuff as well, e.g. storing the rest of the 'entities' information (hashtags, mentions), and the 'place urls', as described in that earlier msg.
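To make the flow a bit more concrete, here is a minimal Python sketch of what handling a single status from the stream looks like in spirit. The field names come from Twitter's JSON, but the `db.insert_*` calls and the exact column mapping are purely illustrative, not the actual code:

```python
import json

def handle_status(raw_line, db):
    """Illustrative handler for one status message from the stream (not the real code)."""
    status = json.loads(raw_line)
    if 'text' not in status:
        return  # not an ordinary tweet (could be a delete, follow, or list event)

    # The same basic post info the regular ThinkUp crawler stores.
    post = {
        'post_id': status['id'],
        'author_user_id': status['user']['id'],
        'author_username': status['user']['screen_name'],
        'post_text': status['text'],
        'pub_date': status['created_at'],
        'in_reply_to_post_id': status.get('in_reply_to_status_id'),
    }
    db.insert_post(post)  # hypothetical DAO call

    # Streams hand us the parsed 'entities' directly, so hashtags, mentions,
    # and URLs don't have to be re-extracted from the text.
    entities = status.get('entities', {})
    for url in entities.get('urls', []):
        db.insert_link(post['post_id'], url.get('expanded_url') or url['url'])
    for tag in entities.get('hashtags', []):
        db.insert_hashtag(post['post_id'], tag['text'])
    for mention in entities.get('user_mentions', []):
        db.insert_mention(post['post_id'], mention['id'])
```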
It is also able to gather favorites information in both directions (not only what the user has favorited, but also the user's tweets that others have favorited), and it can gather some information with less effort than the crawler can. In particular, it scoops up all timeline retweets and replies, for both the user and the people they follow, that occur while it is connected. That makes it easy to gather friends' replies to each other (not just to the user).
The stream also provides info on list and friend/follower changes, and friends' deletes. I'm not doing anything with the follow and list information yet, but should. If I see a delete of a post that's in the database, I do print it out for my own amusement :).
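For a sense of what that looks like on the wire, here's a rough Python sketch of telling the different stream message types apart. The handler bodies are just prints, and none of these names are from my actual script:

```python
import json

def dispatch(raw_line):
    """Illustrative dispatch over the message types a user/site stream delivers."""
    msg = json.loads(raw_line)
    if 'delete' in msg:
        # A post (possibly one we already have in the database) was deleted.
        print('delete of post', msg['delete']['status']['id'])
    elif 'event' in msg:
        # Follow, favorite, list_member_added, and similar notifications.
        print('event:', msg['event'], 'from', msg['source']['screen_name'])
    elif 'friends' in msg:
        # The list of friend ids sent when the stream first connects.
        print(len(msg['friends']), 'friend ids received')
    elif 'text' in msg:
        print('ordinary status:', msg['text'])
```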
One thing that's handled differently from the crawler is setting 'in_retweet_of_post_id' for an 'old-style' RT. Unlike the crawler, this script does not make a set of REST API calls to see if it can find the original post. Instead, it sets a new tu_posts value, 'in_rt_of_user', since that information is always available.
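For illustration, the old-style check is basically a pattern match on the tweet text; something along these lines (the regex and helper name are approximate, not lifted from the script):

```python
import re

# Matches 'old-style' manual retweets of the form: RT @someone: original text
OLD_STYLE_RT = re.compile(r'^RT\s+@(\w+)', re.IGNORECASE)

def rt_of_user(post_text):
    """Return the username being retweeted in an old-style RT, or None."""
    match = OLD_STYLE_RT.match(post_text)
    return match.group(1) if match else None

# The stream handler would store this in the new 'in_rt_of_user' column rather
# than making extra REST calls to hunt down the original post id.
print(rt_of_user('RT @ginatrapani: trying out the streaming collector'))  # -> ginatrapani
```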
[It's an interesting design decision as to whether we want to slow down the stream processing with additional API calls and supplemental processing. I tend to think we should not, and instead have the crawler take that role. Also, as mentioned in my earlier msg, I think this is a good impetus to split the twitter crawler into decoupled subparts].
There's been no problem in terms of correctness (nor should there be) with running the stream collection and the crawler script at the same time, as detailed a bit more in that earlier msg. The model does make use of the 'unique' constraint so that both scripts can simply dump stuff in the database, ignoring cases where the item in question is already stored.
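Concretely, 'dump and ignore duplicates' can be as simple as catching the duplicate-key error on insert. A minimal sketch, assuming a MySQL backend and with the column list trimmed down for illustration:

```python
import MySQLdb  # assuming the MySQLdb driver; the same idea works with any client

def insert_post_if_new(conn, post_id, author_user_id, post_text):
    """Insert a post, silently skipping it if the unique key says it's already stored."""
    cursor = conn.cursor()
    try:
        cursor.execute(
            "INSERT INTO tu_posts (post_id, author_user_id, post_text) "
            "VALUES (%s, %s, %s)",
            (post_id, author_user_id, post_text))
        conn.commit()
    except MySQLdb.IntegrityError:
        # Duplicate key: the crawler (or an earlier stream message) already stored it.
        conn.rollback()
```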
There is some duplication of effort on the crawler's part (especially if the stream collection were to be running nearly all the time). We can give more thought as to what bookkeeping is needed to let the two scripts work together a little more intelligently, but we can't assume the streaming script is running continuously. Note that the crawler is doing some 'depth-first' collection of stuff that the streaming script can't accumulate, so it's not redundant even if the streaming script does run all the time.
Some short-term next steps:
Once those basics are in place we can start thinking about the really fun stuff.
Btw, I've been v. busy again for a bit but should be MUCH more freed up in another few weeks, which hopefully should translate to a burst of productivity w/ ThinkUp. [Freed up to the point where I suppose I should think about getting a 'proper' job again (sigh), but that is another story.]
Wow Amy, you've made so much progress! I'd love to take a look at the source. I know you're busy, but if you upload it to GitHub anywhere (even as just a gist) let me know. If not, no big deal.
As for the question about whether or not streaming should be the same plugin/app as the REST API, I think it probably should. Why not?
So exciting!
I would love to have you play with it/check it out (and of course I am always looking for opportunities to procrastinate on what I am supposed to be doing right now, which is not as fun). I will clean it up just a bit and upload it, probably later today.
It will require some database migrations (which I will include) and so you will want to start with a copy of your thinkup database. However, the crawler will be fine with the database changes, and so you can run a version of the crawler that points to the modified database if you like.
Okay! Here is the code:
http://s3.amazonaws.com/aju_work/thinkup_streaming.zip
It's Python. Start with the THINKUP_README.txt file in the top-level directory. It's not super well documented, but I think it should point you to everything you need to do. Hopefully it won't take much to get it running.
(Don't be surprised by the number of files-- as you will see, I've included the tweepy lib for convenience.) Do work off a copy of the database, not your 'real' db.
Please ask if you have any questions or see any issues. I am excited to have someone else try out this stuff too.
Just FYI, for the PHP port I am going to start by playing w/ Phirehose (http://code.google.com/p/phirehose/downloads/list) with its userstreams/oauth patch and see what that provides.
Hi Gina (and anyone else interested),
Here is a bit of an update on the streaming port. It's going well. I've nearly finished the basics of the port of the original python app, modulo some tidying up and test-writing. I've made a number of design decisions in the process-- none that can't be undone though.
In an effort to keep the comments on this issue page from blowing up, I wrote more here: https://gist.github.com/e8b41b9a370eef77480b
Update: oops, forgot to mention a major point in the link above. I'm getting the stream data in JSON, not XML, and wrote a JSON parser class for it. I believe Twitter is trying to move towards JSON exclusively in the future. But if this seems like a bad move, let me know.
(commenting here since you probably don't get pinged on updates to my gists)
Re: Redis (and other key-value store) alternatives, a couple other approaches would be:
One thing that might make sense is a config flag that indicates whether to use Redis; if it is not set, fall back to the SQL database queue.
Also-- maybe memcached itself would be appropriate. We don't necessarily need this store to be persistent, since we are not using it for complete capture of tweets; the crawler will fill in the gaps and do a deeper search anyway. However, for those who do not have much control over their servers, memcached might be problematic too. An SQL-db-based queue would work for everyone but would certainly increase the level of database activity.
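Here's a rough sketch of what that config-flag approach could look like -- written in Python for brevity even though the real thing would be PHP, with the `stream_queue` table and config key names entirely made up:

```python
import json

class RedisQueue:
    """Queue backed by Redis -- fast, but requires a Redis server on the host."""
    def __init__(self, redis_client, key='thinkup:stream'):
        self.redis, self.key = redis_client, key

    def push(self, item):
        self.redis.rpush(self.key, json.dumps(item))

    def pop(self):
        raw = self.redis.lpop(self.key)
        return json.loads(raw) if raw else None


class DbQueue:
    """Fallback queue kept in an ordinary SQL table; works on any shared host."""
    def __init__(self, conn):
        self.conn = conn

    def push(self, item):
        cursor = self.conn.cursor()
        cursor.execute("INSERT INTO stream_queue (payload) VALUES (%s)",
                       (json.dumps(item),))
        self.conn.commit()

    def pop(self):
        cursor = self.conn.cursor()
        cursor.execute("SELECT id, payload FROM stream_queue ORDER BY id LIMIT 1")
        row = cursor.fetchone()
        if row is None:
            return None
        cursor.execute("DELETE FROM stream_queue WHERE id = %s", (row[0],))
        self.conn.commit()
        return json.loads(row[1])


def make_queue(config, redis_client=None, db_conn=None):
    """Pick the queue backend from a single config flag; default to the SQL queue."""
    if config.get('use_redis'):
        return RedisQueue(redis_client)
    return DbQueue(db_conn)
```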
Uh-oh, I don't think Anil sees this. I'll send it to him... though maybe we should move the discussion to the mailing list?
Doing some issue tracker cleanup tonight. I'm going to close this one since we've moved the streaming plugin discussion to a newer issue and onto the dev mailing list.
Push vs pull updates for users became available on 8/30/2010:
http://dev.twitter.com/pages/site_streams