Closed libbyh closed 7 years ago
Really? Do you mean if you use both at the same time going into the same db, or do you mean if just run a follow you get dup replies? Always, or just sometimes?
By the way, when you index with mongo you can tell it to drop dups and it will do it for you.
Yeah, when you run both into the same database you get dupes - e.g., "follow" realDonaldTrump and "track" realDonaldTrump show 2 copies of replies
On Wed, Apr 12, 2017 at 1:40 PM, Jeff Hemsley notifications@github.com wrote:
Really? Do you mean if you use both at the same time going into the same db, or do you mean if just run a follow you get dup replies? Always, or just sometimes?
By the way, when you index with mongo you can tell it to drop dups and it will do it for you.
— You are receiving this because you authored the thread. Reply to this email directly, view it on GitHub https://github.com/casmlab/stack/issues/16#issuecomment-293670325, or mute the thread https://github.com/notifications/unsubscribe-auth/AALIzkg1tlIpNeS0EAChHC5zfy1QX0iqks5rvRq3gaJpZM4M7zxV .
Got it. Yeah, that makes sense. Yeah, we periodically index on id_str with drop dups to fix dup problems. You do get dups 'naturally' sometimes even with just one collector. Twitter api has some minor bugs it seems.
Drop dups isn't available in mongo 3.0+, so we're trying something like
db.tweets.aggregate({$group:{_id:{id_str:"$id_str"},count:{$sum:1},docs:{$push:"$_id"}}},{$match:{count:{$gt:1}}}) returns all duplicates having more then one time the same id_Str(tweet id) to find duplicates in mongodb
implemented for rogue,potus,mil,the backup of theses databases is stored in mnt/stack2/dbbackup282017, works fine all duplicates are removed
Twitter's "follow" and "track" API filters return two copies of replies. So, we should have a script that runs regularly to dedupe the database. It should make sure there are unique 'id_str'.