casmlab / stack

The BITS Lab STACK tool for social media collection and analysis.
http://bits.ischool.syr.edu/
MIT License

Dedupe database script #16

Closed libbyh closed 7 years ago

libbyh commented 7 years ago

Twitter's "follow" and "track" API filters return two copies of replies. So, we should have a script that runs regularly to dedupe the database. It should make sure every 'id_str' is unique.
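
A minimal sketch of the check such a script would enforce, written in plain Python over already-loaded tweet dicts rather than against the live database. The function name `dedupe_by_id_str` and the `source_filter` field are hypothetical, and keeping the first copy seen is an assumption; the thread does not say which copy to keep.

```python
def dedupe_by_id_str(tweets):
    """Return tweets with at most one document per 'id_str' (first copy wins)."""
    seen = set()
    unique = []
    for tweet in tweets:
        key = tweet["id_str"]
        if key in seen:
            continue  # a later copy of a tweet we already kept
        seen.add(key)
        unique.append(tweet)
    return unique


# Hypothetical sample: the same reply delivered by both filters.
sample = [
    {"id_str": "1", "source_filter": "follow"},
    {"id_str": "1", "source_filter": "track"},
    {"id_str": "2", "source_filter": "track"},
]
print([t["id_str"] for t in dedupe_by_id_str(sample)])
```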

jhemsley commented 7 years ago

Really? Do you mean if you use both at the same time going into the same db, or do you mean that if you just run a follow you get dup replies? Always, or just sometimes?

By the way, when you index with mongo you can tell it to drop dups and it will do it for you.

libbyh commented 7 years ago

Yeah, when you run both into the same database you get dupes - e.g., "follow" realDonaldTrump and "track" realDonaldTrump show 2 copies of replies

(Sent by email in reply to Jeff Hemsley's comment of Wed, Apr 12, 2017, 1:40 PM.)

jhemsley commented 7 years ago

Got it. Yeah, that makes sense. We periodically index on id_str with drop dups to fix dup problems. You do get dups 'naturally' sometimes even with just one collector. The Twitter API seems to have some minor bugs.

libbyh commented 7 years ago

Drop dups isn't available in mongo 3.0+, so we're trying something like:

  1. Create collection dump with mongodump
  2. Clear collection
  3. Add unique index
  4. Restore collection with mongorestore
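
The four steps above could be scripted roughly as follows. This is a sketch, not what the lab actually ran: the database name `stack`, the collection name `tweets`, the dump directory, and the legacy `mongo` shell binary are all assumptions, and by default the script only prints the commands. It relies on mongorestore continuing past duplicate-key errors by default (that is, without `--stopOnError`), so copies that violate the unique index are skipped on restore.

```python
import subprocess


def build_dedupe_commands(db="stack", coll="tweets", dump_dir="dump"):
    """Return the shell commands for dump -> clear -> unique index -> restore."""
    # mongodump writes <out>/<db>/<collection>.bson
    bson_path = f"{dump_dir}/{db}/{coll}.bson"
    # Steps 2 and 3 run in the shell: empty the collection, then add the index.
    js = (
        f"db.{coll}.remove({{}}); "
        f"db.{coll}.createIndex({{id_str: 1}}, {{unique: true}})"
    )
    return [
        ["mongodump", "--db", db, "--collection", coll, "--out", dump_dir],
        ["mongo", db, "--eval", js],
        ["mongorestore", "--db", db, "--collection", coll, bson_path],
    ]


def run(commands, dry_run=True):
    """Print each command; execute it only when dry_run is False."""
    for cmd in commands:
        print(" ".join(cmd))
        if dry_run:
            continue
        # Abort if the dump or the clear/index step fails, so the collection
        # is never emptied without a backup. The restore's exit status is not
        # checked, since skipped duplicates may be reported as failures
        # (assumption).
        subprocess.run(cmd, check=(cmd[0] != "mongorestore"))


run(build_dedupe_commands())
```

Checking the dump before clearing anything is the important design choice here: step 2 is destructive, so the backup has to exist first.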

pratik27shah commented 7 years ago

To find duplicates in mongodb: db.tweets.aggregate([{$group:{_id:{id_str:"$id_str"},count:{$sum:1},docs:{$push:"$_id"}}},{$match:{count:{$gt:1}}}]) returns every id_str (tweet id) that appears more than once, along with the count and the _ids of its copies.
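
Once the aggregation has listed the duplicate groups, one way to remove them is to keep the first _id in each group and delete the rest. The sketch below writes the same pipeline as a Python list (the shape pymongo's `aggregate` accepts) and adds a pure helper. The name `ids_to_remove` is hypothetical, and keeping the first copy is an assumption, since the thread does not say which copy was kept.

```python
# The pipeline from the comment above, written as a Python list of stages.
DUPLICATE_PIPELINE = [
    {"$group": {
        "_id": {"id_str": "$id_str"},
        "count": {"$sum": 1},
        "docs": {"$push": "$_id"},
    }},
    {"$match": {"count": {"$gt": 1}}},
]


def ids_to_remove(groups):
    """Given aggregation results, return every _id except the first per group."""
    extra = []
    for group in groups:
        extra.extend(group["docs"][1:])  # keep docs[0], drop the other copies
    return extra


# Hypothetical aggregation output for one duplicated tweet.
groups = [{"_id": {"id_str": "1"}, "count": 2, "docs": ["a", "b"]}]
print(ids_to_remove(groups))
```

With pymongo, the returned list could be passed to `delete_many({"_id": {"$in": ids}})` on the collection. On a large collection the aggregation may also need `allowDiskUse=True`.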

pratik27shah commented 7 years ago

Implemented for rogue, potus, and mil. The backup of these databases is stored in mnt/stack2/dbbackup282017. Works fine; all duplicates are removed.