casmlab / stack

The BITS Lab STACK tool for social media collection and analysis.
http://bits.ischool.syr.edu/
MIT License

Tweets aren't showing up in view or DB but are showing as processed #23

Closed libbyh closed 7 years ago

libbyh commented 7 years ago

In the view, no new tweets are visible.

I checked the Beckett DB with the following, and didn't see them there either:

db.tweets.find().sort({ created_at: -1 }).limit(10)
db.tweets.find({id : 872809200711221248 })

However, when I check /mnt/stor2/stack/data/potus45-5886bdea21e38564ac1ccfd8/twitter/archive (on Beckett), they show as processed:

20170609-16-potus-mentions-5886bdea21e38564ac1ccfd8-58e8408421e38520c599575e-tweets_out.json
20170609-16-potus-mentions-5886bdea21e38564ac1ccfd8-58e8408421e38520c599575e-tweets_out_processed.json

Files before May 31 are backed up on S3 now, so look there for older data.

libbyh commented 7 years ago

May be a duplicate of #25

libbyh commented 7 years ago

Using db.tweets.find().sort({ created_at: -1 }).limit(10)

The second-to-last tweet in the POTUS db is this one with a created_at of "Wed Mar 22 23:59:57 +0000 2017". The last tweet is from a deleted account.

Is this related to the de-dup code? Is there an insert problem? Not sure.
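
One thing worth ruling out with that query: if created_at is stored as the raw string from the Twitter JSON (an assumption; the thread does not show the field type), then sort({ created_at: -1 }) orders lexicographically, weekday name first, rather than by time. A minimal Python sketch with made-up sample timestamps shows the difference:

```python
from datetime import datetime

# Twitter's classic created_at layout, e.g. "Wed Mar 22 23:59:57 +0000 2017".
TWITTER_TIME_FORMAT = "%a %b %d %H:%M:%S %z %Y"


def parse_created_at(value):
    """Parse a Twitter-style created_at string into an aware datetime."""
    return datetime.strptime(value, TWITTER_TIME_FORMAT)


# Hypothetical sample values, not taken from the database.
samples = [
    "Wed Mar 22 23:59:57 +0000 2017",
    "Sat Apr 22 10:00:00 +0000 2017",
    "Fri Jun 09 16:00:00 +0000 2017",
]

# Plain string comparison: "Wed" sorts after "Sat" and "Fri".
newest_by_string = max(samples)
# Chronological comparison after parsing.
newest_by_time = max(samples, key=parse_created_at)

print(newest_by_string)  # Wed Mar 22 23:59:57 +0000 2017
print(newest_by_time)    # Fri Jun 09 16:00:00 +0000 2017
```

If that assumption holds, a March tweet near the top of the result would not by itself show that inserts stopped in March.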

pratik27shah commented 7 years ago

The command used to check has a typo: it should use id_str and not just id, since id_str is the indexed field in the DB.

libbyh commented 7 years ago

Looks like we have an "id" field from the JSON and "_id" is the key Mongo creates for each unique record. Are _id and id equivalent?

There's an index on id_str, so we can also use

db.tweets.find({ id_str : "872809200711221248" })

Similarly, the time index is on created_ts, so use

db.tweets.find().sort({ created_ts: -1 }).limit(10)

instead for faster results.

db.tweets.find({ id_str: "855709770870910976" }) returns the last inserted tweet, which is from April 22, 2017.
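
On the id question: _id is the key Mongo generates for each document (an ObjectId by default), while id and id_str both come from the tweet JSON, so _id and id are not equivalent. A side note on why id_str is the safer field to query: the legacy mongo shell treats a bare numeric literal as a 64-bit float, which cannot represent every integer above 2^53. The Python sketch below (the helper name is made up for illustration) checks whether an ID survives that conversion. The ID from this issue happens to survive, so precision is not what hid that tweet, but a neighboring value does not.

```python
def survives_double_round_trip(tweet_id):
    """Return True if tweet_id is unchanged after conversion to a 64-bit float."""
    return int(float(tweet_id)) == tweet_id


issue_id = 872809200711221248   # ID queried earlier in this thread
neighbor_id = issue_id + 1      # hypothetical ID, for illustration only

print(survives_double_round_trip(issue_id))     # True
print(survives_double_round_trip(neighbor_id))  # False
print(str(issue_id) == "872809200711221248")    # True: the string form is exact
```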

pratik27shah commented 7 years ago

The duplicate tweet issue is causing a problem during insertion.
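
The thread does not show how STACK's de-dup or insert code handles repeats, so the following is only a sketch of one common approach, not the project's method: drop records whose id_str has already been seen before handing the batch to Mongo. The function name and sample records are hypothetical.

```python
def dedupe_by_id_str(tweets):
    """Keep the first tweet seen for each id_str, preserving input order."""
    seen = set()
    unique = []
    for tweet in tweets:
        key = tweet["id_str"]
        if key in seen:
            continue  # a tweet with this id_str was already kept
        seen.add(key)
        unique.append(tweet)
    return unique


# Hypothetical batch containing one repeated tweet.
batch = [
    {"id_str": "872809200711221248", "text": "first copy"},
    {"id_str": "855709770870910976", "text": "another tweet"},
    {"id_str": "872809200711221248", "text": "second copy"},
]
unique_batch = dedupe_by_id_str(batch)
print(len(unique_batch))  # 2
```

If the index on id_str is unique (the thread does not say), a batch that still contains a repeat would raise a duplicate-key error on insert, and an ordered bulk insert stops at the first such error.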

pratik27shah commented 7 years ago

https://stackoverflow.com/questions/6499268/mongodb-connection-refused

libbyh commented 7 years ago

Successfully restarted Mongo. Does it fix it?

On Jun 27, 2017, at 11:19 AM, Pratik Shah notifications@github.com wrote:

https://stackoverflow.com/questions/6499268/mongodb-connection-refused

pratik27shah commented 7 years ago

Added extra logging statements to trace the flow steps in more detail.

pratik27shah commented 7 years ago

new tweets are now displayed with proper updates

libbyh commented 7 years ago

On Waverly, the issue was

Traceback (most recent call last):
  File "__main__.py", line 448, in <module>
    c.process_command(command)
  File "/home/libbyh/github/casmlab/stack/app/controller.py", line 140, in process_command
    self.start()
  File "/home/libbyh/github/casmlab/stack/app/controller.py", line 177, in start
    self.run()
  File "/home/libbyh/github/casmlab/stack/app/controller.py", line 317, in run
    mongoBatchInsert.go(self.project_id, self.rawdir, self.insertdir, self.logdir)
  File "/home/libbyh/github/casmlab/stack/app/twitter/mongoBatchInsert.py", line 136, in go
    queued_tweets_file_list = get_processed_tweet_file_queue(Config, insertdir)
  File "/home/libbyh/github/casmlab/stack/app/twitter/mongoBatchInsert.py", line 55, in get_processed_tweet_file_queue
    logger.info('RAW tweet list retrival completed %s' % tweetFileNamePattern)
NameError: global name 'logger' is not defined

in /.../stack/out/mil2-58e844bb21e38548ecb86364/std/mil2-insert-twitter-58e844bb21e38548ecb86364-stderr.txt

Adding a line with global logger to fix it.
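
The NameError above is what Python raises when a function reads a module-level name that was never bound. Below is a minimal sketch of that failure and of the global logger fix, using hypothetical function names rather than the actual mongoBatchInsert.py code (Python 3 words the message slightly differently from the Python 2 traceback above):

```python
import logging


def setup_without_global():
    # This assignment creates a local variable only; no module-level
    # `logger` exists after the function returns.
    logger = logging.getLogger("insert-sketch")
    return logger


def setup_with_global():
    # `global` makes the assignment bind the module-level name instead.
    global logger
    logger = logging.getLogger("insert-sketch")


def get_queue():
    # Reads the module-level `logger`; fails if it was never bound.
    logger.info("queue retrieval completed")
    return []


setup_without_global()
try:
    get_queue()
except NameError as exc:
    print("before fix:", exc)

setup_with_global()
print("after fix:", get_queue())
```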