Closed: libbyh closed this issue 7 years ago
May be a duplicate of #25
Using db.tweets.find().sort({ created_at: -1 }).limit(10)
The second-to-last tweet in the POTUS db is this one with a created_at
of "Wed Mar 22 23:59:57 +0000 2017". The last tweet is from a deleted account.
Is this related to the de-dup code? Is there an insert problem? Not sure.
The command to check has a typo: it should use id_str rather than id, because id_str is the field indexed in the db.
Looks like we have an "id" field from the JSON and "_id" is the key Mongo creates for each unique record. Are _id and id equivalent?
There's an index on id_str, so you can also use
db.tweets.find({ id_str : "872809200711221248" })
Similarly, the time index is on created_ts, so use
db.tweets.find().sort({ created_ts: -1 }).limit(10)
instead for faster results.
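Speed may not be the only difference between the two sorts. The sketch below assumes created_at is stored as the Twitter-format string quoted earlier in this thread, in which case sorting on it orders by weekday name rather than by time. The second and third sample values and the helper parse_created_at are made up for illustration; they are not from the database.

```python
from datetime import datetime

# Twitter's created_at layout, e.g. "Wed Mar 22 23:59:57 +0000 2017"
TWITTER_FORMAT = "%a %b %d %H:%M:%S %z %Y"

created_at_values = [
    "Wed Mar 22 23:59:57 +0000 2017",  # value quoted in this thread
    "Sat Apr 22 09:08:02 +0000 2017",  # hypothetical sample
    "Tue Jun 27 15:19:00 +0000 2017",  # hypothetical sample
]


def parse_created_at(value):
    """Parse a Twitter created_at string into a timezone-aware datetime."""
    return datetime.strptime(value, TWITTER_FORMAT)


# Descending string sort: compares "Wed" > "Tue" > "Sat", not the dates.
newest_by_string = sorted(created_at_values, reverse=True)[0]

# Chronological comparison on the parsed timestamps.
newest_by_time = max(created_at_values, key=parse_created_at)

print("string sort picks:", newest_by_string)
print("time sort picks:  ", newest_by_time)
```

If created_at really is a string field, this would explain why the descending sort surfaces a "Wed Mar 22" tweet ahead of later ones.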
db.tweets.find({ id_str: "855709770870910976" })
is the last inserted tweet from April 22, 2017.
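As a cross-check on that date, a tweet's creation time can be recovered from the ID itself, since Twitter IDs issued after late 2010 are Snowflake IDs that carry a millisecond timestamp in their upper bits. This is a minimal sketch; the function name tweet_id_to_utc is an assumption for illustration and is not part of the stack code.

```python
from datetime import datetime, timezone

# Snowflake epoch used by Twitter, in milliseconds since the Unix epoch.
TWITTER_EPOCH_MS = 1288834974657


def tweet_id_to_utc(id_str):
    """Return the UTC creation time encoded in a Snowflake tweet ID."""
    # The low 22 bits hold worker and sequence numbers; the rest is time.
    ms_since_unix_epoch = (int(id_str) >> 22) + TWITTER_EPOCH_MS
    return datetime.fromtimestamp(ms_since_unix_epoch / 1000, tz=timezone.utc)


created = tweet_id_to_utc("855709770870910976")
print(created.date().isoformat())
```

For the ID above this should agree with the April 22, 2017 date.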
The duplicate tweet issue is causing a problem during insertion.
Successfully restarted Mongo. Does it fix it?
On Jun 27, 2017, at 11:19 AM, Pratik Shah wrote:
https://stackoverflow.com/questions/6499268/mongodb-connection-refused
Added an extra logging statement to show the flow steps in more detail.
New tweets are now displayed with proper updates.
On Waverly, the issue was
Traceback (most recent call last):
  File "__main__.py", line 448, in <module>
    c.process_command(command)
  File "/home/libbyh/github/casmlab/stack/app/controller.py", line 140, in process_command
    self.start()
  File "/home/libbyh/github/casmlab/stack/app/controller.py", line 177, in start
    self.run()
  File "/home/libbyh/github/casmlab/stack/app/controller.py", line 317, in run
    mongoBatchInsert.go(self.project_id, self.rawdir, self.insertdir, self.logdir)
  File "/home/libbyh/github/casmlab/stack/app/twitter/mongoBatchInsert.py", line 136, in go
    queued_tweets_file_list = get_processed_tweet_file_queue(Config, insertdir)
  File "/home/libbyh/github/casmlab/stack/app/twitter/mongoBatchInsert.py", line 55, in get_processed_tweet_file_queue
    logger.info('RAW tweet list retrival completed %s' % tweetFileNamePattern)
NameError: global name 'logger' is not defined
in /.../stack/out/mil2-58e844bb21e38548ecb86364/std/mil2-insert-twitter-58e844bb21e38548ecb86364-stderr.txt
Adding a line with global logger to fix it.
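The NameError above is what Python raises when one function assigns logger as a local variable and a different function then looks it up as a module-level name. The sketch below illustrates that pattern and the global fix. It is an assumption about the shape of the problem, not the actual mongoBatchInsert.py code: the simplified signatures, the get_queue helper, and the logger name "insert-sketch" are all hypothetical.

```python
import logging


def get_queue(pattern):
    # Looks up `logger` as a module-level name. If go() had bound it only
    # as a local variable, this line would raise NameError.
    logger.info("RAW tweet list retrieval completed %s", pattern)
    return [pattern]


def go(pattern):
    # Without this declaration, the assignment below would create a
    # function-local `logger` that get_queue() cannot see.
    global logger
    logger = logging.getLogger("insert-sketch")
    return get_queue(pattern)


print(go("tweets-*.json"))
```

Defining the logger once at module level with logging.getLogger would avoid the need for a global statement, at the cost of configuring it before go() runs.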
When looking at the view below, new tweets are visible:
I checked the Beckett DB with the following, and didn't see them there either:
However, when I check
/mnt/stor2/stack/data/potus45-5886bdea21e38564ac1ccfd8/twitter/archive
(on Beckett), they show as processed:
Files before May 31 are backed up on S3 now, so look there for older data.