casmlab / stack

The BITS Lab STACK tool for social media collection and analysis.
http://bits.ischool.syr.edu/
MIT License

mongo insert stops when duplicate encountered #33

Open libbyh opened 7 years ago

libbyh commented 7 years ago

Here's an example from the mil2 project:

Traceback (most recent call last):
  File "__main__.py", line 448, in <module>
    c.process_command(command)
  File "/home/libbyh/github/casmlab/stack/app/controller.py", line 144, in process_command
    self.restart()
  File "/home/libbyh/github/casmlab/stack/app/controller.py", line 304, in restart
    self.start()
  File "/home/libbyh/github/casmlab/stack/app/controller.py", line 177, in start
    self.run()
  File "/home/libbyh/github/casmlab/stack/app/controller.py", line 317, in run
    mongoBatchInsert.go(self.project_id, self.rawdir, self.insertdir, self.logdir)
  File "/home/libbyh/github/casmlab/stack/app/twitter/mongoBatchInsert.py", line 214, in go
    inserted_ids_list = insert_tweet_list(insert_db, tweets_list, line_number, processedTweetsFile, data_db)
  File "/home/libbyh/github/casmlab/stack/app/twitter/mongoBatchInsert.py", line 67, in insert_tweet_list
    inserted_ids_list = mongoCollection.insert(tweets_list, continue_on_error=True)
  File "/home/libbyh/anaconda3/envs/stack/lib/python2.7/site-packages/pymongo/collection.py", line 410, in insert
    _check_write_command_response(results)
  File "/home/libbyh/anaconda3/envs/stack/lib/python2.7/site-packages/pymongo/helpers.py", line 198, in _check_write_command_response

Should just gracefully skip the duplicate instead

See /.../stack/out/mil2-58e844bb21e38548ecb86364/std/mil2-insert-twitter-58e844bb21e38548ecb86364-stderr.txt

stanupab commented 6 years ago

I think the inserter actually skips the duplicates. Could you please paste the actual error message (the last line of traceback)?

This is from the error log of one of our projects:

Traceback (most recent call last):
  File "__main__.py", line 448, in <module>
    c.process_command(command)
  File "/home/bits/stack/app/controller.py", line 144, in process_command
    self.restart()
  File "/home/bits/stack/app/controller.py", line 304, in restart
    self.start()
  File "/home/bits/stack/app/controller.py", line 177, in start
    self.run()
  File "/home/bits/stack/app/controller.py", line 317, in run
    mongoBatchInsert.go(self.project_id, self.rawdir, self.insertdir, self.logdir)
  File "/home/bits/stack/app/twitter/mongoBatchInsert.py", line 228, in go
    inserted_ids_list = insert_tweet_list(deleteCollection, deleted_tweets_list, line_number, processedTweetsFile, delete_db)
  File "/home/bits/stack/app/twitter/mongoBatchInsert.py", line 66, in insert_tweet_list
    inserted_ids_list = mongoCollection.insert(tweets_list, continue_on_error=True)
  File "/usr/local/lib/python2.7/dist-packages/pymongo/collection.py", line 409, in insert
    gen(), check_keys, self.uuid_subtype, client)
  File "/usr/local/lib/python2.7/dist-packages/pymongo/mongo_client.py", line 1111, in _send_message
    sock_info = self.__socket(member)
  File "/usr/local/lib/python2.7/dist-packages/pymongo/mongo_client.py", line 919, in __socket
    "%s %s" % (host_details, str(why)))
pymongo.errors.AutoReconnect: could not connect to localhost:27017: [Errno 111] Connection refused

libbyh commented 6 years ago

Traceback (most recent call last):
  File "__main__.py", line 448, in <module>
    c.process_command(command)
  File "/home/libbyh/github/casmlab/stack/app/controller.py", line 144, in process_command
    self.restart()
  File "/home/libbyh/github/casmlab/stack/app/controller.py", line 304, in restart
    self.start()
  File "/home/libbyh/github/casmlab/stack/app/controller.py", line 177, in start
    self.run()
  File "/home/libbyh/github/casmlab/stack/app/controller.py", line 317, in run
    mongoBatchInsert.go(self.project_id, self.rawdir, self.insertdir, self.logdir)
  File "/home/libbyh/github/casmlab/stack/app/twitter/mongoBatchInsert.py", line 214, in go
    inserted_ids_list = insert_tweet_list(insert_db, tweets_list, line_number, processedTweetsFile, data_db)
  File "/home/libbyh/github/casmlab/stack/app/twitter/mongoBatchInsert.py", line 67, in insert_tweet_list
    inserted_ids_list = mongoCollection.insert(tweets_list, continue_on_error=True)
  File "/home/libbyh/anaconda3/envs/stack/lib/python2.7/site-packages/pymongo/collection.py", line 410, in insert
    _check_write_command_response(results)
  File "/home/libbyh/anaconda3/envs/stack/lib/python2.7/site-packages/pymongo/helpers.py", line 198, in _check_write_command_response
    raise DuplicateKeyError(error.get("errmsg"), 11000, error)
pymongo.errors.DuplicateKeyError: E11000 duplicate key error collection: potus45_5886bdea21e38564ac1ccfd8.tweets index: id_str_1 dup key: { : "931464828660715521" }

stanupab commented 6 years ago

Did you create a unique index on this field?

You can list a collection's indexes and their options with index_information(): http://api.mongodb.com/python/current/api/pymongo/collection.html#pymongo.collection.Collection.index_information
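
A minimal sketch of that check, assuming the dict shape that index_information() returns (index name mapped to a dict of options, with "unique": True present only on unique indexes). The helper name unique_indexes and the sample dict are made up for illustration; against a live database the dict would come from the collection itself rather than being written by hand.

```python
def unique_indexes(index_info):
    """Return the sorted names of the indexes flagged as unique.

    index_info is a dict shaped like the return value of pymongo's
    Collection.index_information(): index name -> dict of index options.
    """
    return sorted(
        name for name, spec in index_info.items() if spec.get("unique")
    )


# Hand-written sample that mirrors that shape (not real output).
# In real use this would be something like db.tweets.index_information().
sample_info = {
    "_id_": {"v": 1, "key": [("_id", 1)]},
    "id_str_1": {"v": 1, "key": [("id_str", 1)], "unique": True},
}

print(unique_indexes(sample_info))  # prints ['id_str_1']
```

If id_str_1 shows up in that list, a second insert of the same id_str is expected to raise DuplicateKeyError, which matches the E11000 message in the traceback above.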

libbyh commented 6 years ago

Yes, we have a couple of unique indices set so that we don't keep throwing in dups.

pratik27shah commented 6 years ago

Hi everyone, I remember adding a unique key in MongoDB to avoid duplicate tweet entries, and the existing duplicate tweets were removed, when I was working on this in April.

stanupab commented 6 years ago

I see.

I could not test this myself, as none of our collections had a unique index defined. Could you add this right after line 77 of mongoBatchInsert.py?

except pymongo.errors.DuplicateKeyError, e:
    print "Exception during mongo insert"
    logger.warning("Duplicate error during mongo insert at or before file line number %d (%s)" % (line_number, processedTweetsFile))
    logging.exception(e)
    print traceback.format_exc()
    pass
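
For comparison, here is a rough sketch of another way to "gracefully skip the duplicate": insert documents one at a time and catch the duplicate error per document, so it is clear exactly which tweets were skipped. This is not what mongoBatchInsert.py does. The function name insert_skipping_duplicates and the FakeCollection / FakeDuplicateKeyError stand-ins are hypothetical and exist only so the example runs without a MongoDB server. In real use one would presumably pass the collection's single-document insert method and pymongo.errors.DuplicateKeyError instead.

```python
def insert_skipping_duplicates(insert_one, docs, duplicate_error):
    """Insert docs one by one, skipping any that raise duplicate_error.

    insert_one      -- callable that inserts a single document, returns its id
    docs            -- iterable of documents (dicts)
    duplicate_error -- exception class that signals a duplicate key
    Returns (inserted_ids, skipped_docs).
    """
    inserted, skipped = [], []
    for doc in docs:
        try:
            inserted.append(insert_one(doc))
        except duplicate_error:
            # Only duplicates are swallowed; any other error still propagates.
            skipped.append(doc)
    return inserted, skipped


class FakeDuplicateKeyError(Exception):
    """Stand-in for pymongo.errors.DuplicateKeyError (illustration only)."""


class FakeCollection(object):
    """Stand-in for a collection with a unique index on id_str."""

    def __init__(self):
        self.seen = set()

    def insert(self, doc):
        if doc["id_str"] in self.seen:
            raise FakeDuplicateKeyError("E11000 duplicate key: %s" % doc["id_str"])
        self.seen.add(doc["id_str"])
        return doc["id_str"]


tweets = [{"id_str": "1"}, {"id_str": "2"}, {"id_str": "1"}, {"id_str": "3"}]
coll = FakeCollection()
inserted, skipped = insert_skipping_duplicates(
    coll.insert, tweets, FakeDuplicateKeyError
)
print(inserted)  # prints ['1', '2', '3']
print(skipped)   # prints [{'id_str': '1'}]
```

The trade-off is one round trip per tweet instead of one per batch, so this is likely slower on large files. Catching the exception around the batch insert, as suggested above, keeps the batch speed but does not report which documents were skipped.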

libbyh commented 6 years ago

I'm not running any right now but will try to get to this before I talk to you on Monday.