Open libbyh opened 7 years ago
I think the inserter actually skips the duplicates. Could you please paste the actual error message (the last line of the traceback)?
This is from the error log of one of our projects:
Traceback (most recent call last):
File "__main__.py", line 448, in <module>
c.process_command(command)
File "/home/libbyh/github/casmlab/stack/app/controller.py", line 144, in process_command
self.restart()
File "/home/libbyh/github/casmlab/stack/app/controller.py", line 304, in restart
self.start()
File "/home/libbyh/github/casmlab/stack/app/controller.py", line 177, in start
self.run()
File "/home/libbyh/github/casmlab/stack/app/controller.py", line 317, in run
mongoBatchInsert.go(self.project_id, self.rawdir, self.insertdir, self.logdir)
File "/home/libbyh/github/casmlab/stack/app/twitter/mongoBatchInsert.py", line 214, in go
inserted_ids_list = insert_tweet_list(insert_db, tweets_list, line_number, processedTweetsFile, data_db)
File "/home/libbyh/github/casmlab/stack/app/twitter/mongoBatchInsert.py", line 67, in insert_tweet_list
inserted_ids_list = mongoCollection.insert(tweets_list, continue_on_error=True)
File "/home/libbyh/anaconda3/envs/stack/lib/python2.7/site-packages/pymongo/collection.py", line 410, in insert
_check_write_command_response(results)
File "/home/libbyh/anaconda3/envs/stack/lib/python2.7/site-packages/pymongo/helpers.py", line 198, in _check_write_command_response
raise DuplicateKeyError(error.get("errmsg"), 11000, error)
pymongo.errors.DuplicateKeyError: E11000 duplicate key error collection: potus45_5886bdea21e38564ac1ccfd8.tweets index: id_str_1 dup key: { : "931464828660715521" }
Did you create a unique index on this field?
You can get the info by using index_information() http://api.mongodb.com/python/current/api/pymongo/collection.html#pymongo.collection.Collection.index_information
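One way to check is to filter the dictionary that index_information() returns for entries flagged as unique. This is a minimal sketch, assuming the usual shape of that dictionary (index name mapped to a dict with a "key" list of (field, direction) pairs and an optional "unique" flag). The helper name unique_indexes and the sample data are hypothetical and only there for illustration.

```python
def unique_indexes(index_info):
    """Return {index_name: [field, ...]} for every index flagged unique.

    index_info is expected to look like the dict returned by
    Collection.index_information() (assumed shape, not verified
    against this project's database).
    """
    result = {}
    for name, spec in index_info.items():
        if spec.get("unique"):
            # "key" is a list of (field, direction) pairs; keep the fields.
            result[name] = [field for field, _direction in spec["key"]]
    return result


# Hypothetical sample shaped like index_information() output.
sample = {
    "_id_": {"key": [("_id", 1)]},
    "id_str_1": {"key": [("id_str", 1)], "unique": True},
}
print(unique_indexes(sample))
```

Against a live collection, the argument would be the result of calling index_information() on the tweets collection instead of the sample dict.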
Yes, we have a couple of unique indices set so that we don't keep throwing in dups.
Hi everyone, I remember adding a unique key in MongoDB to avoid duplicate tweet entries, and the duplicate tweets were removed, when I was working on this in April.
I see.
I could not test this myself, as none of our collections had a unique index defined. Could you add this right after line 77 of mongoBatchInsert.py?
except pymongo.errors.DuplicateKeyError, e:
    print "Exception during mongo insert"
    logger.warning("Duplicate error during mongo insert at or before file line number %d (%s)" % (line_number, processedTweetsFile))
    logging.exception(e)
    print traceback.format_exc()
    pass
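For context, a handler like this only takes effect inside a try block wrapped around the insert call. Here is a rough sketch of the whole pattern, under several assumptions. The wrapper name insert_skipping_duplicates is hypothetical. The exception class is passed in as an argument (in the real code it would be pymongo.errors.DuplicateKeyError) so that the sketch runs without a MongoDB server. Returning an empty list when a duplicate is reported is a placeholder choice that this thread does not settle. The two Fake classes are stand-ins for the demo only.

```python
import logging

logger = logging.getLogger("mongoBatchInsert")


def insert_skipping_duplicates(collection, docs, duplicate_error):
    """Insert docs, logging and swallowing a duplicate-key error.

    Uses the legacy insert(..., continue_on_error=True) call seen in the
    traceback. duplicate_error is the exception class to swallow.
    """
    try:
        return collection.insert(docs, continue_on_error=True)
    except duplicate_error as exc:
        # Assumption: a duplicate is not fatal, so log it and carry on.
        logger.warning("Duplicate key during mongo insert, skipping: %s", exc)
        return []


class FakeDuplicateKeyError(Exception):
    """Stand-in for pymongo.errors.DuplicateKeyError (demo only)."""


class FakeCollection(object):
    """Stand-in collection whose insert always reports a duplicate (demo only)."""

    def insert(self, docs, continue_on_error=False):
        raise FakeDuplicateKeyError("E11000 duplicate key error")


ids = insert_skipping_duplicates(
    FakeCollection(), [{"id_str": "1"}], FakeDuplicateKeyError
)
print(ids)
```

One caveat with this approach: when the exception is raised, the ids of any documents that were inserted successfully are not returned, so code that relies on inserted_ids_list would need to account for that.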
I'm not running any right now but will try to get to this before I talk to you on Monday.
Here's an example from the mil2 project:
The inserter should just gracefully skip the duplicate instead. See
/.../stack/out/mil2-58e844bb21e38548ecb86364/std/mil2-insert-twitter-58e844bb21e38548ecb86364-stderr.txt