Closed nwolff closed 9 years ago
While we will look into why these interactions are being sent again after a writer restart, it is worth mentioning that Gnip's Backfill functionality can automatically replay data which you may have missed during brief disconnections. This can lead to duplicate data being pushed through the DataSift Connector. DataSift de-duplicates an interaction that is sent twice within a rolling five-minute window, but duplicates sent outside that window are treated as new interactions and passed on to your application. Also, DataSift's Push Delivery guarantees at-least-once delivery, so it is possible that you will receive some duplicated interactions. It is worth adding a unique key constraint or some other de-duplication within your application to avoid storing the same interaction twice. Something like the Tweet ID should work as a key you can de-duplicate on.
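A minimal sketch of the unique-key approach suggested above, using SQLite's primary-key constraint so a replayed tweet is silently ignored. The table name, the helper names `open_store` and `store_once`, and the assumption that the Tweet ID lives at `interaction["twitter"]["id"]` are illustrative only; adjust the key path to match the payload your application actually receives.

```python
import json
import sqlite3


def open_store(path=":memory:"):
    """Open (or create) a store whose primary key is the Tweet ID."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS interactions ("
        "tweet_id TEXT PRIMARY KEY, payload TEXT NOT NULL)"
    )
    return conn


def store_once(conn, interaction):
    """Store an interaction unless its Tweet ID was already seen.

    Returns True if the row was inserted, False if it was a duplicate.
    """
    # Assumption: the Tweet ID is found at interaction["twitter"]["id"].
    tweet_id = str(interaction["twitter"]["id"])
    cur = conn.execute(
        "INSERT OR IGNORE INTO interactions (tweet_id, payload) VALUES (?, ?)",
        (tweet_id, json.dumps(interaction)),
    )
    conn.commit()
    # rowcount is 1 when the row was inserted and 0 when it was ignored.
    return cur.rowcount == 1
```

Because the key is the Tweet ID rather than the interaction ID, this also catches replays that arrive with a fresh interaction ID, and it works regardless of how long after the original the duplicate arrives.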
Thank you for the clarifications Jason, it helps a lot.
We've also seen this. We installed and configured the connector last Thursday morning and it worked fine. But when we restart the writer it goes back to Thursday morning.
A quick check of Kafka shows the offset isn't being persisted properly:
bin/kafka-run-class.sh kafka.tools.ConsumerOffsetChecker --group writer
writer twitter-gnip 0 7556 138176 130620 none
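To make the numbers in that output easier to read, here is a small sketch that parses the seven whitespace-separated values. It assumes the column order is group, topic, partition, committed offset, log size, lag, owner; that ordering is not shown in the output above, and `parse_offset_line` is a hypothetical helper name.

```python
def parse_offset_line(line):
    """Parse one data row of the offset checker output.

    Assumed column order (not shown in the thread):
    group, topic, partition, offset, log size, lag, owner.
    """
    group, topic, partition, offset, log_size, lag, owner = line.split()
    return {
        "group": group,
        "topic": topic,
        "partition": int(partition),
        "offset": int(offset),      # last committed consumer offset
        "log_size": int(log_size),  # latest offset in the partition
        "lag": int(lag),            # messages not yet committed as consumed
        "owner": owner,
    }


row = parse_offset_line("writer twitter-gnip 0 7556 138176 130620 none")
# Under the assumed column order, lag is log size minus committed offset.
assert row["log_size"] - row["offset"] == row["lag"]
```

Read this way, the committed offset is far behind the end of the log, which is consistent with the writer resuming from an old position after a restart.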
Thanks for the information above. We've investigated and reproduced this issue. It's correct that with certain configurations, offsets are not being committed correctly. This leads to an old offset being retrieved via an Offset Request after a restart. We're currently working towards a new tagged release, containing a new version of datasift-writer which will resolve this issue.
This has been resolved with the release of 1.0.44-1. Advice on how to handle the transition between connector instances has been included in the release notes.
https://github.com/datasift/datasift-connector/releases/tag/1.0.44-1
For version https://github.com/datasift/datasift-connector/releases/tag/1.0.15-1
the resent tweets then get sent to the Gnip managed source in DataSift and are then pushed on. As a result, I get duplicate tweets in my application (exactly the same content, but a different interaction.interaction.id).
In /var/log/datasift/writer.log I can see