MeltwaterArchive / datasift-connector

A set of components designed to retrieve data from third-party APIs and storage systems, and to pass that data into a DataSift account.
http://datasift.com/
MIT License

When I restart the datasift-writer it resends items it already sent #46

Closed nwolff closed 9 years ago

nwolff commented 9 years ago

For version https://github.com/datasift/datasift-connector/releases/tag/1.0.15-1

The resent tweets are then delivered to the Gnip managed source in DataSift and pushed on from there. As a result I get duplicate tweets in my application: exactly the same content, but a different `interaction.interaction.id`.

In /var/log/datasift/writer.log I can see

 2015-08-11 18:08:56,059  INFO com.dat.con.DataSiftWriter:170 - Initialising Kafka consumer manager
 2015-08-11 18:08:56,066  INFO com.dat.con.wri.SimpleConsumerManager:377 - Consumer connecting to zookeeper instance at localhost:2181
 2015-08-11 18:08:56,125  INFO org.I0I.zkc.ZkEventThread:64 - Starting ZkClient event thread.
 2015-08-11 18:08:56,374  WARN org.apa.zoo.ClientCnxnSocket:139 - Connected to an old server; r-o mode will be unavailable
 2015-08-11 18:08:56,376  INFO org.I0I.zkc.ZkClient:449 - zookeeper state changed (SyncConnected)
 2015-08-11 18:08:56,377  INFO com.dat.con.wri.SimpleConsumerManager:467 - Consumer looking up leader for twitter-gnip, 0 at localhost:6667
 2015-08-11 18:08:57,371  INFO com.dat.con.wri.SimpleConsumerManager:367 - Consumer is connecting to lead broker <redacted>:6667 under client id Client_twitter-gnip_0
 2015-08-11 18:08:57,482  INFO com.dat.con.wri.SimpleConsumerManager:370 - Consumer is going to being reading from offset 0
 2015-08-11 18:08:57,482  INFO com.dat.con.DataSiftWriter:173 - Initialising bulk uploads
dugjason commented 9 years ago

While we will certainly look into why these interactions are being sent again after a writer restart, it is worth mentioning that Gnip's Backfill functionality can automatically replay data which you may have missed during brief disconnections. This can certainly lead to duplicate data being pushed through the DataSift Connector.

While DataSift can de-duplicate if the same interaction is sent twice within a rolling five-minute window, a duplicate interaction sent outside of that window will be considered a new interaction and passed on to your application. Also, DataSift's Push Delivery guarantees delivery at least once, so it is possible that you will receive some duplicated interactions.

It is certainly worth adding some kind of unique key constraint or de-duplication within your application to prevent yourself from storing the same interaction twice. Something like the Tweet ID should work as a key you can de-duplicate on.
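As a minimal sketch of the de-duplication suggested above, the consuming application can track the tweet IDs it has already stored and skip any interaction whose ID was seen before. The class and method names below are hypothetical, not part of the DataSift Connector; a production system would more likely use a database unique-key constraint so the seen set survives restarts.

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical application-side filter: de-duplicate on Tweet ID
// before persisting an interaction.
public class TweetDeduplicator {
    private final Set<String> seenIds = new HashSet<>();

    // Returns true if this tweet ID has not been seen yet and the
    // interaction should be stored; false if it is a duplicate.
    // Set.add() returns false when the element was already present.
    public boolean shouldStore(String tweetId) {
        return seenIds.add(tweetId);
    }
}
```

With this in place, replaying the same tweet through the connector twice results in a single stored copy, even though each replay carries a fresh `interaction.interaction.id`.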

nwolff commented 9 years ago

Thank you for the clarifications Jason, it helps a lot.

drstrangelug commented 9 years ago

We've also seen this. We installed and configured the connector last Thursday morning and it worked fine, but whenever we restart the writer it goes back to re-reading data from Thursday morning.

A quick check with Kafka's offset checker shows the consumer offset isn't being persisted properly:

 bin/kafka-run-class.sh kafka.tools.ConsumerOffsetChecker --group writer

 Group   Topic         Pid  Offset  logSize  Lag     Owner
 writer  twitter-gnip  0    7556    138176   130620  none

AndyJS commented 9 years ago

Thanks for the information above. We've investigated and reproduced this issue. It's correct that with certain configurations, offsets are not being committed correctly, which leads to an old offset being retrieved via an Offset Request after a restart. We're currently working towards a new tagged release of datasift-writer which will resolve this issue.
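The failure mode described above can be illustrated with a small sketch: if the writer commits the last processed offset after each successful upload, a restart resumes where it left off; if the commit never happens, a restart falls back to an old (or zero) offset and resends everything. The class below is purely illustrative, using a file in place of Kafka/ZooKeeper offset storage, and is not the connector's actual implementation.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical offset store: persists the next offset to read so a
// restarted consumer resumes instead of replaying from the beginning.
public class OffsetStore {
    private final Path path;

    public OffsetStore(Path path) {
        this.path = path;
    }

    // Commit the next offset to read, after a successful upload.
    public void commit(long offset) {
        try {
            Files.write(path, Long.toString(offset).getBytes(StandardCharsets.UTF_8));
        } catch (IOException e) {
            throw new RuntimeException("offset commit failed", e);
        }
    }

    // On startup: resume from the committed offset, or 0 if nothing
    // was ever committed (the bug reported in this issue behaves as
    // if this branch were always taken).
    public long resumeOffset() {
        try {
            if (!Files.exists(path)) {
                return 0L;
            }
            return Long.parseLong(
                new String(Files.readAllBytes(path), StandardCharsets.UTF_8).trim());
        } catch (IOException e) {
            throw new RuntimeException("offset read failed", e);
        }
    }
}
```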

AndyJS commented 9 years ago

This has been resolved with the release of 1.0.44-1. Advice on how to handle the transition between connector instances has been included in the release notes.

https://github.com/datasift/datasift-connector/releases/tag/1.0.44-1