chiwanpark / flume-ng-redis

Redis extension for Flume NG
26 stars, 20 forks

Data loss when flume-ng-redis stops or restarts #13

Open lovemelovemycode opened 9 years ago

lovemelovemycode commented 9 years ago

publish ----> redis ----> flume-ng-redis (source): when the flume-ng-redis source stops for some reason, the data published during that period is lost.

chiwanpark commented 9 years ago

Currently, there is no perfect solution for this case. Flume can be configured with multiplexing, but the data will be replicated under that configuration. Once the source is implemented using the Redis list structure, we can solve this problem.

lovemelovemycode commented 9 years ago

https://github.com/fengpeiyuan/flumeng-plugins-redis
1. This example solves the data loss problem.
2. But I don't know which is faster compared to the publish/subscribe example.

chiwanpark commented 9 years ago

Yes, using the Redis list structure can solve the problem. I have just implemented a list-based version of the plugin as well. You can use it from the master branch.

With the pub/sub implementation, you can also solve the problem by using multiple subscribers. But multiple subscriptions produce duplicated records. You can deal with the duplication in the stage before the collected data is used, also known as ETL.

I think pub/sub is faster than the list, but the list structure is fast enough for common cases. Here is an article about this: https://davidmarquis.wordpress.com/2013/01/03/reliable-delivery-message-queues-with-redis/
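
The difference between the two delivery styles can be illustrated with a toy in-memory model. This is only a sketch: `ToyBroker` and its methods are hypothetical names made up for the illustration, and it is neither a Redis client nor the plugin's code. It assumes that a published message reaches only the subscribers connected at that moment, while a pushed message stays in the list until a consumer pops it.

```python
from collections import deque


class ToyBroker:
    """In-memory stand-in for the two delivery styles (hypothetical, not Redis)."""

    def __init__(self):
        self.subscribers = []  # callbacks of currently connected subscribers
        self.queue = deque()   # stands in for a Redis list

    def publish(self, message):
        # Pub/sub style: delivered only to subscribers connected right now.
        for callback in self.subscribers:
            callback(message)
        return len(self.subscribers)

    def lpush(self, message):
        # List style: stored until a consumer removes it.
        self.queue.appendleft(message)

    def rpop(self):
        # Take the oldest stored message, or None if the list is empty.
        return self.queue.pop() if self.queue else None


broker = ToyBroker()
received_pubsub = []

# The source is down: nothing is subscribed and nothing is polling.
broker.publish("event-1")
broker.lpush("event-1")

# The source comes back and subscribes again.
broker.subscribers.append(received_pubsub.append)
broker.publish("event-2")
broker.lpush("event-2")

# Drain the list after the restart.
received_list = []
while (item := broker.rpop()) is not None:
    received_list.append(item)

print(received_pubsub)  # ['event-2']
print(received_list)    # ['event-1', 'event-2']
```

In this model the pub/sub path never sees `event-1`, while the list path delivers both events after the restart.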

lovemelovemycode commented 9 years ago

I have tested the plugin using the list structure: 2M/s, 11,700 records/s. Maybe we should make it faster:

lpush ----> list named event-current ----> lpop
every 5 minutes: rename event-current to event-yyyyddMMHHmm ----> lrange + del

chiwanpark commented 9 years ago

Hi! Thanks for taking the time to test. :) I don't fully understand your suggestion, though. Do you mean that processing the events in small batches would be faster? That sounds reasonable, but in some cases sending the events immediately is better than batching. (Also, I don't think using the current time as part of the list name is a good idea. Using an atomic counter in Redis as the list number would be better.)

I'll add this feature as an option. But I'm preparing for my final exams at school now. I can probably implement this feature within 2-3 weeks. Thank you.
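
The batch rotation discussed above, with a counter in the list name instead of a timestamp, might look roughly like the sketch below. It is not the plugin's implementation. `drain_batch`, the counter key `event-batch-seq` and the `FakeRedis` class are hypothetical names for this example. `FakeRedis` is a small in-memory stand-in so the sketch runs without a server, and it mimics only the redis-py-style methods used here (`lpush`, `incr`, `exists`, `rename`, `lrange`, `delete`). The sketch also assumes a single consumer.

```python
class FakeRedis:
    """Tiny in-memory stand-in (hypothetical); supports only what the sketch needs."""

    def __init__(self):
        self.lists = {}
        self.counters = {}

    def lpush(self, name, *values):
        items = self.lists.setdefault(name, [])
        for value in values:
            items.insert(0, value)  # newest element goes to the head
        return len(items)

    def incr(self, name):
        self.counters[name] = self.counters.get(name, 0) + 1
        return self.counters[name]

    def exists(self, name):
        return int(name in self.lists)

    def rename(self, src, dst):
        self.lists[dst] = self.lists.pop(src)

    def lrange(self, name, start, end):
        items = self.lists.get(name, [])
        stop = len(items) if end == -1 else end + 1  # only -1 is handled here
        return items[start:stop]

    def delete(self, *names):
        return sum(1 for n in names if self.lists.pop(n, None) is not None)


def drain_batch(client, current="event-current", counter="event-batch-seq"):
    """Rotate the live list to a numbered name, read it, then delete it."""
    if not client.exists(current):
        return []  # nothing was pushed since the last rotation
    # INCR is atomic in real Redis, so every batch gets a unique number.
    batch_name = "event-%d" % client.incr(counter)
    client.rename(current, batch_name)  # later pushes start a fresh list
    events = client.lrange(batch_name, 0, -1)
    client.delete(batch_name)
    events.reverse()  # LPUSH stores newest first; return oldest first
    return events


client = FakeRedis()
client.lpush("event-current", "a")
client.lpush("event-current", "b")
first = drain_batch(client)

client.lpush("event-current", "c")
second = drain_batch(client)
third = drain_batch(client)

print(first, second, third)  # ['a', 'b'] ['c'] []
```

One possible benefit of this layout is that a batch stays in its numbered list if the consumer stops between the rename and the delete, so it could be picked up again later. Whether that recovery step is worth the extra bookkeeping depends on the deployment.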