fmadio / pcap2json

High Speed PCAP to JSON conversion utility

Use ES round-robin scheduling #9

Closed fmadio closed 4 years ago

fmadio commented 5 years ago

Using a round-robin scheduler for the ES push allows better ES utilization and adds redundancy should a single ES node fail.
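
For illustration, a minimal sketch of what round-robin target selection could look like (hypothetical names, not the actual pcap2json structures):

```c
#include <stdatomic.h>

typedef struct
{
    const char* HostList[8];          // configured ES nodes
    int         HostCnt;              // number of configured nodes
    atomic_uint HostPos;              // monotonically increasing pick counter
} ESCluster_t;

// returns the next host in round-robin order; safe across multiple workers
static const char* ESCluster_NextHost(ESCluster_t* C)
{
    unsigned int Pos = atomic_fetch_add(&C->HostPos, 1);
    return C->HostList[Pos % C->HostCnt];
}
```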

fmadio commented 5 years ago

(from Nanji)

I have gone through the code and I believe I found the root cause.

"IsReady"

This is set to "false" at init time.

From "Output_BufferAdd" function, once we have buffer to add then it look for the appropriate buffer by "put" index and also check "IsReady" must be set to "false". (Otherwise, it just keeps waiting)

https://github.com/fmadio/pcap2json/blob/287310555d9fc400eb4922a8e9bd14b5499f72bd/output.c#L906

Once it finds the slot, it copies the buffer in and sets "IsReady" to "true":

https://github.com/fmadio/pcap2json/blob/287310555d9fc400eb4922a8e9bd14b5499f72bd/output.c#L916

Later, once the output worker has consumed the buffer and sent it to ES, it finally sets "IsReady" back to "false" after the send completes:

https://github.com/fmadio/pcap2json/blob/287310555d9fc400eb4922a8e9bd14b5499f72bd/output.c#L760

Now consider the case where one of the nodes is down, not responding, or responding slowly: "BulkUpload" can fail, for example in the connect-failure case: https://github.com/fmadio/pcap2json/blob/287310555d9fc400eb4922a8e9bd14b5499f72bd/output.c#L528

So, "IsReady" is never getting set to "false" for that buffer and I guess "Output_BufferAdd" will keep waiting forever for that buffer (Since it's circular buffer concept, it will reach to the stuck point)

I am thinking of the following solutions:

Easy solution: in case of failure in "BulkUpload", release the buffer regardless of failure/success: set "IsReady" to false and update BufferFin and the other (error) fields as required. Additionally, we can mark that host as "not working" or similar so it is skipped next time (in the connect-failure case, I guess). A sketch follows below.
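
Roughly, applied to the worker path sketched above (the Host_t type and IsAlive flag are hypothetical, not in the current code):

```c
typedef struct
{
    const char* Name;
    bool        IsAlive;              // hypothetical per-node health flag
} Host_t;

static int BulkUpload(Host_t* Host, Buffer_t* B);   // prototype is illustrative

static void WorkerSend_Fixed(Buffer_t* B, Host_t* Host)
{
    if (BulkUpload(Host, B) < 0)
        Host->IsAlive = false;        // skip this node on the next pick
    B->IsReady = false;               // always release the slot, success or not
}
```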

Better solution: the buffers are defined as an array and we treat them in strict order, which is not ideal because the time required to send each buffer will vary. So even with the fix above we may see a little starvation (other buffers may already be free while the one we are waiting on is still in flight). It will not fail outright, just wait unnecessarily. A better way is to use separate linked lists.

Say, for example: "Output_BufferAdd" dequeues buffers from a "free list" and puts them onto a second "available list"; "Output_Worker" dequeues buffers from the "available list" and puts them onto a third "processing list"; "BulkUpload" processes buffers from the "processing list" and puts them back onto the "free list".
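
A minimal sketch of that three-list scheme (all names hypothetical; each list's Lock must be initialised with PTHREAD_MUTEX_INITIALIZER or pthread_mutex_init):

```c
#include <pthread.h>
#include <stddef.h>

typedef struct Buffer_t
{
    struct Buffer_t* Next;
    // ... payload fields ...
} Buffer_t;

typedef struct
{
    Buffer_t*       Head;
    Buffer_t*       Tail;
    pthread_mutex_t Lock;
} BufferList_t;

static void List_Push(BufferList_t* L, Buffer_t* B)
{
    pthread_mutex_lock(&L->Lock);
    B->Next = NULL;
    if (L->Tail) L->Tail->Next = B;
    else         L->Head       = B;
    L->Tail = B;
    pthread_mutex_unlock(&L->Lock);
}

static Buffer_t* List_Pop(BufferList_t* L)
{
    pthread_mutex_lock(&L->Lock);
    Buffer_t* B = L->Head;
    if (B)
    {
        L->Head = B->Next;
        if (L->Head == NULL) L->Tail = NULL;
    }
    pthread_mutex_unlock(&L->Lock);
    return B;                         // NULL when the list is empty
}

// Output_BufferAdd : B = List_Pop(&FreeList);  fill B;  List_Push(&AvailList, B);
// Output_Worker    : B = List_Pop(&AvailList); List_Push(&ProcList, B);
// after BulkUpload : List_Push(&FreeList, B);  // returned regardless of success
```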

But there should be a large enough number of buffers (16k, I guess) that the starvation case does not happen easily!

navinsaven commented 5 years ago

Can we follow something like this for handling node failures? https://elasticsearch-py.readthedocs.io/en/master/connection.html#connection-pool
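
Something like that ConnectionPool behaviour (mark a failing node dead, skip it for a timeout, then retry it) could look roughly like this in C; all names here are hypothetical:

```c
#include <stdbool.h>
#include <time.h>

typedef struct
{
    const char* Host;
    bool        IsDead;
    time_t      RetryAt;              // when to try the node again
} ESNode_t;

#define DEAD_TIMEOUT 60               // seconds to skip a failed node

static void Node_MarkDead(ESNode_t* N)
{
    N->IsDead  = true;
    N->RetryAt = time(NULL) + DEAD_TIMEOUT;
}

// round-robin over live nodes, resurrecting any whose timeout has expired
static ESNode_t* Pool_Next(ESNode_t* Nodes, int Cnt, unsigned int* Pos)
{
    for (int i = 0; i < Cnt; i++)
    {
        ESNode_t* N = &Nodes[(*Pos)++ % Cnt];
        if (N->IsDead && time(NULL) >= N->RetryAt) N->IsDead = false;
        if (!N->IsDead) return N;
    }
    return NULL;                      // every node is currently dead
}
```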

fmadio commented 5 years ago

Yes, that's a good idea; need to confirm persistent HTTP connections work first.

fmadio commented 4 years ago

Persistent connections tracked on #22

fmadio commented 4 years ago

Merged, but keeping it open until the customer has had a chance to test it out.

fmadio commented 4 years ago

seems ok, closing