cmanaha / python-elasticsearch-logger

Python Elasticsearch handler for the standard Python logging framework

Potential race condition when sending logs to elasticsearch (?) #56

Closed. zferentz closed this issue 6 years ago

zferentz commented 6 years ago

Hi,

I'm running a multi-threaded system where every thread logs heavily. Once in a while I see that all my threads are frozen, and I believe it's related to logging.

My conclusion is that the problem is related to flushing the data to ES.

Now, I can see that the buffer is protected by a lock, so there's probably no race condition there. However, I wonder whether eshelpers.bulk itself is thread-safe. My theory is that, due to a network failure (or timeout), a call to eshelpers.bulk takes a while and, since other threads continue to send logs, more and more threads end up calling eshelpers.bulk on the same ES client from multiple threads.... If the ES client is not thread-safe, I believe this could cause a deadlock.
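To make that scenario concrete, here is a minimal sketch (not taken from the handler's code) of several threads sharing one Elasticsearch client and calling helpers.bulk concurrently. The host, index name and document contents are placeholders, and it assumes a pre-7.x elasticsearch-py client that accepts a host dict:

```python
import threading
from elasticsearch import Elasticsearch, helpers

# One client shared by every thread, mirroring how the handler reuses its ES client.
es = Elasticsearch(hosts=[{"host": "localhost", "port": 9200}])  # placeholder host

def flush_batch(thread_id):
    # Each thread builds its own batch but pushes it through the *shared* client.
    actions = [
        {"_index": "test-logs",          # placeholder index
         "_type": "log",                 # doc type, for pre-7.x clusters
         "_source": {"thread": thread_id, "seq": i}}
        for i in range(500)
    ]
    helpers.bulk(es, actions)

# Many overlapping bulk calls against the same client, as described above.
threads = [threading.Thread(target=flush_batch, args=(t,)) for t in range(20)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```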

Has anyone seen anything similar ?

cmanaha commented 6 years ago

Thanks a lot for the feedback.

Here is my view on the three theories you pointed out: (a) a race condition, (b) a deadlock, (c) eshelpers.bulk not being thread-safe.

Other options could be congestion in the upload and ingestion of messages. As stated, if those operations take too long, the critical sections are held for longer, which will slow down your threads. You can play with this by:

(a) finding better values for buffer_size and flush_frequency_in_sec (see the sketch below);
(b) making sure the Elastic cluster can ingest messages fast enough; it sounds silly, but that makes a huge difference. The point other users are making about supporting ingestion nodes eventually makes total sense to me;
(c) considering a Kafka handler rather than this implementation, so that buffering and scalability happen at that level... the old trick of adding one extra layer of abstraction.
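As an illustration of point (a), here is a minimal sketch of attaching the handler with explicit buffer settings. It assumes the hosts, auth_type, es_index_name, buffer_size and flush_frequency_in_sec constructor arguments described in the project README; the host, index name and values are placeholders to tune for your cluster:

```python
import logging
from cmreslogging.handlers import CMRESHandler

# A larger buffer and less frequent flushes mean fewer, bigger bulk requests,
# so threads spend less time waiting on the flush critical section.
handler = CMRESHandler(
    hosts=[{"host": "localhost", "port": 9200}],   # placeholder host
    auth_type=CMRESHandler.AuthType.NO_AUTH,
    es_index_name="my-app-logs",                   # placeholder index
    buffer_size=2000,
    flush_frequency_in_sec=5,
)

log = logging.getLogger("my-app")
log.setLevel(logging.INFO)
log.addHandler(handler)
log.info("handler configured with a larger buffer")
```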

Finally, if you are serious about helping with this, I'd suggest creating a test case that reproduces the problem and helps others debug what the actual issue is.
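A rough sketch of what such a reproduction attempt could look like is below. It assumes the same CMRESHandler constructor arguments as above (host and index name are placeholders) and simply hammers one logger from many threads, reporting whether any of them get stuck:

```python
import logging
import threading
from cmreslogging.handlers import CMRESHandler

handler = CMRESHandler(
    hosts=[{"host": "localhost", "port": 9200}],   # placeholder host
    auth_type=CMRESHandler.AuthType.NO_AUTH,
    es_index_name="repro-logs",                    # placeholder index
)
log = logging.getLogger("repro")
log.setLevel(logging.INFO)
log.addHandler(handler)

def worker(thread_id, n_messages=10000):
    # High-volume logging from every thread, as described in the report.
    for i in range(n_messages):
        log.info("thread %s message %s", thread_id, i)

threads = [threading.Thread(target=worker, args=(t,)) for t in range(50)]
for t in threads:
    t.start()
for t in threads:
    # If the deadlock happens, join() times out for the frozen threads.
    t.join(timeout=120)

stuck = [t.name for t in threads if t.is_alive()]
print("stuck threads:", stuck if stuck else "none, all threads finished")
```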

Regards, Carlos.

zferentz commented 6 years ago

Hi Carlos, thanks for the reply. First, I did try to reproduce it locally with a dockerized ES and both multi-process and multi-threaded applications... unfortunately I couldn't. I get the deadlock only on my Kubernetes cluster, with hundreds/thousands of logs per second per thread.

Now, I am indeed planning to switch to a message bus for logging (RabbitMQ in my case), as I'll be able to better shape the traffic to ES and avoid hundreds of concurrent connections to it.
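As a stdlib-only illustration of the same decoupling idea (a stand-in for RabbitMQ, not part of this library), Python's logging.handlers.QueueHandler/QueueListener pair moves the Elasticsearch I/O onto a single listener thread, so application threads never block on the network. The handler arguments below are the same placeholders as in the earlier sketches:

```python
import logging
import queue
from logging.handlers import QueueHandler, QueueListener
from cmreslogging.handlers import CMRESHandler

# Application threads only put records on an in-memory queue (cheap, non-blocking).
log_queue = queue.Queue(-1)

es_handler = CMRESHandler(
    hosts=[{"host": "localhost", "port": 9200}],   # placeholder host
    auth_type=CMRESHandler.AuthType.NO_AUTH,
    es_index_name="my-app-logs",                   # placeholder index
)

# A single listener thread drains the queue and is the only caller of the ES handler.
listener = QueueListener(log_queue, es_handler)
listener.start()

log = logging.getLogger("my-app")
log.setLevel(logging.INFO)
log.addHandler(QueueHandler(log_queue))

log.info("this record is handed to the listener thread, not sent inline")
# listener.stop() at shutdown flushes any remaining records.
```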

Having said that, I'm still convinced that there is a race condition (my guess is that it's in the underlying layer). The reason the logs are not messed up is very simple: I believe the deadlock is related to connections stuck in CLOSE_WAIT, which means the server has closed its side of the connection. In addition, please note that the application is not "slow" as you've mentioned, but completely frozen without consuming any CPU, so it's definitely a deadlock and not a livelock.

I am still trying to debug and reproduce it, of course. If time permits, I will also try to use the (same?) lock to protect the eshelpers.bulk call, as I have no idea whether it's thread-safe. To be honest, I suspect that the thread-safety of eshelpers depends on the connection type, and since we're using the RequestsHTTPConnection I'm not sure it's thread-safe.
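A minimal sketch of that experiment, serialising every eshelpers.bulk call behind one process-wide lock by wrapping the elasticsearch helpers module; this is only an illustration of the idea, not something the library does itself:

```python
import threading
from elasticsearch import helpers as eshelpers

# One process-wide lock so only one thread at a time performs a bulk request
# through the shared client, however many handler flushes overlap.
_bulk_lock = threading.Lock()
_original_bulk = eshelpers.bulk

def _locked_bulk(client, actions, **kwargs):
    """Serialise bulk uploads; everything else about the call is unchanged."""
    with _bulk_lock:
        return _original_bulk(client, actions, **kwargs)

# Monkey-patch before the logging handler is created, so its flushes go
# through the serialised wrapper.
eshelpers.bulk = _locked_bulk
```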

Anyhow, thanks again for your comments and for this great Python library!