cmanaha / python-elasticsearch-logger

Python Elasticsearch handler for the standard Python logging framework

Potential race condition when sending logs to elasticsearch (?) #56

Closed. zferentz closed this issue 6 years ago

zferentz commented 6 years ago

Hi,

I'm running a multi-threaded system where every thread logs heavily. Once in a while I see that all my threads are frozen, and I believe it's related to logging.

My conclusion is that the problem is related to flushing the data to ES.

Now, I can see that the buffer is protected by a lock, so there's probably no race condition there. However, I wonder whether eshelpers.bulk itself is thread-safe. My theory is that, due to a network failure (or timeout), a call to eshelpers.bulk takes a while and, since other threads continue to send logs, more and more threads end up calling eshelpers.bulk on the same ES client from multiple threads.... If the ES client is not thread-safe, I believe this could cause a deadlock.
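To make that scenario concrete, here is a minimal sketch (not taken from the handler's code) of several threads sharing one Elasticsearch client and calling helpers.bulk concurrently. The host, index name and document contents are placeholders, and it assumes a pre-7.x elasticsearch-py client that accepts a host dict:

```python
import threading
from elasticsearch import Elasticsearch, helpers

# One client shared by every thread, mirroring how the handler reuses its ES client.
es = Elasticsearch(hosts=[{"host": "localhost", "port": 9200}])  # placeholder host

def flush_batch(thread_id):
    # Each thread builds its own batch but pushes it through the *shared* client.
    actions = [
        {"_index": "test-logs",          # placeholder index
         "_type": "log",                 # doc type, for pre-7.x clusters
         "_source": {"thread": thread_id, "seq": i}}
        for i in range(500)
    ]
    helpers.bulk(es, actions)

# Many overlapping bulk calls against the same client, as described above.
threads = [threading.Thread(target=flush_batch, args=(t,)) for t in range(20)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```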

Has anyone seen anything similar ?

cmanaha commented 6 years ago

Thanks a lot for the feedback.

Here is my view on the three theories you pointed out: (a) a race condition, (b) a deadlock, (c) eshelpers.bulk not being thread-safe.

Other options could be congestion in the upload and ingestion of messages. As stated, if those operations take too long, the critical sections are held for longer, which will slow down your threads. You can play with this by:

(a) finding better values for buffer_size and flush_frequency_in_sec (see the sketch below);
(b) making sure the Elastic cluster can ingest messages fast enough; it sounds silly, but that makes a huge difference. The point other users are making about supporting ingestion nodes eventually makes total sense to me;
(c) considering a Kafka handler rather than this implementation, so that buffering and scalability happen at that level... the old trick of adding one extra layer of abstraction.
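As an illustration of point (a), here is a minimal sketch of attaching the handler with explicit buffer settings. It assumes the hosts, auth_type, es_index_name, buffer_size and flush_frequency_in_sec constructor arguments described in the project README; the host, index name and values are placeholders to tune for your cluster:

```python
import logging
from cmreslogging.handlers import CMRESHandler

# A larger buffer and less frequent flushes mean fewer, bigger bulk requests,
# so threads spend less time waiting on the flush critical section.
handler = CMRESHandler(
    hosts=[{"host": "localhost", "port": 9200}],   # placeholder host
    auth_type=CMRESHandler.AuthType.NO_AUTH,
    es_index_name="my-app-logs",                   # placeholder index
    buffer_size=2000,
    flush_frequency_in_sec=5,
)

log = logging.getLogger("my-app")
log.setLevel(logging.INFO)
log.addHandler(handler)
log.info("handler configured with a larger buffer")
```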

Finally, if you are serious about helping with this, I'd suggest creating a test case that reproduces the problem and helps others debug what the actual issue is.
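A rough sketch of what such a reproduction attempt could look like is below. It assumes the same CMRESHandler constructor arguments as above (host and index name are placeholders) and simply hammers one logger from many threads, reporting whether any of them get stuck:

```python
import logging
import threading
from cmreslogging.handlers import CMRESHandler

handler = CMRESHandler(
    hosts=[{"host": "localhost", "port": 9200}],   # placeholder host
    auth_type=CMRESHandler.AuthType.NO_AUTH,
    es_index_name="repro-logs",                    # placeholder index
)
log = logging.getLogger("repro")
log.setLevel(logging.INFO)
log.addHandler(handler)

def worker(thread_id, n_messages=10000):
    # High-volume logging from every thread, as described in the report.
    for i in range(n_messages):
        log.info("thread %s message %s", thread_id, i)

threads = [threading.Thread(target=worker, args=(t,)) for t in range(50)]
for t in threads:
    t.start()
for t in threads:
    # If the deadlock happens, join() times out for the frozen threads.
    t.join(timeout=120)

stuck = [t.name for t in threads if t.is_alive()]
print("stuck threads:", stuck if stuck else "none, all threads finished")
```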

Regards, Carlos.

zferentz commented 6 years ago

Hi Carlos, thanks for the reply. First, I did try to reproduce it locally with a dockerized ES and both multi-process and multi-threaded applications... unfortunately I couldn't. I get the deadlock only on my Kubernetes cluster, with hundreds/thousands of logs per second per thread.

Now, I am indeed planning to switch to a message bus for logging (RabbitMQ in my case), as I'll be able to better shape the traffic to ES and avoid hundreds of concurrent connections to it.
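As a stdlib-only illustration of the same decoupling idea (a stand-in for RabbitMQ, not part of this library), Python's logging.handlers.QueueHandler/QueueListener pair moves the Elasticsearch I/O onto a single listener thread, so application threads never block on the network. The handler arguments below are the same placeholders as in the earlier sketches:

```python
import logging
import queue
from logging.handlers import QueueHandler, QueueListener
from cmreslogging.handlers import CMRESHandler

# Application threads only put records on an in-memory queue (cheap, non-blocking).
log_queue = queue.Queue(-1)

es_handler = CMRESHandler(
    hosts=[{"host": "localhost", "port": 9200}],   # placeholder host
    auth_type=CMRESHandler.AuthType.NO_AUTH,
    es_index_name="my-app-logs",                   # placeholder index
)

# A single listener thread drains the queue and is the only caller of the ES handler.
listener = QueueListener(log_queue, es_handler)
listener.start()

log = logging.getLogger("my-app")
log.setLevel(logging.INFO)
log.addHandler(QueueHandler(log_queue))

log.info("this record is handed to the listener thread, not sent inline")
# listener.stop() at shutdown flushes any remaining records.
```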

Having said that, I'm still convinced that there is a race condition (my guess is that it's in the underlying layer). The reason the logs are not messed up is very simple: I believe the deadlock is related to connections stuck in CLOSE_WAIT, which means the server has closed its side of the connection. In addition, please note that the application is not "slow" as you've mentioned, but completely frozen without consuming any CPU, so it's definitely a deadlock and not a livelock.

I am still trying to debug and reproduce it, of course. If time permits, I will also try to use the (same?) lock to protect the eshelpers.bulk call, as I have no idea whether it's thread-safe. To be honest, I suspect that the thread-safety of eshelpers depends on the connection type, and since we're using the RequestsHTTPConnection I'm not sure it's thread-safe.
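A minimal sketch of that experiment, serialising every eshelpers.bulk call behind one process-wide lock by wrapping the elasticsearch helpers module; this is only an illustration of the idea, not something the library does itself:

```python
import threading
from elasticsearch import helpers as eshelpers

# One process-wide lock so only one thread at a time performs a bulk request
# through the shared client, however many handler flushes overlap.
_bulk_lock = threading.Lock()
_original_bulk = eshelpers.bulk

def _locked_bulk(client, actions, **kwargs):
    """Serialise bulk uploads; everything else about the call is unchanged."""
    with _bulk_lock:
        return _original_bulk(client, actions, **kwargs)

# Monkey-patch before the logging handler is created, so its flushes go
# through the serialised wrapper.
eshelpers.bulk = _locked_bulk
```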

Anyhow, thanks again for your comments and for this great Python library!