Thanks a lot for the feedback.
Here is my view on the 3 theories that you pointed out: (a) race condition (b) deadlock (c) eshelpers.bulk not being thread-safe
Another possibility is congestion when uploading and ingesting the messages: as stated, if those operations take too long, the critical sections are held for longer and all of your threads slow down. You can play with this by:
(a) finding better values for buffer_size and flush_frequency_in_sec (see the configuration sketch just after this list);
(b) making sure the Elasticsearch cluster can ingest messages fast enough; it sounds silly, but it makes a huge difference. The point other users are making about supporting ingestion nodes eventually makes total sense to me;
(c) considering a Kafka handler rather than this implementation, so that buffering and scalability happen at that level... the old trick of adding one extra layer of abstraction.
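For (a), a minimal configuration sketch, assuming the handler in question is this project's CMRESHandler (the parameter names other than buffer_size and flush_frequency_in_sec are from memory, so double-check them against your version):

```python
import logging

from cmreslogging.handlers import CMRESHandler  # assumption: the handler discussed here

handler = CMRESHandler(
    hosts=[{"host": "localhost", "port": 9200}],
    auth_type=CMRESHandler.AuthType.NO_AUTH,
    es_index_name="app-logs",
    buffer_size=2000,             # flush once this many records are buffered...
    flush_frequency_in_sec=5,     # ...or after this many seconds, whichever comes first
)

logger = logging.getLogger("app")
logger.setLevel(logging.INFO)
logger.addHandler(handler)
```

Larger buffers and a lower flush frequency mean fewer, bigger bulk requests, which usually puts less pressure on the critical section.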
Finally, if you are serious about helping with this, I'd suggest creating a test case that reproduces the problem and helps others debug what the actual problem is.
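For example, something along these lines could be a starting point (just a sketch: the handler class and its constructor arguments are my assumption of how this project's handler is set up, so adapt them to the exact configuration that hangs for you):

```python
import logging
import threading

# Assumption: the handler under test is this project's CMRESHandler.
from cmreslogging.handlers import CMRESHandler


def build_logger():
    handler = CMRESHandler(
        hosts=[{"host": "localhost", "port": 9200}],
        auth_type=CMRESHandler.AuthType.NO_AUTH,
        es_index_name="deadlock-repro",
        buffer_size=1000,
        flush_frequency_in_sec=1,
    )
    logger = logging.getLogger("repro")
    logger.setLevel(logging.INFO)
    logger.addHandler(handler)
    return logger


def hammer(logger, messages=100000):
    # Each worker logs as fast as it can to force frequent concurrent flushes.
    for i in range(messages):
        logger.info("thread=%s msg=%d", threading.current_thread().name, i)


if __name__ == "__main__":
    logger = build_logger()
    workers = [threading.Thread(target=hammer, args=(logger,), name="w%d" % i)
               for i in range(50)]
    for t in workers:
        t.start()
    for t in workers:
        t.join(timeout=600)
        if t.is_alive():
            print("possible deadlock: %s is still running after 10 minutes" % t.name)
```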
Regards, Carlos.
Hi Carlos, thanks for the reply. First, I did try to reproduce it locally with a dockerized ES and both multi-process and multi-threaded applications... unfortunately I couldn't. I only get the deadlock on my Kubernetes cluster, with hundreds/thousands of logs per second per thread.
Now, indeed I'm planning to switch to a message bus for logging (RabbitMQ in my case), as that will let me shape the traffic to ES better and avoid hundreds of concurrent connections to ES.
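In case it's useful to anyone else, this is roughly the shape I have in mind (only a sketch: the handler name, queue name and serialization are placeholders, and pika is just one way to talk to RabbitMQ):

```python
import json
import logging

import pika  # assumption: plain pika client against RabbitMQ


class RabbitMQLogHandler(logging.Handler):
    """Hypothetical handler: publish each record to a RabbitMQ queue.

    A separate consumer reads the queue, batches records and bulk-indexes
    them into ES, so only a handful of connections ever hit the cluster.
    """

    def __init__(self, host="localhost", queue="app-logs"):
        super(RabbitMQLogHandler, self).__init__()
        self.queue = queue
        self.connection = pika.BlockingConnection(pika.ConnectionParameters(host=host))
        self.channel = self.connection.channel()
        self.channel.queue_declare(queue=queue, durable=True)

    def emit(self, record):
        # logging.Handler.handle() already serializes emit() per handler,
        # which matters here because pika connections are not thread-safe.
        body = json.dumps({"level": record.levelname, "message": self.format(record)})
        self.channel.basic_publish(exchange="", routing_key=self.queue, body=body)
```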
Having said that, I'm still convinced that there is a race condition (my guess is in the underlying layer). The reason the logs are not messed up is very simple: I believe the deadlock is related to connections stuck in CLOSE_WAIT, which means the server side closed them. In addition, please note that the application is not "slow" as you mentioned, but completely frozen without consuming any CPU, so it's definitely a deadlock and not a livelock.
I am still trying to debug and reproduce it, of course. If time permits I will also try to use the (same?) lock to protect the eshelpers.bulk call, as I have no idea whether it is thread-safe. To be honest, I suspect that the thread-safety of eshelpers.bulk depends on the connection class, and since we're using RequestsHttpConnection I'm not sure it is thread-safe.
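The experiment would look roughly like this (a sketch only; it assumes the handler's flush() is the method that ends up calling eshelpers.bulk, which is worth verifying in the source first):

```python
import threading

# Assumption: CMRESHandler is this project's handler and its flush()
# is where eshelpers.bulk() gets called.
from cmreslogging.handlers import CMRESHandler

_bulk_lock = threading.Lock()  # one process-wide lock around the bulk upload


class SerializedFlushHandler(CMRESHandler):
    def flush(self):
        # Only one thread at a time may drive the shared ES client / requests
        # session, no matter how many threads triggered a flush.
        with _bulk_lock:
            super(SerializedFlushHandler, self).flush()
```

If the freezes disappear with this in place, that would point strongly at the shared client not being safe for concurrent bulk calls.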
Anyhow, thanks again for your comments and the great Python library!
Hi,
I'm running a multi-threaded system where every thread logs a lot. Once in a while I see that all my threads are frozen, and I believe it's related to logging:
Hence, my conclusion is that the problem is related to flushing the data to ES.
Now, I can see that the buffer is protected by a lock, so there's probably no race condition there. However, I wonder whether eshelpers.bulk is thread-safe. My theory is that, due to a network failure (or timeout), a call to eshelpers.bulk takes a while and, since other threads keep sending logs, more and more threads end up calling eshelpers.bulk, all using the same ES client from multiple threads... If the ES client is not thread-safe, I believe this could cause a deadlock. Has anyone seen anything similar?
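In case someone wants to check their own frozen process, one low-effort way to see where the threads are stuck is to dump every thread's Python stack from inside the process (standard library only, nothing specific to this handler):

```python
import faulthandler
import signal
import sys

# Dump the Python stack of every thread when the process receives SIGUSR1,
# e.g. `kill -USR1 <pid>` inside the container/pod once the freeze happens.
faulthandler.register(signal.SIGUSR1, file=sys.stderr, all_threads=True)

# Or, as a crude watchdog, dump all stacks every 300 seconds until cancelled
# with faulthandler.cancel_dump_traceback_later().
faulthandler.dump_traceback_later(300, repeat=True, file=sys.stderr)
```

If the dumps show all threads waiting inside the handler's flush / bulk call, that would back up the deadlock theory above.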