What is your configuration? Also, the full logs from start to finish would help; it looks like you only uploaded snippets.
Regarding Redis: now that Log Courier connects to multiple indexers, it generally doesn't need this kind of buffer. I have a cluster at the moment where hundreds of machines send logs into a couple of balanced indexers. The protocol works really well at avoiding timeouts and maximising resource usage (I run Logstash at almost 100% for several parts of the day). There may be other use cases for Redis though, and I'm happy to hear them - but currently I don't plan to add it, as it moves courier away from its current guarantee of at-least-once delivery, since Redis is not really meant for non-volatile storage.
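Concretely, it's just a matter of listing every indexer in the network section and letting courier spread the load itself - something like this (hostnames made up; courier's JSON config accepts # comments, so I've annotated inline):

```
{
    "network": {
        # All balanced indexers - courier handles distribution across them
        "servers": [
            "indexer1.example.com:5043",
            "indexer2.example.com:5043",
            "indexer3.example.com:5043"
        ]
    }
}
```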
I've already attached config files and logs in the first post :)
I see 3 major advantages of redis output in log-courier:
I agree that Redis is a volatile 'method of data storage' :) You can avoid data loss by using clustered Redis, or partially by doing dumps. On the other hand, Logstash also has an internal cache (so we can lose some data when the process fails), and in the end Elasticsearch doesn't guarantee that 100% of our data will be kept forever.
The attached log files in the pastebins are incomplete. The Log Courier one, for example, is only 35 lines, so a lot of information about when things were queued, what timed out, and why is missing.
I only just noticed the config. I'm sorry!
Your network timeout is too low. It breaks the protocol (this is my fault - sorry - it could be handled better). The 2 second timeout expires before Logstash can finish processing the payload. The plugin on the Logstash side is being rewritten as we speak, but currently it only sends a keep-alive every 8 seconds, so the timeout in Log Courier really must be at least 12 seconds - there is little reason to change it from the default of 15 though, so maybe just remove that setting.
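In other words, the network section should look roughly like this (hostname made up):

```
{
    "network": {
        "servers": [ "logstash.example.com:5043" ],
        # Leave this at the 15 second default, or omit the key entirely.
        # Anything under ~12s races the plugin's 8-second keep-alive.
        "timeout": 15
    }
}
```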
I've increased the timeout to 30s and... it works :) But there seem to be some performance problems... logs from one heavily written file (50-55 lines/s) are shipped with a lag. In lc-admin status I see that it is 75% complete.
Yes, if logs come in faster than Logstash can process them, it will never reach the end of the file. Maybe you need more Logstash instances, or some tuning on the Logstash side to increase throughput.
I'm also considering a redis implementation. I was also preparing the code layout to maybe write a bridge-courier of some sort, which simply receives events and forwards them. It's something that simplifies part of my setup anyway, and it would also mean a single point to update and reload with new Logstash instances. Kind of like a mini-ELB at the application level.
Both might be useful. Though I'm constrained on time with other personal things at the moment, and really need to finish the latest version of log-courier/fact-courier first!
I separated Logstash and Elasticsearch, and they are now on different nodes (3 ES nodes, 6 Logstash nodes), and it works OK. After heavy tests I noticed that Redis caused delays for some of the logs (5-15%).

There was also a problem with HA and clustering: Logstash doesn't write to Redis in a round-robin fashion (I had 6 non-clustered Redis instances). It connects to a random host, so most of the time one or two Redis processes were under heavy load while the other 4 did nothing. The Logstash Redis input requires host to be a single string, so I had to multiply the inputs, one per host (not a problem when the config is handled by puppet; see the sketch below), but the problem begins when the caching instances write to only 1 or 2 Redis instances. Logstash doesn't read faster from the two loaded Redis instances when it sees that the other 4 have no events to process.

So I decided to use RabbitMQ. After migrating to a rabbit cluster it works OK, very smoothly. Logs show up in Kibana much quicker, but there seem to be some performance problems in the Logstash output plugin: after a simulated cluster failure, logs are shipped by log-courier a bit slowly. Indexing and writing events to ES is not a problem, because I can run more indexer/writer instances and simply increase cluster performance. The problem is on the input side. A simple tool which gets events from log-courier and passes them to Redis/RabbitMQ would be nice; Logstash is too heavyweight for such a use case. Tomorrow I'll run stress tests (all apps under heavy load, logging at debug) and measure the performance and stability of the whole cluster.
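For reference, the multiplied inputs looked roughly like this (hostnames are placeholders; the real values come from puppet):

```
input {
  # The redis input only accepts a single string for host,
  # hence one block per instance - repeated for all six.
  redis { host => "redis1.example.com" data_type => "list" key => "logstash" }
  redis { host => "redis2.example.com" data_type => "list" key => "logstash" }
  # ... and so on for redis3 through redis6
}
```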
I've finished the tests. Unfortunately, using Logstash with RabbitMQ as the output has a large impact on log shipping performance: 5 caching instances which read from log-courier and write to RabbitMQ can receive 4k events/s tops; with a file as the output, 30k. A tool which could replace such a Logstash instance would be very useful :)
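For context, each caching instance is essentially this pipeline (a simplified sketch with made-up paths and hostnames, assuming the logstash-input-courier plugin; the real configs are puppet-managed):

```
input {
  # Receive events from log-courier; certificate paths are placeholders
  courier {
    port            => 5043
    ssl_certificate => "/etc/logstash/courier.crt"
    ssl_key         => "/etc/logstash/courier.key"
  }
}
output {
  # Forward everything to the RabbitMQ cluster; exchange settings illustrative
  rabbitmq {
    host          => "rabbit.example.com"
    exchange      => "logstash"
    exchange_type => "direct"
    key           => "logstash"
  }
}
```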
I will close this issue but I won't forget the discussions. I hope to get some time soon to get out a new revision of things!
I'm testing log-courier 2.0.4 (from the official RPM package) and:
logs from log-courier at debug level: http://pastebin.com/mG6KmfgM (and it just sits there waiting...)
Logstash STDOUT (started with debug): http://pastebin.com/zTAFUuKP
and logs from /var/log/logstash/cache.log: http://pastebin.com/MJJRz4HF
Logstash config: http://pastebin.com/Jxj01y0q
log-courier config: http://pastebin.com/E7YL1znH and a file definition (configuration is handled by puppet, so it is similar for the other files): http://pastebin.com/i2P1tKAn
Next I set Logstash to drop all events; the result was the same (broken pipe, connection reset by peer), but I had to wait longer (2-5 min). After a Logstash restart, log-courier waited some time before reconnecting (as expected) and reported a connection problem. Then, after killing and starting the log-courier process again, the same.
Is support for Redis planned? It would be great if so, because log-courier is in my opinion the best log shipper right now (in comparison to filebeat and beaver), and the lack of Redis support is its biggest disadvantage.