Stopping Logstash on Debian times out - Githubissues

elastic / logstash

Logstash - transport and process your logs, events, or other data

https://www.elastic.co/products/logstash

Other

14.2k stars 3.5k forks source link

Stopping Logstash on Debian times out #6937

Open PSiAU opened 7 years ago

PSiAU commented 7 years ago

Stopping logstash via "service logstash stop" times out with the error "logstash stop failed; still running." after only a few seconds. It appears Logstash is taking longer to shutdown than what the shutdown scripts wait for.

Version: 5.3.0
Operating System: Debian 7.11

jordansissel commented 7 years ago

If Logstash is not shutting down (the default stop timeout in our sysv init script is 5 seconds), this generally indicates Logstash is still busy with work and will terminate only after that work is done.

This is normal, especially if something downstream (elasticsearch, or any other output) is slow or down.

If you wish to kill the logstash process, accepting that you will lose data because of this kill proceedure, you can set KILL_ON_STOP_TIMEOUT=1 in your environment before telling Logstash to stop with the init script.

PSiAU commented 7 years ago

Ok. I can certainly increase the timeout myself, but in the interest of catering to a wider variety of environments, would it make sense to increase this timeout to 10 seconds? 10 works just fine in my setup. for i in 1 2 3 4 5 6 7 8 9 10; do

jordansissel commented 7 years ago

well, I don't know if there's a "good" timeout, and we don't have data to determine what a good timeout would be.

If logstash can't terminate within 5 seconds, it's likely stuck waiting for some external service to respond. I don't have data, but my assumption (without data) is that 5 seconds vs 10 seconds isn't going to show any meaningful difference in shutdown timeout behavior.

If we had data from a wider variety of systems that showed how long Logstash would take to shutdown, we could use that data to determine a good default value.

Can you try 10 seconds in your environment and report back? (Ideally reporting after having made several observations, not just one, but any data is better than none)

jordansissel commented 7 years ago

Alternately, would you prefer a "wait forever" option?

PSiAU commented 7 years ago

Logstash takes about 9 - 10 seconds to shutdown for me. I don't think waiting forever is a good idea though, so perhaps set it to 15 seconds to allow for a little more time.

jordansissel commented 7 years ago

@PSiAU that seems long. I'd like to understand why it takes so long to terminate.

PSiAU commented 7 years ago

This is all I see in my logs during shutdown:

[2017-04-21T16:27:30,341][WARN ][logstash.runner          ] SIGTERM received. Shutting down the agent.
[2017-04-21T16:27:30,350][WARN ][logstash.agent           ] stopping pipeline {:id=>"main"}
[2017-04-21T16:27:38,976][WARN ][logstash.shutdownwatcher ] {"inflight_count"=>0, "stalling_thread_info"=>{}}
[2017-04-21T16:27:38,977][ERROR][logstash.shutdownwatcher ] The shutdown process appears to be stalled due to busy or blocked plugins. Check the logs for more information.

jordansissel commented 7 years ago

Can you run with --log.level=debug ?

jordansissel commented 7 years ago

And can you also attach your logstash configuration? (logstash.yml and any other configs)

PSiAU commented 7 years ago

Ok files attached. Debug output and my configs.

logstashconfigs.tar.gz

logstash shutdown debug output.zip

loopsysdev commented 7 years ago

I'm seeing the same issue on CentOS 7 (LS 1.5.4-1), on an idle test machine. A while kill $pid; do sleep 1; done hasn't been able to kill it for several minutes. Sample strace:

futex(0x7f0137eb99d0, FUTEX_WAIT, 1901, NULL) = ? ERESTARTSYS (To be restarted if SA_RESTART is set)
--- SIGTERM {si_signo=SIGTERM, si_code=SI_USER, si_pid=14664, si_uid=0} ---
futex(0x7f01395c6b00, FUTEX_WAKE_PRIVATE, 1) = 1
rt_sigreturn()                          = 202
futex(0x7f0137eb99d0, FUTEX_WAIT, 1901, NULL) = ? ERESTARTSYS (To be restarted if SA_RESTART is set)
--- SIGTERM {si_signo=SIGTERM, si_code=SI_USER, si_pid=14664, si_uid=0} ---
futex(0x7f01395c6b00, FUTEX_WAKE_PRIVATE, 1) = 1
rt_sigreturn()                          = 202

After stopping the while loop strace shows absolutely no activity. The configuration is otherwise stable and has been used unchanged in production for months.

nh2 commented 6 years ago

This is happening to me too when I'm trying to shut down a container in which both elasticsearch and logstash are running. It hangs until systemd kills it.

I suspect that it is because of this what @jordansissel said:

especially if something downstream (elasticsearch [...]) is [...] down.

I suspect that elasticsearch behaves well and shuts down at sigterm immediately, but then logstash still tries to talk to it, and hangs.

Is that possible?

PSiAU commented 6 years ago

@jordansissel Are you able to increase the timeout to 10 seconds for the moment?

EdgeSync commented 4 years ago

I didn't want to open a new issue for this, since there is already an open one here.

I have the same experience in Logstash 7.2.0, in that if logstash is busy, or stuck doing something, and I send a SIGTERM to it, either via service, or systemctl, they both just hang for 30+ seconds. Even attempting pkill logstash doesn't work.

I can recreate this easily if I shutdown Elasticsearch first, and Logstash continues to try and reconnect to the "dead es instance".

systemctl stop elasticsearch systemctl stop logstash hangs

logstash-plain.log fills with:

[2020-02-28T20:42:59,968][WARN ][logstash.outputs.elasticsearch][main] Attempted to resurrect connection to dead ES instance, but got an error. {:url=>"http://elastic:xxxxxx@localhost:9200/", :error_type=>LogStash::Outputs::ElasticSearch::HttpClient::Pool::HostUnreachableError, :error=>"Elasticsearch Unreachable: [http://elastic:xxxxxx@localhost:9200/][Manticore::SocketException] Connection refused (Connection refused)"}

Eventually it will shut down, I've had it take over 2 minutes before. Possibly some condition in logstash which is not accepting the sigterm because it is doing other tasks?

Anyway, i've learned to live with it, and just shut logstash down first. Just wanted to throw in my two cents.

Kolano commented 4 years ago

This is happening to me too when I'm trying to shut down a container in which both elasticsearch and logstash are running. It hangs until systemd kills it.

I suspect that it is because of this what @jordansissel said:

especially if something downstream (elasticsearch [...]) is [...] down.

I suspect that elasticsearch behaves well and shuts down at sigterm immediately, but then logstash still tries to talk to it, and hangs.

Is that possible?

I had added...

[Unit]
After=elasticsearch.service

...to /etc/systemd/system/logstash.service.d/logstash.conf. Which in theory should force ElasticSearch to await Logstash to such down, and I still intermittently see this issue.

I've since added...

[Service]
TimeoutStopSec=300

...to hopefully force shutdown, as on Ubuntu 20.04 things hang up for a significantly long time >50m for me. The improper shutdown that will cause isn't ideal though.

Majsvaffla commented 3 years ago

I've been facing this issue periodically for months currently with Logstash 7.10 running on Debian buster. At the moment I've been waiting for Logstash to stop for 45 minutes after i ran systemctl stop logstash. I intrerrupted the command and tried to reboot the machine but it doesn't seem to come back up again. Then I issued a reboot from the EC2 console but not until I choose the "Stop instance" action it stopped after some more minutes.